Machine Learning Systems
Vijay Janapa Reddi
Preface i
Global Outreach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . i
Why We Wrote This Book . . . . . . . . . . . . . . . . . . . . . . . . . i
Want to Help Out? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ii
What’s Next? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ii
FRONTMATTER iii
Author’s Note v
Book Changelog xi
Acknowledgements xiii
Funding Agencies and Companies . . . . . . . . . . . . . . . . . . . . xiii
Academic Support . . . . . . . . . . . . . . . . . . . . . . . . . xiii
Non-Profit and Institutional Support . . . . . . . . . . . . . . . xiii
Corporate Support . . . . . . . . . . . . . . . . . . . . . . . . . xiv
Contributors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiv
SocratiQ AI xvii
MAIN xxix
Chapter 1 Introduction 1
1.1 AI Pervasiveness . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 AI and ML Basics . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3 AI Evolution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.3.1 Symbolic AI Era . . . . . . . . . . . . . . . . . . . . . . 5
1.3.2 Expert Systems Era . . . . . . . . . . . . . . . . . . . . . 5
1.3.3 Statistical Learning Era . . . . . . . . . . . . . . . . . . . 6
1.3.4 Shallow Learning Era . . . . . . . . . . . . . . . . . . . 7
1.3.5 Deep Learning Era . . . . . . . . . . . . . . . . . . . . . 8
1.4 ML Systems Engineering . . . . . . . . . . . . . . . . . . . . . . 10
1.5 Defining ML Systems . . . . . . . . . . . . . . . . . . . . . . . . 12
1.6 Lifecycle of ML Systems . . . . . . . . . . . . . . . . . . . . . . 13
1.7 ML Systems in the Wild . . . . . . . . . . . . . . . . . . . . . . 14
1.8 ML Systems Impact on Lifecycle . . . . . . . . . . . . . . . . . . 15
1.8.1 Emerging Trends . . . . . . . . . . . . . . . . . . . . . . 16
1.8.1.1 Application-Level Innovation . . . . . . . . . . 16
1.8.1.2 System Architecture Evolution . . . . . . . . . 16
1.9 Practical Applications . . . . . . . . . . . . . . . . . . . . . . . 17
1.9.1 FarmBeats: ML in Agriculture . . . . . . . . . . . . . . . 17
1.9.1.1 Data Considerations . . . . . . . . . . . . . . . 17
1.9.1.2 Algorithmic Considerations . . . . . . . . . . . 18
1.9.1.3 Infrastructure Considerations . . . . . . . . . . 18
1.9.1.4 Future Implications . . . . . . . . . . . . . . . 19
1.9.2 AlphaFold: Scientific ML . . . . . . . . . . . . . . . . . 19
1.9.2.1 Data Considerations . . . . . . . . . . . . . . . 20
1.9.2.2 Algorithmic Considerations . . . . . . . . . . . 20
1.9.2.3 Infrastructure Considerations . . . . . . . . . . 20
1.9.2.4 Future Implications . . . . . . . . . . . . . . . 21
1.9.3 Autonomous Vehicles and ML . . . . . . . . . . . . . . 21
Chapter 2 ML Systems 27
Purpose . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.2 Cloud-Based Machine Learning . . . . . . . . . . . . . . . . . . 31
2.2.1 Characteristics . . . . . . . . . . . . . . . . . . . . . . . 32
2.2.2 Benefits . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
2.2.3 Challenges . . . . . . . . . . . . . . . . . . . . . . . . . 34
2.2.4 Use Cases . . . . . . . . . . . . . . . . . . . . . . . . . . 36
2.3 Edge Machine Learning . . . . . . . . . . . . . . . . . . . . . . 37
2.3.1 Characteristics . . . . . . . . . . . . . . . . . . . . . . . 37
2.3.2 Benefits . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
2.3.3 Challenges . . . . . . . . . . . . . . . . . . . . . . . . . 38
2.3.4 Use Cases . . . . . . . . . . . . . . . . . . . . . . . . . . 39
2.4 Mobile Machine Learning . . . . . . . . . . . . . . . . . . . . . 39
2.4.1 Characteristics . . . . . . . . . . . . . . . . . . . . . . . 40
2.4.2 Benefits . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
2.4.3 Challenges . . . . . . . . . . . . . . . . . . . . . . . . . 40
2.4.4 Use Cases . . . . . . . . . . . . . . . . . . . . . . . . . . 41
2.5 Tiny Machine Learning . . . . . . . . . . . . . . . . . . . . . . . 42
2.5.1 Characteristics . . . . . . . . . . . . . . . . . . . . . . . 42
2.5.2 Benefits . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
2.5.3 Challenges . . . . . . . . . . . . . . . . . . . . . . . . . 43
2.5.4 Use Cases . . . . . . . . . . . . . . . . . . . . . . . . . . 44
2.6 Hybrid Machine Learning . . . . . . . . . . . . . . . . . . . . . 44
2.6.1 Design Patterns . . . . . . . . . . . . . . . . . . . . . . . 45
2.6.1.1 Train-Serve Split . . . . . . . . . . . . . . . . . 45
2.6.1.2 Hierarchical Processing . . . . . . . . . . . . . 45
2.6.1.3 Progressive Deployment . . . . . . . . . . . . 46
2.6.1.4 Federated Learning . . . . . . . . . . . . . . . 46
2.6.1.5 Collaborative Learning . . . . . . . . . . . . . 46
2.6.2 Real-World Integration . . . . . . . . . . . . . . . . . . . 46
2.7 Shared Principles . . . . . . . . . . . . . . . . . . . . . . . . . . 48
2.7.1 Implementation Layer . . . . . . . . . . . . . . . . . . . 50
2.7.2 System Principles Layer . . . . . . . . . . . . . . . . . . 50
2.7.3 System Considerations Layer . . . . . . . . . . . . . . . 51
2.7.4 Principles to Practice . . . . . . . . . . . . . . . . . . . . 52
Chapter 3 DL Primer 59
Purpose . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
3.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
3.2 The Evolution to Deep Learning . . . . . . . . . . . . . . . . . . 61
3.2.1 Rule-Based Programming . . . . . . . . . . . . . . . . . 61
3.2.2 Classical Machine Learning . . . . . . . . . . . . . . . . 63
3.2.3 Neural Networks and Representation Learning . . . . . 64
3.2.4 Neural System Implications . . . . . . . . . . . . . . . . 65
3.2.4.1 Computation Patterns . . . . . . . . . . . . . . 66
3.2.4.2 Memory Systems . . . . . . . . . . . . . . . . . 66
3.2.4.3 System Scaling . . . . . . . . . . . . . . . . . . 66
3.3 Biological to Artificial Neurons . . . . . . . . . . . . . . . . . . 67
3.3.1 Biological Intelligence . . . . . . . . . . . . . . . . . . . 67
3.3.2 Transition to Artificial Neurons . . . . . . . . . . . . . . 68
3.3.3 Artificial Intelligence . . . . . . . . . . . . . . . . . . . . 69
3.3.4 Computational Translation . . . . . . . . . . . . . . . . 70
3.3.5 System Requirements . . . . . . . . . . . . . . . . . . . 71
3.3.6 Evolution and Impact . . . . . . . . . . . . . . . . . . . 72
3.4 Neural Network Fundamentals . . . . . . . . . . . . . . . . . . 74
3.4.1 Basic Architecture . . . . . . . . . . . . . . . . . . . . . 74
3.4.1.1 Neurons and Activations . . . . . . . . . . . . 74
3.4.1.2 Layers and Connections . . . . . . . . . . . . . 76
3.4.1.3 Data Flow and Transformations . . . . . . . . 76
3.4.2 Weights and Biases . . . . . . . . . . . . . . . . . . . . . 77
3.4.2.1 Weight Matrices . . . . . . . . . . . . . . . . . 77
3.4.2.2 Connection Patterns . . . . . . . . . . . . . . . 78
3.4.2.3 Bias Terms . . . . . . . . . . . . . . . . . . . . 79
3.4.2.4 Parameter Organization . . . . . . . . . . . . . 79
3.4.3 Network Topology . . . . . . . . . . . . . . . . . . . . . 79
3.4.3.1 Basic Structure . . . . . . . . . . . . . . . . . . 79
3.4.3.2 Design Trade-offs . . . . . . . . . . . . . . . . 80
3.4.3.3 Connection Patterns . . . . . . . . . . . . . . . 81
3.4.3.4 Parameter Considerations . . . . . . . . . . . . 82
3.5 Learning Process . . . . . . . . . . . . . . . . . . . . . . . . . . 83
3.5.1 Training Overview . . . . . . . . . . . . . . . . . . . . . 83
3.5.2 Forward Propagation . . . . . . . . . . . . . . . . . . . 84
3.5.2.1 Layer Computation . . . . . . . . . . . . . . . 84
3.5.2.2 Mathematical Representation . . . . . . . . . . 85
3.5.2.3 Computational Process . . . . . . . . . . . . . 86
3.5.2.4 Practical Considerations . . . . . . . . . . . . . 86
3.5.3 Loss Functions . . . . . . . . . . . . . . . . . . . . . . . 87
3.5.3.1 Basic Concepts . . . . . . . . . . . . . . . . . . 88
3.5.3.2 Classification Losses . . . . . . . . . . . . . . . 88
LABS 1087
Overview 1089
Learning Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . .1089
Target Audience . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .1089
Supported Devices . . . . . . . . . . . . . . . . . . . . . . . . . . . . .1090
Lab Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .1090
Recommended Lab Sequence . . . . . . . . . . . . . . . . . . . . . . .1091
Troubleshooting and Support . . . . . . . . . . . . . . . . . . . . . . .1091
Credits . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .1091
Setup 1099
Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .1099
Hardware . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .1100
Two Parallel Cores . . . . . . . . . . . . . . . . . . . . . . . . .1100
Memory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .1100
Sensors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .1101
Arduino IDE Installation . . . . . . . . . . . . . . . . . . . . . . . . .1101
Testing the Microphone . . . . . . . . . . . . . . . . . . . . . .1102
Testing the IMU . . . . . . . . . . . . . . . . . . . . . . . . . . .1102
Testing the ToF (Time of Flight) Sensor . . . . . . . . . . . . . .1103
Testing the Camera . . . . . . . . . . . . . . . . . . . . . . . . .1104
Installing the OpenMV IDE . . . . . . . . . . . . . . . . . . . . . . . .1105
Updating the Bootloader . . . . . . . . . . . . . . . . . .1106
Installing the Firmware . . . . . . . . . . . . . . . . . .1106
Testing the Camera . . . . . . . . . . . . . . . . . . . . .1109
Connecting the Nicla Vision to Edge Impulse Studio . . . . . . . . . .1110
Expanding the Nicla Vision Board (optional) . . . . . . . . . . . . . .1112
Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .1116
Resources . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .1116
Setup 1205
Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .1205
Installing the XIAO ESP32S3 Sense on Arduino IDE . . . . . . . . . .1207
Testing the board with BLINK . . . . . . . . . . . . . . . . . . . . . .1208
Connecting Sense module (Expansion Board) . . . . . . . . . . . . . .1209
Microphone Test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .1209
Testing the Camera . . . . . . . . . . . . . . . . . . . . . . . . . . . . .1212
Testing WiFi . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .1212
Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .1218
Resources . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .1218
Raspberry Pi 1367
Pre-requisites . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .1367
Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .1368
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .1368
Setup 1369
Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .1370
Key Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . .1370
Raspberry Pi Models (covered in this book) . . . . . . . . . . .1370
Engineering Applications . . . . . . . . . . . . . . . . . . . . .1370
Hardware Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . .1371
Raspberry Pi Zero 2W . . . . . . . . . . . . . . . . . . . . . . .1371
Raspberry Pi 5 . . . . . . . . . . . . . . . . . . . . . . . . . . . .1372
Installing the Operating System . . . . . . . . . . . . . . . . . . . . .1372
The Operating System (OS) . . . . . . . . . . . . . . . . . . . .1372
Installation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .1373
Initial Configuration . . . . . . . . . . . . . . . . . . . . . . . .1375
Remote Access . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .1375
SSH Access . . . . . . . . . . . . . . . . . . . . . . . . . . . . .1375
To shut down the Raspi via terminal: . . . . . . . . . . . . . . .1376
Transfer Files between the Raspi and a computer . . . . . . . .1377
Using Secure Copy Protocol (scp): . . . . . . . . . . . .1377
Transferring files using FTP . . . . . . . . . . . . . . . .1379
Increasing SWAP Memory . . . . . . . . . . . . . . . . . . . . . . . .1379
Installing a Camera . . . . . . . . . . . . . . . . . . . . . . . . . . . .1381
Installing a USB WebCam . . . . . . . . . . . . . . . . . . . . .1381
Video Streaming . . . . . . . . . . . . . . . . . . . . . .1384
Installing a Camera Module on the CSI port . . . . . . . . . . .1385
Running the Raspi Desktop remotely . . . . . . . . . . . . . . . . . .1388
Updating and Installing Software . . . . . . . . . . . . . . . . . . . .1391
Model-Specific Considerations . . . . . . . . . . . . . . . . . . . . . .1392
Raspberry Pi Zero (Raspi-Zero) . . . . . . . . . . . . . . . . . .1392
Raspberry Pi 4 or 5 (Raspi-4 or Raspi-5) . . . . . . . . . . . . .1392
APPENDIX 1619
PhD Survival Guide 1621
REFERENCES 1625
References 1627
Preface
Global Outreach
Thank you to all our readers and visitors. Your engagement with the material
keeps us motivated.
Want to Help Out?
What’s Next?
If you’re ready to dive deeper into the book’s structure, learning objectives, and
practical use, visit the About the Book section for more details.
FRONTMATTER
Author’s Note
AI is bound to transform the world in profound ways, much like computers and
the Internet revolutionized every aspect of society in the 20th century. From
systems that generate creative content to those driving breakthroughs in drug
discovery, AI is ushering in a new era—one that promises to be even more
transformative in its scope and impact. But how do we make it accessible to
everyone?
With its transformative power comes an equally great responsibility for those
who access it or work with it. Just as we expect companies to wield their
influence ethically, those of us in academia bear a parallel responsibility: to
share our knowledge openly, so it benefits everyone—not just a select few. This
conviction inspired the creation of this book—an open-source resource aimed
at making AI education, particularly in AI engineering and systems, inclusive
and accessible to everyone from all walks of life.
My passion for creating, curating, and editing this content has been deeply
influenced by landmark textbooks that have profoundly shaped both my aca-
demic and personal journey. Whether I studied them cover to cover or drew
insights from key passages, these resources fundamentally shaped the way
I think. I reflect on the books that guided my path: works by Turing Award
winners such as David Patterson and John Hennessy—pioneers in computer
architecture and system design—and foundational research papers by luminar-
ies like Yann LeCun, Geoffrey Hinton, and Yoshua Bengio. In some small part,
my hope is that this book will inspire students to chart their own unique paths.
I am optimistic about what lies ahead for AI. It has the potential to solve global
challenges and unlock creativity in ways we have yet to imagine. To achieve this,
however, we must train the next generation of AI engineers and practitioners—
those who can transform novel AI algorithms into working systems that enable
real-world application. This book is a step toward curating the material needed
to build the next generation of AI engineers who will transform today’s visions
into tomorrow’s reality.
This book is a work in progress, but knowing that even one learner benefits
from its content motivates me to continually refine and expand it. To that end, if
there’s one thing I ask of readers, it’s this: please show your support by starring
the GitHub repository here. Your star reflects your belief in this mission—not
just to me, but to the growing global community of learners, educators, and
practitioners. This small act is more than symbolic—it amplifies the importance
of making AI education accessible.
I am a student of my own writing, and every chapter of this book has taught
me something new—thanks to the numerous people who have played, and
continue to play, an important role in shaping this work. Professors, students,
practitioners, and researchers contributed by offering suggestions, sharing ex-
pertise, identifying errors, and proposing improvements. Every interaction,
whether a detailed critique or a simple correction from a GitHub contributor,
has been a lesson in itself. These contributions have not only refined the material
but also deepened my understanding of how knowledge grows through col-
laboration. This book is, therefore, not solely my work; it is a shared endeavor,
reflecting the collective spirit of those dedicated to sharing their knowledge
and effort.
This book is dedicated to the loving memory of my father. His passion for
education, endless curiosity, generosity in sharing knowledge, and unwavering
commitment to quality challenge me daily to strive for excellence in all I do. In
his honor, I extend this dedication to teachers and mentors everywhere, whose
efforts and guidance transform lives every day. Your selfless contributions
remind me to persevere.
Last but certainly not least, this work would not be possible without the
unwavering support of my wonderful wife and children. Their love, patience,
and encouragement form the foundation that enables me to pursue my passion
and bring this work to life. For this, and so much more, I am deeply grateful.
— Prof. Vijay Janapa Reddi
About the Book
Overview
Purpose of the Book
Welcome to this collaborative textbook. It originated as part of the CS249r:
Tiny Machine Learning course that Prof. Vijay Janapa Reddi teaches at Harvard
University.
The goal of this book is to provide a resource for educators and learners
seeking to understand the principles and practices of machine learning systems.
This book is continually updated to incorporate the latest insights and effective
teaching strategies with the intent that it remains a valuable resource in this
fast-evolving field. So please check back often!
What to Expect
This textbook explores the foundational principles, practical workflows, and
critical challenges of building and deploying machine learning systems. Starting
with foundational concepts, it progresses through engineering principles,
examines operational considerations for deploying AI systems, and concludes
by reflecting on the societal and technological implications of machine learning.
Learning Goals
Key Learning Outcomes
This book is structured with Bloom’s Taxonomy in mind, which defines six
levels of learning, ranging from foundational knowledge to advanced creative
thinking:
Learning Objectives
This book supports readers in:
1. Understanding Fundamentals: Explain the foundational principles of
machine learning, including theoretical underpinnings and practical ap-
plications.
2. Analyzing System Components: Evaluate the critical components of AI
systems and their roles within various architectures.
3. Designing Workflows: Outline workflows for developing machine learn-
ing systems, from data collection to deployment.
4. Optimizing Models: Apply methods to enhance performance, such as
hyperparameter tuning and regularization.
5. Evaluating Ethical Implications: Analyze societal impacts and address
potential biases in AI systems.
AI Learning Companion
Throughout this resource, you’ll find SocratiQ—an AI learning assistant de-
signed to enhance your learning experience. Inspired by the Socratic method
of teaching, SocratiQ combines interactive quizzes, personalized assistance,
and real-time feedback to help you reinforce your understanding and create
new connections. As part of our experiment with Generative AI technologies,
SocratiQ encourages critical thinking and active engagement with the material.
SocratiQ is still a work in progress, and we welcome your feedback to make
it better. For more details about how SocratiQ works and how to get the most
out of it, visit the AI Learning Companion page.
Modular Design
The book is modular, allowing readers to explore chapters independently or
sequentially. Each chapter includes supplementary resources:
Acknowledgements
This book, inspired by the TinyML edX course and CS249r at Harvard Univer-
sity, is the result of years of hard work and collaboration with many students,
researchers and practitioners. We are deeply indebted to the folks whose
groundbreaking work laid its foundation.
As our understanding of machine learning systems deepened, we realized
that fundamental principles apply across scales, from tiny embedded systems to
large-scale deployments. This realization shaped the book’s expansion into an
exploration of machine learning systems with the aim of providing a foundation
applicable across the spectrum of implementations.
Corporate Support
The following companies contributed hardware kits used for the labs in this
book and/or supported the development of hands-on educational materials:
Contributors
We express our sincere gratitude to the open-source community of learners,
educators, and contributors. Each contribution, whether a chapter section or a
single-word correction, has significantly enhanced the quality of this resource.
We also acknowledge those who have shared insights, identified issues, and
provided valuable feedback behind the scenes.
A comprehensive list of all GitHub contributors, automatically updated with
each new contribution, is available below. For those interested in contributing
further, please consult our GitHub page for more information.
Vijay Janapa Reddi
jasonjabbour
Ikechukwu Uchendu
Zeljko Hrcek
Kai Kleinbard
Naeem Khoshnevis
Marcelo Rovai
Sara Khosravi
Douwe den Blanken
shanzehbatool
Elias
Jared Ping
Jeffrey Ma
Itai Shapira
Maximilian Lam
Jayson Lin
Sophia Cho
Andrea
Alex Rodriguez
Korneel Van den Berghe
Zishen Wan
Colby Banbury
Mark Mazumder
Divya Amirtharaj
Abdulrahman Mahmoud
Srivatsan Krishnan
marin-llobet
Emeka Ezike
Aghyad Deeb
Haoran Qiu
Aditi Raju
ELSuitorHarvard
Emil Njor
Michael Schnebly
Jared Ni
oishib
Yu-Shun Hsiao
Jae-Won Chung
Henry Bae
Jennifer Zhou
Arya Tschand
Eura Nofshin
Pong Trairatvorakul
Matthew Stewart
Marco Zennaro
Andrew Bass
Shvetank Prakash
Fin Amin
Allen-Kuang
Gauri Jain
gnodipac886
The Random DIY
Bruno Scaglione
Fatima Shah
Sercan Aygün
Alex Oesterling
Baldassarre Cesarano
Abenezer
TheHiddenLayer
abigailswallow
yanjingl
happyappledog
Yang Zhou
Aritra Ghosh
Andy Cheng
Bilge Acun
Jessica Quaye
Jason Yik
Emmanuel Rassou
Shreya Johri
Sonia Murthy
Vijay Edupuganti
Costin-Andrei Oncescu
Annie Laurie Cook
Jothi Ramaswamy
Batur Arslan
Curren Iyer
Fatima Shah
Edward Jin
a-saraf
songhan
Zishen
SocratiQ AI
AI Learning Companion
Welcome to SocratiQ (pronounced "Socratic"), an AI learning assistant seam-
lessly integrated throughout this resource. Inspired by the Socratic method of
teaching—emphasizing thoughtful questions and answers to stimulate critical
thinking—SocratiQ is part of our experiment with what we call Generative
Learning. By combining interactive quizzes, personalized assistance, and real-
time feedback, SocratiQ is meant to reinforce your understanding and help you
create new connections. SocratiQ is still a work in progress, and we welcome your
feedback.
Learn more: Read our research paper on SocratiQ’s design and pedagogy
here.
Listen to this AI-generated podcast about SocratiQ here.
You can enable SocratiQ by clicking the button below:
Warning
Once you’ve enabled SocratiQ it will always be available when you visit this
site.
You can access SocratiQ at any time using a keyboard shortcut shown in
Figure 0.2, which brings up the interface shown in Figure 0.3.
Button Overview
The top nav bar provides quick access to the following features:
These quizzes are conveniently inserted at the end of every major subsection
(e.g., 1.1, 1.2, 1.3, and so on), as illustrated in Figure 0.7.
Each quiz typically consists of 3-5 multiple-choice questions and takes only
1-2 minutes to complete. These questions are designed to assess your un-
derstanding of the material covered in the preceding section, as shown in
Figure 0.8a.
When you encounter challenging concepts, SocratiQ offers two powerful ways
to get help. First, you can select any text from the textbook and ask for a detailed
explanation, as demonstrated in Figure 0.9.
Once you’ve selected the text, you can ask questions about it, and SocratiQ will
provide detailed explanations based on that context, as illustrated in Figure 0.10.
Figure 0.12 shows the response to the question asked in Figure 0.10.
You can also reference sections (as shown in Figure 0.11), subsections, and
keywords directly as you converse with SocratiQ. Use the @ symbol to reference
a section, subsection, or keyword, or click the + Context button right above
the input.
As you continue to engage with the material and complete quizzes, you’ll
earn various badges that recognize your progress, as shown in Figure 0.15.
Achievement Badges
As you progress through the quizzes, you’ll earn special badges to mark your
achievements! Here’s what you can earn:
Tip
Keep taking quizzes to collect all badges and improve your learning
journey! Your current badges will appear in the quiz statistics dashboard.
If you’d like a record of your progress you can generate a PDF report. It
will show your progress, average performance and all the questions you’ve
attempted. The PDF is generated with a unique hash and can be uniquely
validated.
Data Storage
Important
You can also delete all of your saved conversations by clicking the New Chat
button in the nav bar.
Technical Requirements
To use SocratiQ effectively, you’ll need:
• Chrome or Safari browser
• JavaScript enabled
• Stable internet connection
Providing Feedback
Your feedback helps us improve SocratiQ.
Chapter 1
Introduction
1.1 AI Pervasiveness
Artificial Intelligence (AI) has emerged as one of the most transformative forces
in human history. From the moment we wake up to when we go to sleep, AI
systems invisibly shape our world. They manage traffic flows in our cities, opti-
mize power distribution across electrical grids, and enable billions of wireless
devices to communicate seamlessly. In hospitals, AI analyzes medical images
and helps doctors diagnose diseases. In research laboratories, it accelerates
scientific discovery by simulating molecular interactions and processing vast
datasets from particle accelerators. In space exploration, it helps rovers navigate
distant planets and telescopes detect new celestial phenomena.
Throughout history, certain technologies have fundamentally transformed
human civilization, defining their eras. The 18th and 19th centuries were
shaped by the Industrial Revolution, where steam power and mechanization
transformed how humans could harness physical energy. The 20th century was
defined by the Digital Revolution, where the computer and internet transformed
how we process and share information. Now, the 21st century appears to be the
era of Artificial Intelligence, a shift noted by leading thinkers in technological
evolution (Brynjolfsson and McAfee 2014; Domingos 2016).
The vision driving AI development extends far beyond the practical appli-
cations we see today. We aspire to create systems that can work alongside
humanity, enhancing our problem-solving capabilities and accelerating scien-
tific progress. Imagine AI systems that could help us understand consciousness,
decode the complexities of biological systems, or unravel the mysteries of dark
matter. Consider the potential of AI to help address global challenges like
climate change, disease, or sustainable energy production. This is not just
about automation or efficiency—it's about expanding the boundaries of human
knowledge and capability.
The impact of this revolution operates at multiple scales, each with pro-
found implications. At the individual level, AI personalizes our experiences
and augments our daily decision-making capabilities. At the organizational
level, it transforms how businesses operate and how research institutions make
discoveries. At the societal level, it reshapes everything from transportation
systems to healthcare delivery. At the global level, it offers new approaches
to addressing humanity’s greatest challenges, from climate change to drug
discovery.
What makes this transformation unique is its unprecedented pace. While the
Industrial Revolution unfolded over centuries and the Digital Revolution over
decades, AI capabilities are advancing at an extraordinary rate. Technologies
that seemed impossible just years ago, including systems that can understand
human speech, generate novel ideas, or make complex decisions, are now
commonplace. This acceleration suggests we are only beginning to understand
how profoundly AI will reshape our world.
We stand at a historic inflection point. Just as the Industrial Revolution
required us to master mechanical engineering to harness the power of steam
and machinery, and the Digital Revolution demanded expertise in electrical
and computer engineering to build the internet age, the AI Revolution presents
us with a new engineering challenge. We must learn to build systems that
can learn, reason, and potentially achieve superhuman capabilities in specific
domains.
1.3 AI Evolution
The evolution of AI, depicted in the timeline shown in Figure 1.2, highlights
key milestones such as the development of the perceptron² in 1957 by Frank
Rosenblatt, a foundational element for modern neural networks. Imagine
walking into a computer lab in 1965. You'd find room-sized mainframes running
programs that could prove basic mathematical theorems or play simple games
like tic-tac-toe. These early artificial intelligence systems, while groundbreaking
for their time, were a far cry from today's machine learning systems that can
detect cancer in medical images or understand human speech. The timeline
shows the progression from early innovations like the ELIZA chatbot in 1966, to
significant breakthroughs such as IBM's Deep Blue defeating chess champion
Garry Kasparov in 1997. More recent advancements include the introduction
of OpenAI's GPT-3 in 2020 and GPT-4 in 2023, demonstrating the dramatic
evolution and increasing complexity of AI systems over the decades.

² Perceptron: The first artificial neural network—a simple model that could
learn to classify visual patterns, similar to a single neuron making a yes/no
decision based on its inputs.
By the mid-1970s, researchers realized that general AI was too ambitious. In-
stead, they focused on capturing human expert knowledge in specific domains.
MYCIN, developed at Stanford, was one of the first large-scale expert systems
designed to diagnose blood infections.
While MYCIN represented a major advance in medical AI with its 600 expert
rules for diagnosing blood infections, it revealed fundamental challenges that
still plague ML today. Getting domain knowledge from human experts and
converting it into precise rules proved incredibly time-consuming and difficult—
doctors often couldn’t explain exactly how they made decisions. MYCIN strug-
gled with uncertain or incomplete information, unlike human doctors who
could make educated guesses. Perhaps most importantly, maintaining and
updating the rule base became exponentially more complex as MYCIN grew, as
adding new rules frequently conflicted with existing ones, while medical knowl-
edge itself continued to evolve. These same challenges of knowledge capture,
uncertainty handling, and maintenance remain central concerns in modern
machine learning, even though we now use different technical approaches to
address them.
Statistical (1990s):
P(spam|word) = (frequency in spam emails) / (total frequency)
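In code, such an estimate is only a few lines. The following sketch is purely
illustrative rather than drawn from any particular spam filter; the toy corpora
and the add-one smoothing are assumptions made for the example.

# Minimal sketch of the statistical approach above: estimating P(spam | word)
# from word-frequency counts. The toy corpora and add-one smoothing are
# assumptions made purely for illustration.
from collections import Counter

spam_emails = ["win free money now", "claim your free prize"]
ham_emails = ["team meeting at noon", "project status update"]

spam_counts = Counter(w for email in spam_emails for w in email.split())
ham_counts = Counter(w for email in ham_emails for w in email.split())

def p_spam_given_word(word: str) -> float:
    """P(spam | word) = frequency in spam / total frequency (with smoothing)."""
    spam_f = spam_counts[word] + 1
    total_f = spam_counts[word] + ham_counts[word] + 2
    return spam_f / total_f

print(p_spam_given_word("free"))     # high: strong spam indicator
print(p_spam_given_word("meeting"))  # low: strong ham indicator

A complete statistical filter would combine per-word estimates across an entire
message, but the core shift is already visible here: probabilities estimated from
data replace hand-written rules.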
The table serves as a bridge between the early approaches we’ve discussed
and the more recent developments in shallow and deep learning that we’ll
explore next. It sets the stage for understanding why certain approaches gained
prominence in different eras and how each new paradigm built upon and
addressed the limitations of its predecessors. Moreover, it illustrates how the
strengths of earlier approaches continue to influence and enhance modern AI
techniques, particularly in the era of foundation models.
from the work of crews below them. In contrast, shallow learning typically had
just one or two levels of processing, similar to having just a foundation crew
and a framing crew.
During this time, several powerful algorithms dominated the machine learn-
ing landscape. Each brought unique strengths to different problems: Decision
trees provided interpretable results by making choices much like a flowchart.
K-nearest neighbors made predictions by finding similar examples in past data,
like asking your most experienced neighbors for advice. Linear and logistic
regression offered straightforward, interpretable models that worked well for
many real-world problems. Support Vector Machines (SVMs) excelled at find-
ing complex boundaries between categories using the “kernel trick”—imagine
being able to untangle a bowl of spaghetti into straight lines by lifting it into a
higher dimension. These algorithms formed the foundation of practical ma-
chine learning.
Consider a typical computer vision solution from 2005:
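One plausible shape for such a pipeline—a sketch under assumptions, not a
definitive implementation—pairs hand-engineered HOG descriptors with a linear
Support Vector Machine. The dataset loading is deliberately left abstract, since
no specific image corpus is assumed.

# Sketch of a mid-2000s-style vision pipeline: hand-engineered HOG features
# feeding a shallow linear classifier. Images are assumed to be equal-sized
# grayscale arrays supplied by some dataset loader; none is prescribed here.
import numpy as np
from skimage.feature import hog
from sklearn.svm import LinearSVC

def extract_features(images):
    # Hand-engineered step: histogram-of-oriented-gradients descriptors,
    # computed per image with fixed, human-chosen parameters.
    return np.array([
        hog(img, orientations=9, pixels_per_cell=(8, 8), cells_per_block=(2, 2))
        for img in images
    ])

def train_and_evaluate(train_images, train_labels, test_images, test_labels):
    # Statistical learning step: a linear SVM on top of the fixed features.
    clf = LinearSVC()
    clf.fit(extract_features(train_images), train_labels)
    return clf, clf.score(extract_features(test_images), test_labels)

The key point is the division of labor: a human chooses the feature extractor
and its parameters, and the learning algorithm only fits a shallow decision
boundary on top of those fixed features.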
What made this era distinct was its hybrid approach: human-engineered
features combined with statistical learning. They had strong mathematical
foundations (researchers could prove why they worked). They performed well
even with limited data. They were computationally efficient. They produced
reliable, reproducible results.
Take the example of face detection, where the Viola-Jones algorithm (2001)
achieved real-time performance using simple rectangular features and a cascade
of classifiers. This algorithm powered digital camera face detection for nearly a
decade.
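A minimal sketch of this approach, using the Haar cascade bundled with OpenCV,
which implements the Viola-Jones detector; the image path is a placeholder.

# Minimal sketch: running a Viola-Jones style Haar cascade with OpenCV.
# The cascade file ships with opencv-python; the image path is a placeholder.
import cv2

def detect_faces(image_path: str):
    cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
    )
    image = cv2.imread(image_path)
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    # The cascade scans windows at multiple scales and rejects most of them
    # early, which is what made real-time detection feasible on 2000s hardware.
    return cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)

if __name__ == "__main__":
    faces = detect_faces("photo.jpg")  # placeholder path
    print(f"Detected {len(faces)} face(s)")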
The success of AlexNet wasn't just a technical achievement; it was a watershed
moment that demonstrated the practical viability of deep learning. It showed
that with sufficient data, computational power, and architectural innovations,
deep neural networks could outperform the shallow learning methods that had
dominated the field for decades. This single result triggered an explosion of
research and applications in deep learning that continues to this day.

AlexNet: A convolutional neural network from 2012 that won the ImageNet
competition by a large margin and helped spark the deep learning revolution.
The field had to wait for the convergence of big data, better computing hardware,
and algorithmic breakthroughs before deep learning's potential could be
unlocked. This long gestation period helps explain why the 2012 ImageNet
moment was less a sudden revolution and more the culmination of decades
of accumulated research finally finding its moment. As we'll explore in the
following sections, this evolution has led to two significant developments in
the field. First, it has given rise to the field of machine learning systems
engineering, a discipline that teaches how to bridge the gap between theoretical
advancements and practical implementation. Second, it has necessitated a more
comprehensive definition of machine learning systems, one that encompasses
not just algorithms, but also data and computing infrastructure. Today's
challenges of scale echo many of the same fundamental questions about
computation, data, and learning methods that researchers have grappled with
since the field's inception, but now within a more complex and interconnected
framework.

Video: Convolutional Net Demo (available on YouTube).
As AI progressed from symbolic reasoning to statistical learning and deep
learning, its applications became increasingly ambitious and complex. This
growth introduced challenges that extended beyond algorithms, necessitating a
new focus: engineering entire systems capable of deploying and sustaining AI at
scale. This gave rise to the discipline of Machine Learning Systems Engineering.
Let’s consider space exploration. While astronauts venture into new frontiers
and explore the vast unknowns of the universe, their discoveries are only
possible because of the complex engineering systems supporting them, such as
the rockets that lift them into space, the life support systems that keep them
alive, and the communication networks that keep them connected to Earth.
Similarly, while AI researchers push the boundaries of what’s possible with
learning algorithms, their breakthroughs only become practical reality through
careful systems engineering. Modern AI systems need robust infrastructure
to collect and manage data, powerful computing systems to train models, and
reliable deployment platforms to serve millions of users.
This emergence of machine learning systems engineering as an important
discipline reflects a broader reality: turning AI algorithms into real-world
systems requires bridging the gap between theoretical possibilities and practical
implementation. It’s not enough to have a brilliant algorithm if you can’t
efficiently collect and process the data it needs, distribute its computation
The core of any machine learning system consists of three interrelated compo-
nents, as illustrated in Figure 1.4: Models/Algorithms, Data, and Computing
Infrastructure. These components form a triangular dependency where each
element fundamentally shapes the possibilities of the others. The model archi-
tecture dictates both the computational demands for training and inference, as
well as the volume and structure of data required for effective learning. The
data’s scale and complexity influence what infrastructure is needed for storage
and processing, while simultaneously determining which model architectures
are feasible. The infrastructure capabilities establish practical limits on both
model scale and data processing capacity, creating a framework within which
the other components must operate.
Each of these components serves a distinct but interconnected purpose:
Unlike traditional software, whose behavior is defined by explicit program
logic, machine learning systems derive their behavior from patterns in data.
This shift from code to data as the primary driver of system behavior introduces
new complexities.
As illustrated in Figure 1.5, the ML lifecycle consists of interconnected stages
from data collection through model monitoring, with feedback loops for contin-
uous improvement when performance degrades or models need enhancement.
Figure 1.5: Data Collection → Model Training → Model Monitoring, with
"Needs Improvement" and "Performance Degrades" feedback loops.
Unlike source code, which changes only when developers modify it, data
reflects the dynamic nature of the real world. Changes in data distributions can
silently alter system behavior. Traditional software engineering tools, designed
for deterministic code-based systems, prove insufficient for managing these
data-dependent systems. For example, version control systems that excel at
tracking discrete code changes struggle to manage large, evolving datasets.
Testing frameworks designed for deterministic outputs must be adapted for
probabilistic predictions. This data-dependent nature creates a more dynamic
lifecycle, requiring continuous monitoring and adaptation to maintain system
relevance as real-world data patterns evolve.
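The sketch below illustrates such a monitoring-and-adaptation loop. It is a
conceptual outline, not a production pattern: the collect, drift_score, retrain,
and deploy hooks are hypothetical, and the hourly cadence and drift threshold
are assumptions.

# Illustrative sketch of the lifecycle feedback loop described above. The
# collect, drift_score, retrain, and deploy callables are hypothetical hooks
# supplied by the surrounding system, not functions from a specific library.
import time

DRIFT_THRESHOLD = 0.2  # assumed threshold for triggering retraining

def monitor_and_adapt(model, reference_data, collect, drift_score, retrain, deploy):
    """Compare live data against a reference window and retrain whenever the
    measured drift exceeds the threshold."""
    while True:
        live_batch = collect()                       # fresh production data
        if drift_score(reference_data, live_batch) > DRIFT_THRESHOLD:
            model = retrain(model, live_batch)       # close the feedback loop
            deploy(model)
            reference_data = live_batch              # new baseline
        time.sleep(3600)                             # re-check hourly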
Understanding the machine learning system lifecycle requires examining its
distinct stages. Each stage presents unique requirements from both learning
and infrastructure perspectives. This dual consideration, of learning needs and
systems support, is vitally important for building effective machine learning
systems.
However, the various stages of the ML lifecycle in production are not isolated;
they are, in fact, deeply interconnected. This interconnectedness can create
either virtuous or vicious cycles. In a virtuous cycle, high-quality data enables
effective learning, robust infrastructure supports efficient processing, and well-
engineered systems facilitate the collection of even better data. However, in a
vicious cycle, poor data quality undermines learning, inadequate infrastructure
hampers processing, and system limitations prevent the improvement of data
collection—each problem compounds the others.
The model development process involves many choices: How should we structure the model? How long
should we train it? How can we tell if it’s learning the right things? Making
these decisions often requires both technical expertise and considerable trial
and error.
A particularly important challenge is ensuring that models work well in
real-world conditions. A model might perform excellently on its training data
but fail when faced with slightly different situations in the real world. This gap
between training performance and real-world performance is a central challenge
in machine learning, especially for critical applications like autonomous vehicles
or medical diagnosis systems.
Chapter 2
ML Systems
Purpose
How do the diverse environments where machine learning operates shape the funda-
mental nature of these systems, and what drives their widespread deployment across
computing platforms?
The deployment of machine learning systems across varied computing envi-
ronments reveals essential insights into the relationship between theoretical
principles and practical implementation. Each computing environment, from
large-scale distributed systems to resource-constrained devices, introduces
distinct requirements that influence both system architecture and algorithmic
approaches. Understanding these relationships reveals core engineering princi-
ples that govern the design of machine learning systems. This understanding
provides a foundation for examining how theoretical concepts translate into
practical implementations, and how system designs adapt to meet diverse
computational, memory, and energy constraints.
Learning Objectives
2.1 Overview
Modern machine learning systems span a spectrum of deployment options,
each with its own set of characteristics and use cases. At one end, we have
cloud-based ML, which leverages powerful centralized computing resources
for complex, data-intensive tasks. Moving along the spectrum, we encounter
edge ML, which brings computation closer to the data source for reduced
latency and improved privacy. Mobile ML further extends these capabilities to
smartphones and tablets, while at the far end, we find Tiny ML, which enables
machine learning on extremely low-power devices with severe memory and
processing constraints.
This spectrum of deployment can be visualized like Earth’s geological fea-
tures, each operating at different scales in our computational landscape. Cloud
ML systems operate like continents, processing vast amounts of data across
interconnected centers; Edge ML exists where these continental powers meet
the sea, creating dynamic coastlines where computation flows into local waters;
Mobile ML moves through these waters like ocean currents, carrying comput-
ing power across the digital seas; and where these currents meet the physical
world, TinyML systems rise like islands, each a precise point of intelligence in
the vast computational ocean.
Figure 2.2 illustrates the spectrum of distributed intelligence across these
approaches, providing a visual comparison of their characteristics. We will
examine the unique characteristics, advantages, and challenges of each ap-
proach, as depicted in the figure. Additionally, we will discuss the emerging
trends and technologies that are shaping the future of machine learning de-
ployment, considering how they might influence the balance between these
three paradigms.
To better understand the dramatic differences between these ML deployment
options, Table 2.1 provides examples of representative hardware platforms
for each category. These examples illustrate the vast range of computational
resources, power requirements, and cost considerations across the ML sys-
tems spectrum. As we explore each paradigm in detail, you can refer back to
Figure 2.2: The spectrum of distributed intelligence across Cloud AI, Edge AI,
and TinyML. Source: ABI Research.
models, and is ideal for applications where real-time responsiveness isn’t critical.
Popular platforms like AWS SageMaker, Google Cloud AI, and Azure ML offer
flexible, scalable solutions for model development, training, and deployment.
Cloud ML can handle models with billions of parameters, training on petabytes
of data, but may incur latencies of 100-500 ms for online inference due to network
delays.
Edge ML: As the need for real-time, low-latency processing grew, Edge
ML emerged. This paradigm brings inference capabilities closer to the data
source, typically on edge devices such as industrial gateways, smart cameras,
autonomous vehicles, or IoT hubs. Edge ML reduces latency (often to less than
50 ms), enhances privacy by keeping data local, and can operate with inter-
mittent cloud connectivity. It’s particularly useful for applications requiring
quick responses or handling sensitive data in industrial or enterprise settings.
Frameworks like NVIDIA Jetson or Google’s Edge TPU enable powerful ML
capabilities on edge devices. Edge ML plays a crucial role in IoT ecosystems, en-
abling real-time decision making and reducing bandwidth usage by processing
data locally.
Mobile ML: Building on edge computing concepts, Mobile ML focuses on
leveraging the computational capabilities of smartphones and tablets. This
approach enables personalized, responsive applications while reducing reliance
on constant network connectivity. Mobile ML offers a balance between the
power of edge computing and the ubiquity of personal devices. It utilizes on-
device sensors (e.g., cameras, GPS, accelerometers) for unique ML applications.
Frameworks like TensorFlow Lite and Core ML allow developers to deploy
optimized models on mobile devices, with inference times often under 30 ms
for common tasks. Mobile ML enhances privacy by keeping personal data on
the device and can operate offline, but must balance model performance with
device resource constraints (typically 4-8 GB RAM, 100-200 GB storage).
Tiny ML: The latest development in this progression is Tiny ML, which
enables ML models to run on extremely resource-constrained microcontrollers
and small embedded systems. Tiny ML allows for on-device inference without
relying on connectivity to the cloud, edge, or even the processing power of
mobile devices. This approach is crucial for applications where size, power
consumption, and cost are critical factors. Tiny ML devices typically operate
with less than 1 MB of RAM and flash memory, consuming only milliwatts of
power, enabling battery life of months or years. Applications include wake
word detection, gesture recognition, and predictive maintenance in industrial
settings. Platforms like Arduino Nano 33 BLE Sense and STM32 microcon-
trollers, coupled with frameworks like TensorFlow Lite for Microcontrollers,
enable ML on these tiny devices. However, Tiny ML requires significant model
optimization and quantization⁰ to fit within these constraints.

⁰ Quantization: Process of reducing the numerical precision of ML model
parameters to reduce memory footprint and computational demand.

Each of these paradigms has its own strengths and is suited to different use
cases:
• Cloud ML remains essential for tasks requiring massive computational
power or large-scale data analysis.
• Edge ML is ideal for applications needing low-latency responses or local
data processing in industrial or enterprise environments.
Figure 2.3: From cloud GPUs to microcontrollers: navigating the memory and
storage landscape across computing devices. Source: (Ji Lin, Zhu, et al. 2023)
Figure 2.3 illustrates the key differences between Cloud ML, Edge ML, Mobile
ML, and Tiny ML in terms of hardware, latency, connectivity, power require-
ments, and model complexity. As we move from Cloud to Edge to Tiny ML,
we see a dramatic reduction in available resources, which presents significant
challenges for deploying sophisticated machine learning models. This resource
disparity becomes particularly apparent when attempting to deploy deep learn-
ing models on microcontrollers, the primary hardware platform for Tiny ML.
These tiny devices have severely constrained memory and storage capacities,
which are often insufficient for conventional deep learning models. We will
learn to put these things into perspective in this chapter.

2.2 Cloud-Based Machine Learning

The vast computational demands of modern machine learning often require
the scalability and power of centralized cloud¹ infrastructures. Cloud Machine
Learning (Cloud ML) handles tasks such as large-scale data processing,
collaborative model development, and advanced analytics. Cloud data centers
leverage distributed architectures, offering specialized resources to train
complex models and support diverse applications, from recommendation
systems² to natural language processing³.

¹ The cloud refers to networks of remote computing servers that provide
scalable storage, processing power, and specialized services for deploying
machine learning models.
² Recommendation systems: An AI technology used to personalize user
experiences by predicting and showcasing what users would enjoy or find
suitable based on their past behavior or interactions.
³ Natural Language Processing (NLP): A branch of AI that gives machines the
ability to read, understand and derive meaning from human languages to
perform tasks like translation, sentiment analysis, and topic classification.

Definition of Cloud ML

Cloud Machine Learning (Cloud ML) refers to the deployment of machine
learning models on centralized computing infrastructures, such as data centers.
These systems operate in the kilowatt to megawatt power range and utilize
specialized computing systems to handle large-scale datasets and train complex
models. Cloud ML offers scalability and computational capacity, making it
well-suited for tasks requiring extensive resources.
2.2.1 Characteristics

One of the key characteristics of Cloud ML is its centralized infrastructure.
Figure 2.5 illustrates this concept with an example from Google's Cloud TPU⁴
data center. Cloud service providers offer a virtual platform⁵ that consists
of high-capacity servers, expansive storage solutions, and robust networking
architectures, all housed in data centers distributed across the globe. As shown
in the figure, these centralized facilities can be massive in scale, housing rows
upon rows of specialized hardware. This centralized setup allows for the
pooling and efficient management of computational resources, making it easier
to scale machine learning projects as needed.⁶

⁴ Tensor Processing Units (TPUs) are Google's custom-designed AI accelerator
chips optimized for machine learning workloads, particularly deep neural
network training and inference.
⁵ Virtual platforms abstract physical hardware through software interfaces,
enabling efficient resource management and automated scaling across multiple
users without direct hardware interaction.
⁶ While centralized infrastructure enables efficient resource management and
scalability, increasing physical distance between data centers and end-users
can introduce latency and data privacy challenges.

Cloud ML excels in its ability to process and analyze massive volumes of data.
The centralized infrastructure is designed to handle complex computations and
model training tasks that require significant computational power. By leveraging
the scalability of the cloud, machine learning models can be trained on vast
amounts of data, leading to improved learning capabilities and predictive
performance.
Cloud ML promotes collaboration and resource sharing among teams and
organizations. The centralized nature of the cloud infrastructure enables multiple
data scientists and engineers to access and work on the same machine learning
projects simultaneously. This collaborative approach facilitates knowledge
sharing, accelerates the development cycle from experimentation to production,

Internet of Things (IoT): A system of interrelated computing devices, mechanical
and digital machines, capable of transferring data over a network without
human-to-human or human-to-computer interaction.
2.2.2 Benefits
Cloud ML offers several significant benefits that make it a powerful choice for
machine learning projects:
One of the key advantages of Cloud ML is its ability to provide vast com-
putational resources. The cloud infrastructure is designed to handle complex
algorithms and process large datasets efÏciently. This is particularly beneficial
for machine learning models that require significant computational power, such
2.2.3 Challenges
While Cloud ML offers numerous benefits, it also comes with certain challenges
that organizations need to consider:
Definition of Edge ML

Edge Machine Learning (Edge ML) describes the deployment of machine
learning models at or near the edge of the network. These systems operate in
the tens to hundreds of watts range and rely on localized hardware optimized
for real-time processing. Edge ML minimizes latency and enhances privacy
by processing data locally, but its primary limitation lies in restricted
computational resources.

¹³ IoT Hubs: Devices or services that manage data communication between
IoT devices and the cloud.
2.3.1 Characteristics
In Edge ML, data processing happens in a decentralized fashion, as illustrated
in Figure 2.7. Instead of sending data to remote servers, the data is processed
locally on devices like smartphones, tablets, or Internet of Things (IoT) devices.
The figure showcases various examples of these edge devices, including wear-
ables, industrial sensors, and smart home appliances. This local processing
allows devices to make quick decisions based on the data they collect without
relying heavily on a central server’s resources.
Local data storage and computation are key features of Edge ML. This setup
ensures that data can be stored and analyzed directly on the devices, thereby
maintaining the privacy of the data and reducing the need for constant inter-
net connectivity. Moreover, this approach reduces latency in decision-making
processes, as computations occur closer to where data is generated. This prox-
imity not only enhances real-time capabilities but also often results in more
efÏcient resource utilization, as data doesn’t need to travel across networks,
saving bandwidth and energy consumption.
2.3.2 Benefits
One of Edge ML’s main advantages is the significant latency reduction compared
to Cloud ML. This reduced latency can be a critical benefit in situations where
milliseconds count, such as in autonomous vehicles, where quick decision-
making can mean the difference between safety and an accident.
Edge ML also offers improved data privacy, as data is primarily stored and
processed locally. This minimizes the risk of data breaches that are more
common in centralized data storage solutions. Sensitive information can be
kept more secure, as it’s not sent over networks that could be intercepted.
Operating closer to the data source means less data must be sent over net-
works, reducing bandwidth usage. This can result in cost savings and efficiency
gains, especially in environments where bandwidth is limited or costly.
2.3.3 Challenges
However, Edge ML has its challenges. One of the main concerns is the limited
computational resources compared to cloud-based solutions. Endpoint devices
may have less processing power or storage capacity than cloud servers,
limiting the complexity of the machine learning models that can be deployed.
Managing a network of edge nodes can introduce complexity, especially
regarding coordination, updates, and maintenance. Ensuring all nodes operate
seamlessly and are up-to-date with the latest algorithms and security protocols
can be a logistical challenge.
While Edge ML offers enhanced data privacy, edge nodes can sometimes
be more vulnerable to physical and cyber-attacks. Developing robust security
protocols that protect data at each node without compromising the system’s
efficiency remains a significant challenge in deploying Edge ML solutions.
Definition of Mobile ML
2.4.1 Characteristics
Mobile ML utilizes the processing power of mobile devices' System-on-Chip (SoC)15 architectures, including specialized Neural Processing Units (NPUs)16 and AI accelerators. This enables efficient execution of ML models directly on the device, allowing for real-time processing of data from device sensors like cameras, microphones, and motion sensors without constant cloud connectivity.
Mobile ML is supported by specialized frameworks and tools designed specifically for mobile deployment, such as TensorFlow Lite for Android devices and Core ML for iOS devices. These frameworks are optimized for mobile hardware and provide efficient model compression17 and quantization techniques to ensure smooth performance within mobile resource constraints.

15 System-on-Chip (SoC): An integrated circuit that packages essential components of a computer or other system into a single chip.
16 Neural Processing Unit (NPU): A specialized hardware unit designed for accelerated processing of AI and machine learning algorithms.
17 Model compression reduces ML model size through techniques like pruning, quantization, and knowledge distillation. This process decreases memory requirements and computational demands while preserving key model functionality, enabling efficient deployment on resource-constrained devices.

2.4.2 Benefits
Mobile ML enables real-time processing of data directly on mobile devices, eliminating the need for constant server communication. This results in faster response times for applications requiring immediate feedback, such as real-time translation, face detection, or gesture recognition.
By processing data locally on the device, Mobile ML helps maintain user privacy. Sensitive information doesn't need to leave the device, reducing the risk of data breaches and addressing privacy concerns, particularly important for applications handling personal data.
Mobile ML applications can function without constant internet connectivity, making them reliable in areas with poor network coverage or when users are offline. This ensures consistent performance and user experience regardless of network conditions.
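To make the compression and quantization step concrete, the sketch below shows post-training quantization with TensorFlow Lite's converter API. The model path and the choice of the default optimization set are illustrative assumptions; real deployments tune these settings per application.

import tensorflow as tf

# Load a trained model exported as a SavedModel (the path is a placeholder).
converter = tf.lite.TFLiteConverter.from_saved_model("path/to/saved_model")

# Enable the default optimization set, which applies post-training quantization.
converter.optimizations = [tf.lite.Optimize.DEFAULT]

# Convert and write the compact .tflite artifact for on-device inference.
tflite_model = converter.convert()
with open("model.tflite", "wb") as f:
    f.write(tflite_model)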
2.4.3 Challenges
Despite modern mobile devices being powerful, they still face resource con-
straints compared to cloud servers. Mobile ML must operate within limited
RAM, storage, and processing power, requiring careful optimization of models
and efÏcient resource management.
ML operations can be computationally intensive, potentially impacting device
battery life. Developers must balance model complexity and performance with
power consumption to ensure reasonable battery life for users.
Mobile devices have limited storage space, necessitating careful considera-
tion of model size. This often requires model compression and quantization
techniques, which can affect model accuracy and performance.
Definition of Tiny ML
Figure 2.8 encapsulates the key aspects of Tiny ML discussed in this section.
2.5.1 Characteristics
In Tiny ML, the focus, much like in Mobile ML, is on on-device machine learning.
This means that machine learning models are deployed and trained on the
device, eliminating the need for external servers or cloud infrastructures. This
allows Tiny ML to enable intelligent decision-making right where the data
is generated, making real-time insights and actions possible, even in settings
where connectivity is limited or unavailable.
2.5.2 Benefits
One of the standout benefits of Tiny ML is its ability to offer ultra-low latency.
Since computation occurs directly on the device, the time required to send
data to external servers and receive a response is eliminated. This is crucial in
applications requiring immediate decision-making, enabling quick responses
to changing conditions.
Tiny ML inherently enhances data security. Because data processing and
analysis happen on the device, the risk of data interception during transmission
is virtually eliminated. This localized approach to data management ensures
that sensitive information stays on the device, strengthening user data security.
Tiny ML operates within an energy-efficient framework, a necessity given
its resource-constrained environments. By employing lean algorithms and
optimized computational methods, Tiny ML ensures that devices can execute
complex tasks without rapidly depleting battery life, making it a sustainable
option for long-term deployments.
2.5.3 Challenges
However, the shift to Tiny ML comes with its set of hurdles. The primary
limitation is the devices’ constrained computational capabilities. The need to
operate within such limits means that deployed models must be simplified,
which could affect the accuracy and sophistication of the solutions.
Tiny ML also introduces a complicated development cycle. Crafting light-
weight and effective models demands a deep understanding of machine learn-
ing principles and expertise in embedded systems. This complexity calls for specialized expertise that spans both machine learning and embedded development.
Definition of Hybrid ML
In these hierarchical deployments, tiny sensors collect raw data, edge devices aggregate and analyze data from multiple sensors, and cloud systems handle complex
analytics and model updates. For instance, we might see ESP32-CAM devices
performing basic image classification at the sensor level with their minimal 520
KB RAM, feeding data up to Jetson AGX Orin devices for more sophisticated
computer vision tasks, and ultimately connecting to cloud infrastructure for
complex analytics and model updates.
This hierarchy allows each tier to handle tasks appropriate to its capabilities.
Tiny ML devices handle immediate, simple decisions; edge devices manage
local coordination; and cloud systems tackle complex analytics and learning
tasks. Smart city installations often use this pattern, with street-level sensors
feeding data to neighborhood-level edge processors, which in turn connect to
city-wide cloud analytics.
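A minimal sketch of this tiered routing logic is shown below. The model objects, the cloud client, and the confidence threshold are hypothetical placeholders; the point is that each tier handles what it can and escalates the rest.

def route_inference(sensor_reading, tiny_model, edge_model, cloud_client,
                    threshold=0.9):
    # Tier 1: the on-sensor model makes an immediate, simple decision.
    label, confidence = tiny_model(sensor_reading)
    if confidence >= threshold:
        return label

    # Tier 2: uncertain cases escalate to a nearby edge device.
    label, confidence = edge_model(sensor_reading)
    if confidence >= threshold:
        return label

    # Tier 3: the hardest cases go to the cloud, which can also log them
    # for later analytics and model updates.
    return cloud_client.classify(sensor_reading)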
(Figure: ML system implementations. Cloud ML, Edge ML, Mobile ML, and Tiny ML exchange data, processing, analytics, and results, and can assist one another while resting on shared core principles.)
The figure shows three key layers that help us understand how ML systems
relate to each other. At the top, we see the diverse implementations that we
have explored throughout this chapter. Cloud ML operates in data centers,
focusing on training at scale with vast computational resources. Edge ML
emphasizes local processing with inference capabilities closer to data sources.
Mobile ML leverages personal devices for user-centric applications. Tiny ML
brings intelligence to highly constrained embedded systems and sensors.
Despite their distinct characteristics, the arrows in the figure show how all these implementations connect to the same core system principles. This reflects an important reality in ML systems: even though they may operate at dramatically different scales, from cloud systems processing petabytes to tiny devices handling kilobytes, they all must solve similar fundamental challenges in terms of:
• Managing data pipelines from collection through processing to deploy-
ment
• Balancing resource utilization across compute, memory, energy, and net-
work
• Implementing system architectures that effectively integrate models, hard-
ware, and software
Data pipeline management spans these extremes, from cloud systems ingesting petabytes to embedded devices working within kilobytes of limited memory. Despite these scale differences, all systems must address the same fundamental challenges of data ingestion, transformation, and utilization.
Resource Management emerges as a universal challenge across all implemen-
tations. Whether managing thousands of GPUs in a data center or optimizing
battery life on a microcontroller, all systems must balance competing demands
for computation, memory, energy, and network resources. The quantities in-
volved may differ by orders of magnitude, but the core principles of resource
allocation and optimization remain remarkably consistent.
System Architecture principles guide how ML systems integrate models,
hardware, and software components. Cloud architectures might focus on
distributed computing and scalability, while tiny systems emphasize efficient
memory mapping and interrupt handling. Yet all must solve fundamental
problems of component integration, data flow optimization, and processing
coordination.
Table 2.2: Comparison of feature aspects across Cloud ML, Edge ML, Mobile ML, and Tiny ML.

| Aspect | Cloud ML | Edge ML | Mobile ML | Tiny ML |
| --- | --- | --- | --- | --- |
| Performance | | | | |
| Processing Location | Centralized cloud servers (data centers) | Local edge devices (gateways, servers) | Smartphones and tablets | Ultra-low-power microcontrollers and embedded systems |
| Latency | High (100 ms-1000 ms+) | Moderate (10-100 ms) | Low-Moderate (5-50 ms) | Very Low (1-10 ms) |
| Compute Power | Very High (multiple GPUs/TPUs) | High (edge GPUs) | Moderate (mobile NPUs/GPUs) | Very Low (MCU/tiny processors) |
| Storage Capacity | Unlimited (petabytes+) | Large (terabytes) | Moderate (gigabytes) | Very Limited (kilobytes-megabytes) |
| Energy Consumption | Very High (kW-MW range) | High (100s W) | Moderate (1-10 W) | Very Low (mW range) |
| Scalability | Excellent (virtually unlimited) | Good (limited by edge hardware) | Moderate (per-device scaling) | Limited (fixed hardware) |
| Operational | | | | |
| Data Privacy | Basic-Moderate (data leaves device) | High (data stays in local network) | High (data stays on phone) | Very High (data never leaves sensor) |
| Connectivity Required | Constant high-bandwidth | Intermittent | Optional | None |
| Offline Capability | None | Good | Excellent | Complete |
| Real-time Processing | Dependent on network | Good | Very Good | Excellent |
| Deployment | | | | |
| Cost | High ($1000s+/month) | Moderate ($100s-1000s) | Low ($0-10s) | Very Low ($1-10s) |
| Hardware Requirements | Cloud infrastructure | Edge servers/gateways | Modern smartphones | MCUs/embedded systems |
| Development Complexity | High (cloud expertise needed) | Moderate-High (edge + networking) | Moderate (mobile SDKs) | High (embedded expertise) |
| Deployment Speed | Fast | Moderate | Fast | Slow |
Mobile ML achieves even lower latencies of 5-50 ms for many tasks, and TinyML
systems can respond in 1-10 ms for simple inferences. Similarly, privacy and
data handling improve progressively as computation shifts closer to the data
source, with TinyML offering the strongest guarantees by keeping data entirely
local to the device.
The table is designed to provide a high-level view of how these paradigms
differ across key dimensions, making it easier to understand the trade-offs and
select the most appropriate approach for specific deployment needs.
To complement the details presented in Table 2.2, radar plots are presented
below. These visualizations highlight two critical dimensions: performance
characteristics and operational characteristics. The performance characteristics
plot in Figure 2.12 focuses on latency, compute power, energy consumption,
and scalability. As discussed earlier, Cloud ML demands exceptional compute
power and demonstrates good scalability, making it ideal for large-scale tasks
requiring extensive resources. Tiny ML, in contrast, excels in latency and energy
efficiency due to its lightweight and localized processing, suitable for low-power,
real-time scenarios. Edge ML and Mobile ML strike a balance, offering moderate
scalability and efÏciency for a variety of applications.
Figure 2.12 and Figure 2.13 present these radar plots for the performance and operational characteristics, respectively.

Figure 2.14 shows a decision flowchart for selecting the most suitable ML deployment paradigm. It works through three layers of questions: a privacy layer (is privacy critical?), a performance layer (is low latency, under 10 ms, required?), distinguishing lightweight processing from heavy compute, and a cost layer (are there strict cost constraints?), distinguishing low-cost options from a flexible budget.
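One plausible way to encode the flowchart's questions in code is sketched below; the exact branch outcomes belong to the figure, so treat the returned suggestions as illustrative rather than definitive.

def suggest_paradigm(privacy_critical, needs_low_latency, strict_cost_limit):
    # Privacy layer: keeping data on the device rules out cloud-centric options.
    if privacy_critical:
        return "Tiny ML / Mobile ML (data stays on the device)"
    # Performance layer: sub-10 ms responses require computation near the data.
    if needs_low_latency:
        return "Tiny ML / Edge ML (local, lightweight processing)"
    # Cost layer: strict budgets favor inexpensive on-device hardware.
    if strict_cost_limit:
        return "Mobile ML / Tiny ML (low-cost options)"
    # Otherwise, heavy compute with a flexible budget points to the cloud.
    return "Cloud ML (maximum compute and scalability)"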
2.10 Conclusion
This chapter has explored the diverse landscape of machine learning systems,
highlighting their unique characteristics, benefits, challenges, and applications.
Cloud ML leverages immense computational resources, excelling in large-scale
data processing and model training but facing limitations such as latency and
privacy concerns. Edge ML bridges this gap by enabling localized processing,
reducing latency, and enhancing privacy. Mobile ML builds on these strengths,
harnessing the ubiquity of smartphones to provide responsive, user-centric
applications. At the smallest scale, Tiny ML extends the reach of machine
learning to resource-constrained devices, opening new domains of application.
Together, these paradigms reflect an ongoing progression in machine learning, moving from centralized systems in the cloud to increasingly distributed and specialized deployments across edge, mobile, and tiny devices. This evolution marks a shift toward systems that are finely tuned to specific deployment contexts, balancing computational power, energy efficiency, and real-time responsiveness.
2.11 Resources
Slides
Videos
• Coming soon.
Exercises
Chapter 3
DL Primer
Purpose
What inspiration from nature drives the development of machine learning systems, and
how do biological neural processes inform their fundamental design?
The neural systems of nature offer profound insights into information process-
ing and adaptation, inspiring the core principles of modern machine learning.
Translating biological mechanisms into computational frameworks illuminates
fundamental patterns that shape artificial neural networks. These patterns
reveal essential relationships between biological principles and their digital
counterparts, establishing building blocks for understanding more complex
architectures. Analyzing these mappings from natural to artificial provides crit-
ical insights into system design, laying the foundation for exploring advanced
neural architectures and their practical implementations.
Learning Objectives
3.1 Overview
Neural networks, a foundational concept within machine learning and artificial
intelligence, are computational models inspired by the structure and function
of biological neural systems. These networks represent a critical intersection of
algorithms, mathematical frameworks, and computing infrastructure, making
them integral to solving complex problems in AI.
When studying neural networks, it is helpful to place them within the broader
hierarchy of AI and machine learning. Figure 3.2 provides a visual representa-
tion of this context. AI, as the overarching field, encompasses all computational
methods that aim to mimic human cognitive functions. Within AI, machine
learning includes techniques that enable systems to learn patterns from data.
Neural networks, a key subset of ML, form the backbone of more advanced
learning systems, including deep learning, by modeling complex relationships
in data through interconnected computational units.
The emergence of neural networks reflects key shifts in how AI systems
process information across three fundamental dimensions:
• Data: From manually structured and rule-based datasets to raw, high-
dimensional data. Neural networks are particularly adept at learning
from complex and unstructured data, making them essential for tasks
involving images, speech, and text.
• Algorithms: From explicitly programmed rules to adaptive systems ca-
pable of learning patterns directly from data. Neural networks eliminate
the need for manual feature engineering by discovering representations
automatically through layers of interconnected units.
• Computation: From simple, sequential operations to massively parallel
computations. The scalability of neural networks has driven demand
for advanced hardware, such as GPUs, that can efficiently process large models and datasets.
Consider how classical programming encodes behavior as explicit, hand-written rules, as in this game-physics snippet:

if (ball.collide(brick)) {
  removeBrick();
  ball.dx = 1.1 * (ball.dx);
  ball.dy = -1 * (ball.dy);
}

(Figure: in classical programming, hand-written rules are combined with data to produce answers.)
This challenge extends to computer vision tasks. Detecting objects like cats in images would require rules about pointed ears, whiskers, and typical body shapes. Such rules would need to account for variations in viewing angle, lighting conditions, partial occlusions, and natural variations among instances. Early computer vision systems attempted this approach through geometric rules but achieved success only in controlled environments with well-defined objects.
This knowledge engineering approach0 characterized artificial intelligence research in the 1970s and 1980s. Expert systems1 encoded domain knowledge as explicit rules, showing promise in specific domains with well-defined parameters but struggling with tasks humans perform naturally, such as object recognition, speech understanding, or natural language interpretation. These limitations highlighted a fundamental challenge: many aspects of intelligent behavior rely on implicit knowledge that resists explicit rule-based representation.

0 Knowledge Engineering: The process of creating rules and heuristics for problem-solving and decision-making within artificial intelligence systems.
1 Expert systems: An AI program that leverages expert knowledge in a particular field to answer questions or solve problems.
(Figure: in machine learning, data and answers are used to learn the rules, inverting the classical programming flow.)
These fundamental shifts explain why deep learning has spurred innovations across the entire computing stack. From specialized hardware accelerators3 to new memory architectures4 to sophisticated software frameworks, the demands of deep learning continue to reshape computer system design. Interestingly, many of these challenges, efficiency, scaling, and adaptability, are ones that biological systems have already solved. This brings us to a critical question: what can we learn from nature's own information processing systems, and how can we mimic them in artificially intelligent systems?

4 Memory architecture: The design of a computer's memory system, including the physical structure and components, data organization and access, and pathways between memory and computing units.
3.3 Biological to Artificial Neurons
The quest to create artificial intelligence has been profoundly influenced by
our understanding of biological intelligence, particularly the human brain.
This isn’t surprising; the brain represents the most sophisticated information
processing system we know of. It is capable of learning, adapting, and solving
complex problems while maintaining remarkable energy efficiency. The way
our brains function has provided fundamental insights that continue to shape
how we approach artificial intelligence.
The brain performs cognitive tasks that would require orders of magnitude more power in current artificial systems. This efficiency hasn't just impressed researchers; it has become a crucial goal in the development of AI hardware and algorithms.
These biological principles have led to two distinct but complementary ap-
proaches in artificial intelligence. The first attempts to directly mimic neural
structure and function, leading to artificial neural networks and deep learn-
ing architectures that structurally resemble biological neural networks. The
second takes a more abstract approach, adapting biological principles to work
efficiently within the constraints of computer hardware without necessarily
copying biological structures exactly. In the following sections, we will explore
how these approaches manifest in practice, beginning with the fundamental
building block of neural networks: the neuron itself.
A biological neuron consists of several key components. The central part is the
cell body, or soma, which contains the nucleus and performs the cell’s basic life
processes. Extending from the soma are branch-like structures called dendrites,
which receive signals from other neurons. At the junctions where signals are
passed between neurons are synapses. Finally, a long, slender projection called
the axon conducts electrical impulses away from the cell body to other neurons.
The neuron functions as follows: Dendrites receive inputs from other neurons,
with synapses determining the strength of the connections. The soma integrates
these signals and decides whether to trigger an output signal. If triggered, the
axon transmits this signal to other neurons.
Each element of a biological neuron has a computational analog in artificial systems, reflecting the principles of learning, adaptability, and efficiency found in biological neural systems. A neuron integrates inputs from thousands of other neurons and produces a binary output signal based on whether this integrated input exceeds a threshold. The connection strengths between neurons, mediated by synapses, are continuously modified through experience. This synaptic plasticity5 forms the basis for learning and adaptation in biological neural networks. These biological principles suggest key computational elements needed in artificial neural systems:
• Simple processing units that integrate multiple inputs
• Adjustable connection strengths between units
• Nonlinear activation based on input thresholds
• Parallel processing architecture
• Learning through modification of connection strengths

5 Synaptic Plasticity: The ability of connections between neurons to change in strength in response to changes in synaptic activity.
The basic computational unit in artificial neural networks, the artificial neu-
ron, simplifies the complex electrochemical processes of biological neurons into
three fundamental operations. First, input signals are weighted, mimicking
how biological synapses modulate incoming signals with different strengths.
Second, these weighted inputs are summed together, analogous to how a bio-
logical neuron integrates incoming signals in its cell body. Finally, the summed
input passes through an activation function that determines the neuron’s out-
put, similar to how a biological neuron fires based on whether its membrane
potential exceeds a threshold.
This mathematical abstraction preserves key computational principles while
enabling efficient digital implementation. The weighting of inputs allows the
network to learn which connections are important, just as biological neural
networks strengthen or weaken synaptic connections through experience. The
summation operation captures how biological neurons integrate multiple inputs.
The evolutionary trends were driven by parallel advances across three fundamental dimensions: data availability, algorithmic innovations, and computing infrastructure. As Figure 9.15 shows, these three factors reinforced each other: more powerful computing infrastructure enabled processing larger datasets, larger datasets drove algorithmic innovations, and better algorithms demanded more sophisticated computing systems. This virtuous cycle continues to drive progress in the field today.
Key Breakthroughs
The data revolution transformed what was possible with neural networks.
The rise of the internet and digital devices created unprecedented access to
training data. Image sharing platforms provided millions of labeled images.
Digital text collections enabled language processing at scale. Sensor networks
and IoT devices generated continuous streams of real-world data. This abun-
dance of data provided the raw material needed for neural networks to learn
complex patterns effectively.
Algorithmic innovations made it possible to harness this data effectively.
New methods for initializing networks and controlling learning rates made
training more stable. Techniques for preventing overfitting allowed models to
generalize better to new data. Most importantly, researchers discovered that
neural network performance scaled predictably with model size, computation,
and data quantity, leading to increasingly ambitious architectures.
Computing infrastructure evolved to meet these growing demands. On the
hardware side, graphics processing units (GPUs) provided the parallel process-
ing capabilities needed for efficient neural network computation. Specialized AI accelerators like TPUs (Jouppi, Young, et al. 2017a) pushed performance further.
High-bandwidth memory systems and fast interconnects addressed data move-
ment challenges. Equally important were software advances—frameworks
and libraries that made it easier to build and train networks, distributed com-
puting systems that enabled training at scale, and tools for optimizing model
deployment.
(Figure: a perceptron, with inputs x_i multiplied by weights w_ij, a bias b, and an activation function producing the output.)

The perceptron first computes a weighted sum of its inputs:

𝑧 = ∑(𝑥𝑖 ⋅ 𝑤𝑖𝑗 )
To this intermediate calculation, a bias term 𝑏 is added, allowing the model
to better fit the data by shifting the linear output function up or down. Thus,
the intermediate linear combination computed by the perceptron including the
bias becomes:
𝑧 = ∑(𝑥𝑖 ⋅ 𝑤𝑖𝑗 ) + 𝑏
Common activation functions include:8

• ReLU (Rectified Linear Unit): Defined as 𝑓(𝑥) = max(0, 𝑥), it introduces sparsity and accelerates convergence in deep networks. Its simplicity and effectiveness have made it the default choice in many modern architectures.
• Sigmoid: Historically popular, the sigmoid function maps inputs to a range between 0 and 1 but is prone to vanishing gradients in deeper architectures. It's particularly useful in binary classification problems where probabilities are needed.
• Tanh: Similar to sigmoid but maps inputs to a range of −1 to 1, centering the data. This centered output often leads to faster convergence in practice compared to sigmoid.

8 Activation Function: A mathematical 'gate' between the input from the previous layer and the output of the current layer, adding non-linearity to model complex patterns.
These activation functions transform the linear input sum into a non-linear
output:
𝑦 ̂ = 𝜎(𝑧)
Figure 3.12 shows an example where data exhibit a nonlinear pattern that
could not be adequately modeled with a linear approach. The activation func-
tion enables the network to learn and represent complex relationships in the
data, making it possible to solve sophisticated tasks like image recognition or
speech processing.
Thus, the final output of the perceptron, including the activation function,
can be expressed as:
𝑦̂ = 𝜎 (∑(𝑥𝑖 ⋅ 𝑤𝑖𝑗 ) + 𝑏)
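A minimal numeric sketch of this computation, assuming a sigmoid activation and made-up input, weight, and bias values:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def perceptron_forward(x, w, b):
    # Weighted sum of the inputs plus the bias, then the nonlinear activation.
    z = np.dot(x, w) + b
    return sigmoid(z)

x = np.array([0.5, -1.0, 2.0])    # example inputs (arbitrary values)
w = np.array([0.4, 0.3, -0.2])    # example weights
y_hat = perceptron_forward(x, w, b=0.1)   # a single output between 0 and 1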
z(𝑙) = W(𝑙) x(𝑙−1) + b(𝑙)
Where:
• x(𝑙−1) is the input vector from the previous layer
• W(𝑙) is the weight matrix for the current layer
• b(𝑙) is the bias vector
• z(𝑙) is the pre-activation output9
9 Pre-activation output: The output produced by a neuron in a neural network before the activation function is applied.

Now that we have covered the basics, Video 1 provides a great overview of how neural networks work using handwritten digit recognition. It introduces some new concepts that we will explore in more depth soon, but it serves as an excellent introduction.

Watch on YouTube: Neural Network
3.4.2 Weights and Biases
3.4.2.1 Weight Matrices
Weights in neural networks determine how strongly inputs influence the output of a neuron. While we first discussed weights for a single perceptron, in larger networks, weights are organized into matrices for efficient computation across
entire layers. For example, in a layer with 𝑛 input features and 𝑚 neurons, the
weights form a matrix W ∈ ℝ𝑛×𝑚 . Each column in this matrix represents the
weights for a single neuron in the layer. This organization allows the network
z = x𝑇 W + b
(Figure 3.14: Dense connections between layers in an MLP, showing individual input-to-hidden (ihWeight) and hidden-to-output (hoWeight) weights together with hidden (hBias) and output (oBias) biases. Source: J. McCaffrey)
h(𝑙) = 𝑓(z(𝑙)), where h(𝑙) represents the layer's output after applying the activation function.
The output layer produces one value per digit class, where higher values indicate greater confidence that the image represents that particular digit.
Between these fixed input and output layers, we have flexibility in designing
the hidden layer topology. The choice of hidden layer structure, including
the number of layers to use and their respective widths, represents one of the
fundamental design decisions in neural networks. Additional layers increase the
network’s depth, allowing it to learn more abstract features through successive
transformations. The width of each layer provides capacity for learning different
features at each level of abstraction.
Sparse connectivity draws inspiration from biological neural systems, where neurons typically form
connections with a limited number of other neurons. In visual processing tasks
like our MNIST example, neurons might connect only to inputs representing
nearby pixels, reflecting the local nature of visual features.
As networks grow deeper, the path from input to output becomes longer,
potentially complicating the learning process. Skip connections address this by
adding direct paths between non-adjacent layers. These connections provide
alternative routes for information flow, supplementing the standard layer-by-
layer progression. In our digit recognition example, skip connections might
allow later layers to reference both high-level patterns and the original pixel
values directly.
These connection patterns have significant implications for both the theo-
retical capabilities and practical implementation of neural networks. Dense
connections maximize learning flexibility at the cost of computational efficiency.
Sparse connections can reduce computational requirements while potentially
improving the network’s ability to learn structured patterns. Skip connections
help maintain effective information flow in deeper networks.
𝑦 ̂ = 𝑓(𝑥; 𝜃)
where 𝑓 represents the neural network function and 𝜃 represents all trainable
parameters (weights and biases, which we discussed earlier). The network’s
error is measured by a loss function 𝐿:
loss = 𝐿(𝑦,̂ 𝑦)
(Figure: during forward propagation the network's weights and biases produce a loss score via the loss function L; the optimizer uses this score to update the parameters during backward propagation.)

For a single neuron, the forward pass computes

z = ∑_{i=1}^{n} w_i x_i + b
where 𝑤𝑖 represents the weights, 𝑥𝑖 the inputs, and 𝑏 the bias term. For an entire
layer of neurons, we can express this more efficiently using matrix operations:
Z(𝑙) = W(𝑙) A(𝑙−1) + b(𝑙)
Here, W(𝑙) represents the weight matrix for layer 𝑙, A(𝑙−1) contains the activa-
tions from the previous layer, and b(𝑙) is the bias vector.
Following this linear transformation, each layer applies a nonlinear activation
function 𝑓:
A(𝑙) = 𝑓(Z(𝑙) )
This process repeats at each layer, creating a chain of transformations:
Input → Linear Transform → Activation → Linear Transform → Activation
→ … → Output
In our MNIST example, the pixel values first undergo a transformation by
the first hidden layer’s weights, converting the 784-dimensional input into an
intermediate representation. Each subsequent layer further transforms this rep-
resentation, ultimately producing a 10-dimensional output vector representing
the network’s confidence in each possible digit.
A(𝐿) = 𝑓 (𝐿) (W(𝐿) 𝑓 (𝐿−1) (W(𝐿−1) ⋯ (𝑓 (1) (W(1) X + b(1) )) ⋯ + b(𝐿−1) ) + b(𝐿) )
For each image in our batch, this gives us a probability distribution over the
possible digits. The digit with the highest probability becomes the network’s
prediction.
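The following sketch traces this chain for the running 784 → 100 → 100 → 10 example, using the row-vector convention z = x^T W + b introduced earlier. The ReLU hidden activations, softmax output, and randomly initialized weights are assumptions made purely for illustration.

import numpy as np

def relu(z):
    return np.maximum(0, z)

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

# Layer sizes follow the running example: 784 -> 100 -> 100 -> 10.
rng = np.random.default_rng(0)
W1, b1 = rng.normal(0, 0.01, (784, 100)), np.zeros(100)
W2, b2 = rng.normal(0, 0.01, (100, 100)), np.zeros(100)
W3, b3 = rng.normal(0, 0.01, (100, 10)), np.zeros(10)

def forward(X):
    # Each layer applies a linear transform followed by a nonlinear activation.
    A1 = relu(X @ W1 + b1)
    A2 = relu(A1 @ W2 + b2)
    return softmax(A2 @ W3 + b3)       # (batch, 10) class probabilities

X = rng.random((32, 784))              # a batch of 32 flattened images
predictions = forward(X).argmax(axis=1)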
[0.1, 0.1, 0.1, 0.0, 0.0, 0.0, 0.2, 0.3, 0.1, 0.1]
The highest confidence (0.3) is assigned to digit “7”, but this confidence is
quite low, indicating uncertainty in the prediction. A good loss function would
produce a high loss value here, signaling that the network needs significant
improvement. Conversely, if the network outputs:
[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.9, 0.0, 0.1]
The loss function should produce a lower value, as this prediction is much
closer to ideal.
In practice, the loss is computed over a batch of examples:

L_batch = (1/B) ∑_{i=1}^{B} L(ŷ_i, y_i)
where 𝐵 is the batch size and (𝑦𝑖̂ , 𝑦𝑖 ) represents the prediction and truth for
the 𝑖-th example.
The choice of loss function depends on the type of task. For our MNIST
classification problem, we need a loss function that can:
1. Handle probability distributions over multiple classes
2. Provide meaningful gradients for learning
3. Penalize wrong predictions effectively
4. Scale well with batch processing
For example, if the true label for an image is the digit "7", its one-hot encoding is:

𝑦 = [0, 0, 0, 0, 0, 0, 0, 1, 0, 0]
The cross-entropy loss for this example is:
L(ŷ, y) = − ∑_{j=1}^{10} y_j log(ŷ_j)
where 𝑦𝑗̂ represents the network’s predicted probability for digit j. Given our
one-hot encoding, this simplifies to:
𝐿(𝑦,̂ 𝑦) = − log(𝑦𝑐̂ )
where 𝑐 is the index of the correct class. This means the loss depends only on
the predicted probability for the correct digit—the network is penalized based
on how confident it is in the right answer.
For example, if our network predicts the following probabilities for an image
of “7”:
Predicted: [0.1, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.8, 0.0, 0.1]
True: [0, 0, 0, 0, 0, 0, 0, 1, 0, 0]
Averaged over a batch, the loss becomes:

L_batch = −(1/B) ∑_{i=1}^{B} ∑_{j=1}^{10} y_{ij} log(ŷ_{ij})
For our MNIST example with a batch size of 32, this means:
• Processing 32 sets of 10 probabilities
• Computing 32 individual loss values
• Averaging these values to produce the final batch loss
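A small sketch of this batch cross-entropy computation, using the example prediction for the digit "7" shown above (the epsilon term is an implementation detail added to avoid log(0)):

import numpy as np

def batch_cross_entropy(y_hat, y_true):
    # y_hat: (B, 10) predicted probabilities; y_true: (B, 10) one-hot labels.
    return -np.mean(np.sum(y_true * np.log(y_hat + 1e-12), axis=1))

# The example prediction for an image of "7".
y_hat = np.array([[0.1, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.8, 0.0, 0.1]])
y_true = np.array([[0, 0, 0, 0, 0, 0, 0, 1, 0, 0]])

loss = batch_cross_entropy(y_hat, y_true)   # -log(0.8), roughly 0.22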
Video 2 and Video 3 give a good high-level overview of how cost functions help neural networks learn.

Watch on YouTube: Gradient descent – Part 1
Watch on YouTube: Gradient descent – Part 2
These gradient computations follow from the chain rule of calculus but must be implemented efficiently for practical neural network training.
At each layer 𝑙, we compute three main gradient components:
1. Weight Gradients: ∂L/∂W^(l) = (∂L/∂Z^(l)) A^(l−1)T
2. Bias Gradients: ∂L/∂b^(l) = ∂L/∂Z^(l)
3. Input Gradients (for propagating to the previous layer): ∂L/∂A^(l−1) = W^(l)T (∂L/∂Z^(l))
In our MNIST example, consider the final layer where the network outputs
digit probabilities. If the network predicted [0.1, 0.2, 0.5, … , 0.05] for an image
of “7”, the gradient computation would:
1. Start with the error in these probabilities
2. Compute how weight adjustments would affect this error
3. Propagate these gradients backward to help adjust earlier layer weights
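A minimal sketch of the three per-layer gradient computations listed above, using the same matrix shapes as the forward-pass equations (the column-major layout, with examples along the second axis, is an assumption):

import numpy as np

def layer_backward(dZ, A_prev, W):
    # dZ:     (m_out, B) gradient of the loss w.r.t. this layer's Z^(l)
    # A_prev: (m_in, B)  activations A^(l-1) from the previous layer
    # W:      (m_out, m_in) weight matrix W^(l)
    dW = dZ @ A_prev.T                    # weight gradients
    db = dZ.sum(axis=1, keepdims=True)    # bias gradients (summed over the batch)
    dA_prev = W.T @ dZ                    # gradients passed to the previous layer
    return dW, db, dA_prev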
The per-example gradients are then averaged over the batch to update the parameters:

∇_θ L_batch = (1/B) ∑_{i=1}^{B} ∇_θ L_i
In our MNIST training, with a typical batch size of 32, this means:
1. Process 32 images through forward propagation
Figure 3.17: Comparing training versus inference data flow and computation. Training runs a forward pass and an error-driven backward pass over large batches (large N), while inference runs only the forward pass over smaller, varied inputs to produce predictions such as "person".
Table 3.5: Key differences between training and inference phases in neural networks.

| Aspect | Training | Inference |
| --- | --- | --- |
| Computation Flow | Forward and backward passes, gradient computation | Forward pass only, direct input to output |
| Parameters | Continuously updated weights and biases | Fixed/frozen weights and biases |
| Processing Pattern | Iterative loops over multiple epochs | Single pass through the network |
| Memory Requirements | High: stores activations, gradients, optimizer state | Lower: stores only model parameters and the current input |
This stark contrast between training and inference phases highlights why
system architectures often differ significantly between development and de-
ployment environments. While training requires substantial computational
resources and specialized hardware, inference can be optimized for efficiency
and deployed across a broader range of devices.
The key thing to notice from the figure is that machine learning systems op-
erate as hybrid architectures that combine conventional computing operations
with neural network computations. The neural network component, focused on
learned transformations through matrix operations, represents just one element
within a broader computational framework. This framework encompasses
both the preparation of input data and the interpretation of network outputs,
processes that rely primarily on traditional computing methods.
Consider how data flows through the pipeline in Figure 3.18:
1. Raw inputs arrive in their original form, which might be images, text,
sensor readings, or other data types
2. Pre-processing transforms these inputs into a format suitable for neural
network consumption
3. The neural network performs its learned transformations
4. Raw outputs emerge from the network, often in numerical form
5. Post-processing converts these outputs into meaningful, actionable results
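In code, this hybrid pipeline is little more than a chain of stages. The sketch below uses placeholder callables for the conventional and neural components; their names and signatures are assumptions.

def run_pipeline(raw_input, preprocess, model, postprocess):
    # Steps 1-2: conventional code turns raw data into the tensor the model expects.
    model_input = preprocess(raw_input)
    # Steps 3-4: the neural network applies its learned transformation and
    # emits raw numerical outputs (for example, class probabilities).
    raw_output = model(model_input)
    # Step 5: conventional code turns those numbers into an actionable result.
    return postprocess(raw_output)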
The hybrid nature of this architecture has significant implications for system
implementation. While neural network computations may benefit from spe-
cialized hardware accelerators, pre- and post-processing operations typically
execute on conventional processors. This distribution of computation across
heterogeneous hardware resources represents a fundamental consideration in
system design.
3.6.2 Pre-processing
The pre-processing stage transforms raw inputs into a format suitable for neural
network computation. While often overlooked in theoretical discussions, this
stage forms a critical bridge between real-world data and neural network oper-
ations. Consider our MNIST digit recognition example: before a handwritten
digit image can be processed by the neural network we designed earlier, it must
undergo several transformations. Raw images of handwritten digits arrive in
various formats, sizes, and pixel value ranges. For instance, in Figure 3.19, we
see that the digits are all of different sizes, and even the number 6 is written
differently by the same person.
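A minimal pre-processing sketch for this example, assuming the Pillow imaging library and a simple scale-to-[0, 1] normalization (the actual normalization scheme depends on how the network was trained):

import numpy as np
from PIL import Image

def preprocess_digit(image_path):
    # Load, convert to grayscale, and resize to the 28 x 28 grid the network expects.
    img = Image.open(image_path).convert("L").resize((28, 28))
    # Scale pixel values from [0, 255] to [0, 1] and flatten into a 784-vector.
    pixels = np.asarray(img, dtype=np.float32) / 255.0
    return pixels.reshape(1, 784)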
3.6.3 Inference
The inference phase represents the operational state of a neural network, where
learned parameters are used to transform inputs into predictions. Unlike the
training phase we discussed earlier, inference focuses solely on forward com-
putation with fixed parameters.
In total, the network requires storage for 89,610 learned parameters (89,400
weights plus 210 biases). Beyond these fixed parameters, memory must also
be allocated for intermediate activations during forward computation. For
processing a single image, this means allocating space for:
• First hidden layer activations: 100 values
• Second hidden layer activations: 100 values
• Output layer activations: 10 values
This memory allocation pattern differs significantly from training, where
additional memory was needed for gradients, optimizer states, and backpropa-
gation computations.
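The arithmetic behind these numbers can be written out directly; the 4-byte-per-value figure assumes 32-bit floating-point storage.

# Parameter counts for the 784 -> 100 -> 100 -> 10 network described above.
weights = 784 * 100 + 100 * 100 + 100 * 10      # 89,400
biases = 100 + 100 + 10                         # 210
activations = 100 + 100 + 10                    # intermediate values per image

bytes_per_value = 4                             # assuming 32-bit floats
parameter_bytes = bytes_per_value * (weights + biases)    # roughly 350 KB, fixed
activation_bytes = bytes_per_value * activations          # roughly 840 bytes per image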
3.6.4 Post-processing
The transformation of neural network outputs into actionable predictions re-
quires a return to traditional computing paradigms. Just as pre-processing
bridges real-world data to neural computation, post-processing bridges neural
outputs back to conventional computing systems. This completes the hybrid
computing pipeline we examined earlier, where neural and traditional comput-
ing operations work in concert to solve real-world problems.
The complexity of post-processing extends beyond simple mathematical
transformations. Real-world systems must handle uncertainty, validate out-
puts, and integrate with larger computing systems. In our MNIST example, a
digit recognition system might require not just the most likely digit, but also
confidence measures to determine when human intervention is needed. This
introduces additional computational steps: confidence thresholds, secondary
prediction checks, and error handling logic, all of which are implemented in
traditional computing frameworks.
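A minimal post-processing sketch along these lines, with an arbitrary confidence threshold chosen purely for illustration:

import numpy as np

def postprocess(probs, threshold=0.9):
    # probs: (10,) softmax output for a single image.
    digit = int(np.argmax(probs))
    confidence = float(probs[digit])
    # Low-confidence predictions are flagged for human review rather than
    # being acted on automatically.
    needs_review = confidence < threshold
    return {"digit": digit, "confidence": confidence, "needs_review": needs_review}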
The computational requirements of post-processing differ significantly from
neural network inference. While inference benefits from parallel processing and
specialized hardware, post-processing typically runs on conventional CPUs and
follows sequential logic patterns. This return to traditional computing brings
both advantages and constraints. Operations are more flexible and easier to
modify than neural computations, but they may become bottlenecks if not carefully designed.
Data collection presented the first major challenge. Unlike controlled labora-
tory environments, postal facilities needed to process mail pieces with tremen-
dous variety. The training dataset had to capture this diversity. Digits written
by people of different ages, educational backgrounds, and writing styles formed
just part of the challenge. Envelopes came in varying colors and textures, and
images were captured under different lighting conditions and orientations. This
extensive data collection effort later contributed to the creation of the MNIST
database we’ve used in our examples.
The network architecture design required balancing multiple constraints.
While deeper networks might achieve higher accuracy, they would also increase
processing time and computational requirements. Processing 28 × 28 pixel
images of individual digits needed to complete within strict time constraints
while running reliably on available hardware. The network had to maintain con-
sistent accuracy across varying conditions, from well-written digits to hurried
scrawls.
Training the network introduced additional complexity. The system needed
to achieve high accuracy not just on a test dataset, but on the endless vari-
ety of real-world handwriting styles. Careful preprocessing normalized input
images to account for variations in size and orientation. Data augmentation
techniques increased the variety of training samples. The team validated perfor-
mance across different demographic groups and tested under actual operating
conditions to ensure robust performance.
3.8 Conclusion
3.9 Resources
Slides
Videos
• Video 1
• Video 2
• Video 3
• Video 4
Exercises
Coming soon.
Chapter 4
DNN Architectures
Purpose
What recurring patterns emerge across modern deep learning architectures, and how
do these patterns enable systematic approaches to AI system design?
Deep learning architectures represent a convergence of computational pat-
terns that form the building blocks of modern AI systems. These foundational
patterns, ranging from convolutional structures to attention mechanisms, reveal
how complex models arise from simple, repeatable components. The exam-
ination of these architectural elements provides insights into the systematic
construction of flexible, efficient AI systems, establishing core principles that
influence every aspect of system design and deployment. These structural in-
sights illuminate the path toward creating scalable, adaptable solutions across
diverse application domains.
Learning Objectives
4.1 Overview
A deep learning architecture is a specific representation or organization of neural network components, the neurons, weights, and connections introduced in Chapter 3, arranged to efficiently process different types of patterns in data. While the previous chapter established the fundamental building blocks of neural networks, in this chapter we examine how these components are structured into architectures that map efficiently to computer systems.
Neural network architectures have evolved to address specific pattern process-
ing challenges. Whether processing arbitrary feature relationships, exploiting
spatial patterns, managing temporal dependencies, or handling dynamic infor-
mation flow, each architectural pattern emerged from particular computational
needs. Most often, these architectures are discussed in terms of their algorithmic structures (MLPs, CNNs, RNNs, Transformers). In this chapter, however, we take a more fundamental approach: each section analyzes how specific pattern processing needs shape algorithmic structure, and how those structures map onto computer system resources. This mapping from algorithmic requirements to computer system design involves several key considerations:
1. Memory access patterns: How data moves through the memory hierarchy
2. Computation characteristics: The nature and organization of arithmetic
operations
3. Data movement: Requirements for on-chip and off-chip data transfer
4. Resource utilization: How computational and memory resources are
allocated
For example, dense connectivity patterns generate different memory band-
width demands than localized processing structures. Similarly, stateful sequential processing imposes different constraints than purely feedforward computation.

(Figure: multi-layer perceptrons with neurons in input, hidden, and output layers connected by weighted edges.)
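A minimal sketch of the dense MLP layer computation referenced here, ending in the same H = activation(Z) step as the original listing; the names and shapes are assumptions.

def mlp_layer_compute(X, W, b, activation):
    # X: (batch, n_in) inputs; W: (n_in, n_out) weights; b: (n_out,) biases.
    # Every input feature connects to every output neuron (dense connectivity).
    Z = X @ W + b
    H = activation(Z)
    return H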
Figure 4.3: How convolutional neural networks process input through successive convolution (kernel + ReLU) and pooling stages that produce feature maps, followed by a flatten layer and a softmax activation function at the output.
The outermost loop iterates over each output channel (the k loop), which represents different learned features or patterns, our 32 different feature detectors.
The inner three loops implement the actual convolution operation at each
position. For each output value, we process a local 3 × 3 region of the input (the
dy and dx loops) across all input channels (for c loop). This creates a sliding
window effect, where the same 3 × 3 filter moves across the image, performing
multiply-accumulates between the filter weights and the local input values.
Unlike the MLP’s global connectivity, this local processing pattern means each
output value depends only on a small neighborhood of the input.
For our MNIST example with 3 × 3 filters and 32 output channels, each out-
put position requires only 9 multiply-accumulate operations per input channel,
compared to the 784 operations needed in our MLP layer. However, this oper-
ation must be repeated for every spatial position (28 × 28) and every output
channel (32).
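A minimal sketch of the nested convolution loops described above, written without padding for brevity (the 28 × 28 output quoted in the text assumes padded borders):

import numpy as np

def conv2d_compute(x, w):
    # x: (C_in, H, W) input feature maps; w: (C_out, C_in, 3, 3) filters.
    C_in, H, W = x.shape
    C_out = w.shape[0]
    out = np.zeros((C_out, H - 2, W - 2))        # no padding, so edges shrink
    for k in range(C_out):                       # each learned feature detector
        for y in range(H - 2):                   # every output row
            for xpos in range(W - 2):            # every output column
                for c in range(C_in):            # all input channels
                    for dy in range(3):          # local 3 x 3 neighborhood
                        for dx in range(3):
                            out[k, y, xpos] += x[c, y + dy, xpos + dx] * w[k, c, dy, dx]
    return out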
While using fewer operations per output, the spatial structure creates dif-
ferent patterns of memory access and computation that systems must handle
efficiently. These patterns fundamentally influence system design, creating
both challenges and opportunities for optimization, which we’ll examine next.
Sequential pattern processing requires networks to maintain an internal state, update this state based on new inputs, and learn which historical information is relevant for current predictions. Unlike MLPs and CNNs, which process fixed-size inputs, sequential processing must handle variable-length sequences while maintaining computational efficiency. This leads us to the recurrent neural network (RNN) architecture.
The hidden state is updated at each time step as

h_t = f(W_hh h_{t−1} + W_xh x_t)

where 𝑓 is a nonlinear activation, h_t corresponds to the hidden state at time 𝑡, x_t is the input at time 𝑡, W_hh contains the recurrent weights, and W_xh contains the input weights, as shown in the unfolded network structure in Figure 4.5.
For example, in processing a sequence of words, each word might be repre-
sented as a 100-dimensional vector (x𝑡 ), and we might maintain a hidden state
of 128 dimensions (h𝑡 ). At each time step, the network combines the current
input with its previous state to update its understanding of the sequence. This
creates a form of memory that can capture patterns across time steps.
This recurrent structure directly implements our requirements for sequential
processing through the introduction of recurrent connections, which maintain
internal state and allow the network to carry information forward in time.
Instead of processing all inputs independently, RNNs process sequences of
data by iteratively updating a hidden state based on the current input and
the previous hidden state, as depicted in Figure 4.5. This makes RNNs well-
suited for tasks such as language modeling, speech recognition, and time-series
forecasting.
(Figure 4.5: an RNN unfolded across time steps t−1, t, and t+1, with inputs x_t feeding hidden states that produce outputs y_t and pass information forward in time.)
This simplified view masks the underlying complexity of the nested loops and
individual computations shown in the detailed implementation (Listing 4.6).
Its actual implementation reveals a more detailed computational reality.
The nested loops in rnn_layer_compute expose the core computational pat-
tern of RNNs (see Listing 4.6). Loop 1 processes each sequence in the batch in-
dependently, allowing for batch-level parallelism. Within each batch item, Loop
2 computes how the previous hidden state influences the next state through
the recurrent weights W_hh. Loop 3 then incorporates new information from
the current input through the input weights W_xh. Finally, Loop 4 adds biases
and applies the activation function to produce the new hidden state.
For a sequence processing task with input dimension 100 and hidden state di-
mension 128, each time step requires two matrix multiplications: one 128 × 128
for the recurrent connection and one 100 × 128 for the input projection. While
individual time steps can process in parallel across batch elements, the time
steps themselves must process sequentially. This creates a unique computa-
tional pattern that systems must handle efficiently.
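A minimal sketch of rnn_layer_compute consistent with the four loops just described; the tanh activation and the (d_h × d_h) and (d_in × d_h) weight layouts are assumptions, since Listing 4.6 is not reproduced here in full.

import numpy as np

def rnn_layer_compute(x_t, h_prev, W_hh, W_xh, b):
    # x_t: (B, d_in) inputs at time t; h_prev: (B, d_h) previous hidden states.
    B, d_h = h_prev.shape
    h_t = np.zeros((B, d_h))
    for i in range(B):                          # Loop 1: each sequence in the batch
        for j in range(d_h):
            acc = 0.0
            for k in range(d_h):                # Loop 2: recurrent term via W_hh
                acc += h_prev[i, k] * W_hh[k, j]
            for k in range(x_t.shape[1]):       # Loop 3: input term via W_xh
                acc += x_t[i, k] * W_xh[k, j]
            h_t[i, j] = np.tanh(acc + b[j])     # Loop 4: bias and activation
    return h_t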
Attention(Q, K, V) = softmax(QK^T / √d_k) V
Figure 4.6: Transformer architectures "attend" to, or identify, pairwise relationships between subwords in a sequence, shown here for layer 4, head 2 on the sentence "The student didn't finish the homework because they were tired."
Memory Requirements. Each attention layer must store, for each sequence in the batch, three sets of projection matrices for queries, keys, and values (each sized 𝑑 × 𝑑), and input and output feature maps of size 𝑁 × 𝑑. The dynamic generation of attention weights for every input creates a memory access pattern where intermediate attention weights become a significant factor in memory usage.
Computation Needs. Computation needs in attention mechanisms center
around two main phases: generating attention weights and applying them
to values. For each attention layer, the system performs substantial multiply-
accumulate operations across multiple computational stages. The query-key
interactions alone require 𝑁 × 𝑁 × 𝑑 multiply-accumulates, with an equal num-
ber needed for applying attention weights to values. Additional computations
are required for the projection matrices and softmax operations. This computa-
tional pattern differs from previous architectures due to its quadratic scaling
with sequence length and the need to perform fresh computations for each
input.
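These counts can be written out directly; the sequence length and model dimension below are arbitrary example values.

# Rough multiply-accumulate (MAC) counts for one attention layer.
N, d = 512, 768                  # example sequence length and model dimension

projections = 3 * N * d * d      # computing Q, K, and V
scores = N * N * d               # query-key interactions (QK^T)
weighted_values = N * N * d      # applying attention weights to the values

total_macs = projections + scores + weighted_values
attention_entries = N * N        # intermediate attention weights held in memory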
Data Movement. Data movement in attention mechanisms presents unique
challenges. Each attention operation involves projecting and moving query,
key, and value vectors for each position, storing and accessing the full attention
weight matrix, and coordinating the movement of value vectors during the
weighted combination phase. This creates a data movement pattern where
intermediate attention weights become a major factor in system bandwidth
requirements. Unlike the more predictable access patterns of CNNs or the
sequential access of RNNs, attention operations require frequent movement of
dynamically computed weights across the memory hierarchy.
These distinctive characteristics of attention mechanisms in terms of memory,
computation, and data movement have significant implications for system de-
sign and optimization, setting the stage for the development of more advanced
architectures like Transformers.
SelfAttention(X) = softmax(X W_Q (X W_K)^T / √d_k) X W_V
(Figure: the Transformer encoder-decoder architecture, in which input and output embeddings with positional encodings feed encoder and decoder stacks containing feed-forward blocks, producing output probabilities; the decoder consumes outputs shifted right.)
Third, self-attention is typically computed with multiple attention heads in parallel, which allows the model to capture different types of relationships within the same input, enhancing the model's representational power.
Fourth, the core computations in self-attention are dominated by large ma-
trix multiplications. For a sequence of length 𝑁 and embedding dimension 𝑑,
the main operations involve matrices of sizes (𝑁 × 𝑑), (𝑑 × 𝑑), and (𝑁 × 𝑁 ).
These intensive matrix operations are well-suited for acceleration on special-
ized hardware like GPUs, but they also contribute significantly to the overall
computational cost of the model.
Finally, self-attention generates memory-intensive intermediate results. The
attention weights matrix (𝑁 ×𝑁 ) and the intermediate results for each attention
head create substantial memory requirements, especially for long sequences.
This can pose challenges for deployment on memory-constrained devices and
necessitates careful memory management in implementations.
These computational patterns create a unique profile for Transformer self-
attention, distinct from previous architectures. The parallel nature of the com-
putations makes Transformers well-suited for modern parallel processing hard-
ware, but the quadratic complexity with sequence length poses challenges
for processing long sequences. As a result, much research has focused on reducing this quadratic cost through more efficient attention variants. The listing below sketches the self-attention and multi-head attention computations described above.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention_layer(X, W_Q, W_K, W_V, d_k):
    # Project the input sequence into query, key, and value representations.
    Q = np.matmul(X, W_Q)
    K = np.matmul(X, W_K)
    V = np.matmul(X, W_V)
    # Scaled dot-product attention: softmax(QK^T / sqrt(d_k)) V
    scores = np.matmul(Q, K.T) / np.sqrt(d_k)   # (N, N) attention scores
    weights = softmax(scores, axis=-1)          # (N, N) attention weights
    output = np.matmul(weights, V)              # weighted combination of values
    return output

def multi_head_attention(X, W_Q, W_K, W_V, W_O, num_heads, d_k):
    outputs = []
    for i in range(num_heads):
        head_output = self_attention_layer(X, W_Q[i], W_K[i], W_V[i], d_k)
        outputs.append(head_output)
    # Concatenate the heads and apply the output projection W_O.
    final_output = np.matmul(np.concatenate(outputs, axis=-1), W_O)
    return final_output
These building blocks and their evolution provide insight into modern ar-
chitectures. What began with the simple perceptron (Rosenblatt 1958) evolved
into multi-layer networks (Rumelhart, Hinton, and Williams 1986), which then
spawned specialized patterns for spatial and sequential processing. Each ad-
vancement maintained useful elements from its predecessors while introducing
new computational primitives. Today’s sophisticated architectures, like Trans-
formers, can be seen as carefully engineered combinations of these fundamental
building blocks.
This progression reveals not just the evolution of neural networks, but also the
discovery and refinement of core computational patterns that remain relevant.
As we have seen through our exploration of different neural network archi-
tectures, deep learning has evolved significantly, with each new architecture
bringing its own set of computational demands and system-level challenges.
Table 4.1 summarizes this evolution, highlighting the key primitives and
system focus for each era of deep learning development. This table encapsulates
the major shifts in deep learning architecture design and the corresponding
changes in system-level considerations. From the early focus on dense matrix
operations optimized for CPUs, we see a progression through convolutions
leveraging GPU acceleration, to sequential operations necessitating sophisti-
cated memory hierarchies, and finally to the current era of attention mechanisms
requiring flexible accelerators and high-bandwidth memory.
Table 4.1: Evolution of deep learning architectures and their system implications.

| Era | Dominant Architecture | Key Primitives | System Focus |
| --- | --- | --- | --- |
| Early NN | MLP | Dense Matrix Ops | CPU optimization |
| CNN Revolution | CNN | Convolutions | GPU acceleration |
| Sequence Modeling | RNN | Sequential Ops | Memory hierarchies |
| Attention Era | Transformer | Attention, Dynamic Compute | Flexible accelerators, high-bandwidth memory |
As we dive deeper into each of these building blocks, we see how these
primitives evolved and combined to create increasingly powerful and complex
neural network architectures.
Residual (skip) connections, for example, add a block's input directly to its output, computing F(x) + x rather than F(x) alone.
This composition of building blocks creates something greater than the sum
of its parts. The self-attention mechanism, while building on previous attention
concepts, enables a new form of dynamic pattern processing. The arrangement
of these components, attention followed by feedforward layers, with skip con-
nections and normalization, has proven so effective it’s become a template for
new architectures.
Even recent innovations in vision and language models follow this pattern
of recombining fundamental building blocks. Vision Transformers adapt the
Transformer architecture to images while maintaining its essential compo-
nents (Dosovitskiy et al. 2021). Large language models scale up these patterns
while introducing refinements like grouped-query attention or sliding window
attention, yet still rely on the core building blocks established through this architectural evolution (T. B. Brown, Mann, Ryder, Subbiah, Kaplan, et al. 2020).
2020).
To illustrate how these modern architectures synthesize and innovate upon
previous approaches, consider the following comparison of primitive utilization
across different neural network architectures:
4.7.1 Core Computational Primitives

Three fundamental operations serve as the building blocks for all deep learning computations: matrix multiplication, sliding window operations8, and dynamic computation9. What makes these operations primitive is that they cannot be further decomposed without losing their essential computational properties and efficiency characteristics.

8 A technique in signal processing and computer vision where a window moves across data, computing results from subsets, essential in CNNs.
9 Computational processes where the operations adjust based on input data, used prominently in machine learning models like the Transformer.

Matrix multiplication represents the most basic form of transforming sets of features. When we multiply a matrix of inputs by a matrix of weights, we're computing weighted combinations, which is the fundamental operation of neural networks. For example, in our MNIST network, each 784-dimensional input vector multiplies with a 784 × 100 weight matrix. This pattern appears everywhere: MLPs use it directly for layer computations, CNNs reshape convolutions into matrix multiplications (turning a 3 × 3 convolution into a matrix operation, as illustrated in Figure 4.11), and Transformers use it extensively in their attention mechanisms.
Figure 4.11: Depiction of how im2col can map a convolution into a dense matrix multiplication for better efficiency.
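To make the im2col idea in Figure 4.11 concrete, the following sketch (a simplified NumPy illustration, not taken from the original text; the 6 × 6 input and 3 × 3 kernel are arbitrary) unrolls each 3 × 3 patch into a row so the convolution reduces to a single dense matrix multiplication.

import numpy as np

def im2col(x, k=3):
    # Unroll each k x k patch of a 2D input into one row of a matrix.
    H, W = x.shape
    out_h, out_w = H - k + 1, W - k + 1
    cols = np.empty((out_h * out_w, k * k))
    for i in range(out_h):
        for j in range(out_w):
            cols[i * out_w + j] = x[i:i + k, j:j + k].ravel()
    return cols

# A 3 x 3 convolution over a 6 x 6 input becomes one GEMM call.
x = np.arange(36, dtype=np.float64).reshape(6, 6)
kernel = np.random.randn(3, 3)
patches = im2col(x, k=3)              # shape (16, 9): one row per output position
out = (patches @ kernel.ravel()).reshape(4, 4)   # 4 x 4 output feature map

Real frameworks apply the same idea with batching, channels, strides, and padding, typically dispatching the resulting matrix multiplication to highly optimized GEMM libraries.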
Table 4.3: DNN architecture complexity. Note that for RNNs, parameter storage is bounded by O(N × h) when N > h.

Architecture | Input Dependency | Parameter Storage | Activation Storage | Scaling Behavior
MLP | Linear | O(N × W) | O(B × W) | Predictable
CNN | Constant | O(K × C) | O(B × H_img × W_img) | Efficient
RNN | Linear | O(h²) | O(B × T × h) | Challenging
Transformer | Quadratic | O(N × d) | O(B × N²) | Problematic
Where:
• 𝑁: Input or sequence size
• 𝑊: Layer width
• 𝐵: Batch size
• 𝐾: Kernel size
• 𝐶: Number of channels
• 𝐻img : Height of input feature map (CNN)
• 𝑊img : Width of input feature map (CNN)
• ℎ: Hidden state size (RNN)
• 𝑇: Sequence length
• 𝑑: Model dimensionality
Table 4.3 reveals how memory requirements scale with different architec-
tural choices. The quadratic scaling of activation storage in Transformers, for
instance, highlights the need for large memory capacities and efficient mem-
ory management in systems designed for Transformer-based workloads. In
contrast, CNNs exhibit more favorable memory scaling due to their parameter
sharing and localized processing. These memory complexity considerations are
crucial when making system-level design decisions, such as choosing memory
hierarchy configurations and developing memory optimization strategies.
The impact of these patterns becomes clearer when we consider data reuse
opportunities. In CNNs, each input pixel participates in multiple convolution
windows (typically 9 times for a 3 × 3 filter), making effective data reuse funda-
mental for performance. Modern GPUs provide multi-level cache hierarchies
(L1, L2, shared memory) to capture this reuse, while software techniques like
loop tiling ensure data remains in cache once loaded.
Working set size, the amount of data needed simultaneously for computa-
tion, varies dramatically across architectures. An MLP layer processing MNIST
images might need only a few hundred KB (weights plus activations), while
a Transformer processing long sequences can require several MB just for stor-
ing attention patterns. These differences directly influence hardware design
choices, like the balance between compute units and on-chip memory, and soft-
ware optimizations like activation checkpointing or attention approximation
techniques.
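A rough back-of-envelope estimate makes these magnitudes concrete; the layer and sequence sizes below are illustrative assumptions rather than figures from the text.

# Rough working-set estimates in bytes, assuming float32 (4 bytes per value).
BYTES = 4

# MLP layer on MNIST: 784 inputs -> 100 hidden units, batch of 32 (illustrative).
mlp_weights = 784 * 100 * BYTES                 # ~314 KB of weights
mlp_activations = 32 * (784 + 100) * BYTES      # ~113 KB of activations
print(f"MLP working set  ~ {(mlp_weights + mlp_activations) / 1e3:.0f} KB")

# Transformer attention: a single N x N score matrix at sequence length 1024.
attn_scores = 1024 * 1024 * BYTES               # ~4 MB for one attention map
print(f"One attention map ~ {attn_scores / 1e6:.0f} MB")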
Having a good grasp of these memory access patterns is essential as archi-
tectures evolve. The shift from CNNs to Transformers, for instance, has driven
the development of hardware with larger on-chip memories and more sophisti-
cated caching strategies to handle increased working sets and more dynamic
access patterns. Future architectures will likely continue to be shaped by their
memory access characteristics as much as their computational requirements.
4.8 Conclusion
4.9 Resources
Slides
• Coming soon.
Videos
• Coming soon.
Exercises
• Coming soon.
Chapter 5
AI Workflow
Purpose
What are the diverse elements of AI systems, and how do we combine them to create
effective machine learning system solutions?
The creation of practical AI solutions requires the orchestration of multiple
components into coherent workflows. Workflow design highlights the connec-
tions and interactions that animate these components. This systematic perspec-
tive reveals how data flow, model training, and deployment considerations are
intertwined to form robust AI systems. Analyzing these interconnections offers
important insights into system-level design choices, establishing a framework
for understanding how theoretical concepts can be translated into deployable
solutions that meet real-world needs.
Learning Objectives
5.1 Overview
The machine learning lifecycle is a systematic, interconnected process that
guides the transformation of raw data into actionable models deployed in real-
world applications. Each stage builds upon the outcomes of the previous one,
creating an iterative cycle of refinement and improvement that supports robust,
scalable, and reliable systems.
Figure 5.2 illustrates the lifecycle as a series of stages connected through
continuous feedback loops. The process begins with data collection, which
ensures a steady input of raw data from various sources. The collected data
progresses to data ingestion, where it is prepared for downstream machine
learning applications. Subsequently, data analysis and curation involve inspect-
ing and selecting the most appropriate data for the task at hand. Following this,
data labeling and data validation, which nowadays involves both humans and
AI itself, ensure that the data is properly annotated and verified for usability
before advancing further.
Figure 5.2: The machine learning lifecycle as a series of connected stages: ML-ready datasets feed model training, trained models are evaluated against KPIs, the ML system is validated for deployment, and the validated system is deployed to production, with online performance feeding back into the cycle.
The data then enters the preparation stage, where it is transformed into
machine learning-ready datasets through processes such as splitting and ver-
sioning. These datasets are used in the model training stage, where machine
learning algorithms are applied to create candidate models.
5.1.1 Definition
The machine learning (ML) lifecycle is a structured, iterative process that guides
the development, evaluation, and continual improvement of machine learning
systems. Integrating ML into broader software engineering practices introduces
challenges that differ from those of traditional software development.
These differences underline the need for a robust ML lifecycle framework that
can accommodate iterative development, dynamic behavior, and data-driven
decision-making. This lifecycle ensures that machine learning systems remain
effective not only at launch but throughout their operational lifespan, even as
environments evolve.
Figure 5.3: ML lifecycle overview: Problem Definition, Data Collection & Preparation, Model Development & Training, Evaluation & Validation, Deployment & Integration, and Monitoring & Maintenance, connected by a feedback loop.
Each of these stages plays a distinct role in the lifecycle. This chapter focuses on the overview, with subsequent chapters diving
into the implementation aspects of each stage.
Problem Definition and Requirements: The first stage involves clearly defin-
ing the problem to be solved, establishing measurable performance objectives,
and identifying key constraints. Precise problem definition ensures alignment
between the system’s goals and the desired outcomes.
Data Collection and Preparation: This stage includes gathering relevant
data, cleaning it, and preparing it for model training. This process often in-
volves curating diverse datasets, ensuring high-quality labeling, and developing
preprocessing pipelines to address variations in the data.
Model Development and Training: In this stage, researchers select appro-
priate algorithms, design model architectures, and train models using the
prepared data. Success depends on choosing techniques suited to the problem
and iterating on the model design for optimal performance.
Evaluation and Validation: Evaluation involves rigorously testing the model’s
performance against predefined metrics and validating its behavior in different
scenarios. This stage ensures the model is not only accurate but also reliable
and robust in real-world conditions.
Deployment and Integration: Once validated, the trained model is inte-
grated into production systems and workflows. This stage requires addressing
practical challenges such as system compatibility, scalability, and operational
constraints.
Monitoring and Maintenance: The final stage focuses on continuously mon-
itoring the system’s performance in real-world environments and maintaining
or updating it as necessary. Effective monitoring ensures the system remains
relevant and accurate over time, adapting to changes in data, requirements, or
external conditions.
A Case Study in Medical AI: To further ground our discussion on these
stages, we will explore Google’s Diabetic Retinopathy (DR) screening project as
a case study. This project exemplifies the transformative potential of machine
learning in medical imaging analysis, an area where the synergy between algo-
rithmic innovation and robust systems engineering plays a pivotal role. Building
upon the foundational work by Gulshan et al. (2016), which demonstrated the
effectiveness of deep learning algorithms in detecting diabetic retinopathy from
retinal fundus photographs, the project progressed from research to real-world
deployment, revealing the complex challenges that characterize modern ML
systems.
Diabetic retinopathy, a leading cause of preventable blindness worldwide,
can be detected through regular screening of retinal photographs. Figure 5.4
illustrates examples of such images: (A) a healthy retina and (B) a retina with
diabetic retinopathy, marked by hemorrhages (red spots). The goal is to train a
model to detect the hemorrhages.
5.3.2 Definition Workflow
Establishing clear and actionable problem definitions involves a multi-step
workflow that bridges technical, operational, and user considerations. The
process begins with identifying the core objective of the system—what tasks
it must perform and what constraints it must satisfy. Teams collaborate with
stakeholders to gather domain knowledge, outline requirements, and anticipate
challenges that may arise in real-world deployment.
2. Hard Exudates: Deposits of lipids or fats indicative of leakage from impaired retinal blood vessels.
In the DR project, this phase involved close collaboration with clinicians to
determine the diagnostic needs of rural clinics. Key decisions, such as balancing
model complexity with hardware limitations and ensuring interpretability
for healthcare providers, were made during this phase. The team’s iterative
approach also accounted for regulatory considerations, such as patient privacy
and compliance with healthcare standards. This collaborative process ensured
that the problem definition aligned with both technical feasibility and clinical
relevance.
protocols all play critical roles in ensuring that collected data aligns with both
technical and operational goals.
These proactive measures ensured that low-quality data was not propagated through
the pipeline.
Validation systems extended these efforts by verifying not just image quality
but also proper labeling, patient association, and compliance with privacy reg-
ulations. Operating at both local and centralized levels, these systems ensured
data reliability and robustness, safeguarding the integrity of the entire ML
pipeline.
Figure 5.5: Feedback loops and dependencies between stages in the ML lifecycle, including model updates, deployment constraints flowing from deployment to model training, and performance insights flowing back from deployment.
Figure 5.5 illustrates the key feedback loops that characterize the ML lifecycle,
with particular relevance to data collection and preparation. Looking at the
left side of the diagram, we see how monitoring and maintenance activities
feed back to both data collection and preparation stages. For example, when
monitoring reveals data quality issues in production (shown by the “Data Qual-
ity Issues” feedback arrow), this triggers refinements in our data preparation
pipelines. Similarly, performance insights from deployment might highlight
gaps in our training data distribution (indicated by the “Performance Insights”
loop back to data collection), prompting the collection of additional data to
cover underrepresented cases. In the DR project, this manifested when mon-
itoring revealed that certain demographic groups were underrepresented in
the training data, leading to targeted data collection efforts to improve model
fairness and accuracy across all populations.
Feedback loops are another critical aspect of this lifecycle perspective. In-
sights from model performance often lead to adjustments in data collection
strategies, creating an iterative improvement process. For example, in the DR
project, patterns observed during model evaluation influenced updates to pre-
processing pipelines, ensuring that new data aligned with the system’s evolving
requirements.
Model development was inherently iterative: each cycle, whether adjusting DNN
architectures, refining hyperparameters, or incorporating new data, produced
extensive metadata, including checkpoints, validation results, and performance
metrics. Managing this information across the team required robust tools for
experiment tracking and version control to ensure that progress remained
organized and reproducible.
5.6 Deployment
Once validated, the trained model is integrated into production systems and
workflows. Deployment requires addressing practical challenges such as system
compatibility, scalability, and operational constraints. Successful integration
hinges on ensuring that the model’s predictions are not only accurate but also
actionable in real-world settings, where resource limitations and workflow
disruptions can pose significant barriers.
In the DR project, deployment strategies were shaped by the diverse envi-
ronments in which the system would operate. Edge deployment enabled local
processing of retinal images in rural clinics with intermittent connectivity, while
automated quality checks flagged poor-quality images for recapture, ensuring
reliable predictions. These measures demonstrate how deployment must bridge
technological sophistication with usability and scalability across varied clinical
settings.
for edge cases or non-critical diagnostics. These behaviors, which were not
predicted during development, necessitated adjustments to both the system’s
operational focus and its training programs.
Deployment introduces significant resource dependencies. Running ML
models on edge devices required balancing computational efficiency with accu-
racy, while ensuring other clinic operations were not disrupted. These trade-offs
extended to the broader system, influencing everything from hardware require-
ments to scheduling updates without affecting clinical workflows.
The boundaries between deployment and other lifecycle stages are fluid.
Optimization efforts for edge devices often overlapped with model develop-
ment, while training programs for clinicians fed directly into monitoring and
maintenance. Navigating these overlaps required clear communication and col-
laboration between teams, ensuring seamless integration and ongoing system
adaptability.
By applying a systems perspective to deployment and integration, we can
better anticipate challenges, design robust solutions, and maintain the flexibility
needed to adapt to evolving operational and technical demands. This approach
ensures that ML systems not only achieve initial success but remain effective
and reliable in real-world applications.
5.7 Maintenance
Monitoring and maintenance represent the ongoing, critical processes that
ensure the continued effectiveness and reliability of deployed machine learning
systems. Unlike traditional software, ML systems must account for shifts in data
distributions, changing usage patterns, and evolving operational requirements.
Monitoring provides the feedback necessary to adapt to these challenges, while
maintenance ensures the system evolves to meet new needs.
As shown in Figure 5.5, monitoring serves as a central hub for system im-
provement, generating three critical feedback loops: “Performance Insights”
flowing back to data collection to address gaps, “Data Quality Issues” triggering
refinements in data preparation, and “Model Updates” initiating retraining
when performance drifts. In the DR project, these feedback loops enabled
continuous system improvement, from identifying underrepresented patient
demographics (triggering new data collection) to detecting image quality issues
(improving preprocessing) and addressing model drift (initiating retraining).
For DR screening, continuous monitoring tracked system performance across
diverse clinics, detecting issues such as changing patient demographics or
new imaging technologies that could impact accuracy. Proactive maintenance
included plans to incorporate 3D imaging modalities like OCT, expanding the
system’s capabilities to diagnose a wider range of conditions. This highlights
the importance of designing systems that can adapt to future challenges while
maintaining compliance with rigorous healthcare regulations.
Even the system’s user interface was influenced, needing to present monitoring
data in a clear, actionable manner for clinical and technical staff alike.
5.8.1 Collaboration in AI
At the heart of any AI project is a team of data scientists. These innovative
thinkers focus on model creation, experiment with architectures, and refine
the algorithms that will become the neural networks driving insights from
data. In our DR project, data scientists were instrumental in architecting neural
networks capable of identifying retinal anomalies, advancing through iterations
to fine-tune a balance between accuracy and computational efficiency.
Behind the scenes, data engineers work tirelessly to design robust data
pipelines, ensuring that vast amounts of data are ingested, transformed, and
stored effectively. They play a crucial role in the DR project, handling data from
various clinics and automating quality checks to guarantee that the training
inputs were standardized and reliable.
Meanwhile, machine learning engineers take the baton to integrate these
models into production settings. They guarantee that models are nimble, scal-
able, and fit the constraints of the deployment environment. In rural clinics
where computational resources can be scarce, their work in optimizing models
was pivotal to enabling on-the-spot diagnosis.
Domain experts, such as ophthalmologists in the DR project, infuse tech-
nical progress with practical relevance. Their insights shape early problem
definitions and ensure that AI tools align closely with real-world needs, offer-
ing a measure of validation that keeps the outcome aligned with clinical and
operational realities.
MLOps engineers are the guardians of workflow automation, orchestrating
the continuous integration and monitoring systems that keep AI models up and
running. They crafted centralized monitoring frameworks in the DR project,
ensuring that updates were streamlined and model performance remained
optimal across different deployment sites.
Ethicists and compliance officers remind us of the larger responsibility that
accompanies AI deployment, ensuring adherence to ethical standards and legal
requirements.
5.9 Conclusion
The AI workflow we’ve explored, while illustrated through the Diabetic Retinopa-
thy project, represents a framework applicable across diverse domains of AI
application. From finance and manufacturing to environmental monitoring
and autonomous vehicles, the core stages of the workflow remain consistent,
even as their specific implementations vary widely.
The interconnected nature of the AI lifecycle, illustrated in Figure 5.5, is a
universal constant. Whether developing fraud detection systems for banks
or predictive maintenance models for industrial equipment, decisions made
in one stage invariably impact others. The feedback loops, from “Performance
Insights” driving data collection to “Validation Issues” triggering model updates,
show how data quality affects model performance, deployment constraints
influence architecture choices, and real-world usage patterns drive ongoing
refinement through these well-defined feedback paths.
This interconnectedness underscores the importance of systems thinking in
AI development across all sectors. Success in AI projects, regardless of domain,
depends as much on this systems perspective as on the strength of any individual component.
5.10 Resources
Slides
• Coming soon.
Videos
• Coming soon.
Exercises
• Coming soon.
Chapter 6
Data Engineering
Purpose
How does data shape ML systems engineering?
In the field of machine learning, data engineering is often overshadowed
by the allure of sophisticated algorithms, when in fact data plays a founda-
tional role in determining an AI system’s capabilities and limitations. We need
to understand the core principles of data in ML systems, exploring how the
acquisition, processing, storage, and governance of data directly impact the
performance, reliability, and ethical considerations of AI systems. By under-
standing these fundamental concepts, we can unlock the true potential of AI
and build a solid foundation of high-quality ML solutions.
Learning Objectives
6.1 Overview
Data is the foundation of modern machine learning systems, as success is gov-
erned by the quality and accessibility of training and evaluation data. Despite
its pivotal role, data engineering is often overlooked compared to algorithm
design and model development. However, the effectiveness of any machine
learning system hinges on the robustness of its data pipeline. As machine
learning applications become more sophisticated, the challenges associated
with curating, cleaning, organizing, and storing data have grown significantly.
These activities have emerged as some of the most resource-intensive aspects
of the data engineering process, requiring sustained effort and attention.
early in the pipeline to avoid downstream issues and ensure the effectiveness
of machine learning systems.
Figure: Impacts of data cascades compounding across the lifecycle stages, from problem statement and data collection through data analysis and cleaning, model training, model evaluation, and model deployment.
and system latency, which offer measurable outcomes to gauge progress and
success.
Throughout this process, engaging with stakeholders, including end-users
and business leaders, provides invaluable insights that ensure the project re-
mains aligned with real-world needs and expectations.
In particular, a cardinal sin in ML is to begin collecting data (or augmenting an
existing dataset) without clearly specifying the underlying problem definition
to guide the data collection. We identify the key steps that should precede any
data collection effort here:
1. Identify and clearly state the problem definition
2. Set clear objectives to meet
3. Establish success benchmarks
4. Understand end-user engagement/use
5. Understand the constraints and limitations of deployment
6. Perform data collection
7. Iterate and refine
• Processing Power: The computational capabilities of embedded devices are limited (a few hundred MHz of clock speed), so the KWS model must fit within a tight compute budget.
project’s lifecycle. By carefully considering each aspect, from core problem iden-
tification, through performance benchmarks, to deployment constraints, teams
can build a strong foundation for their ML systems. The methodical problem
definition process provides a framework applicable across the ML spectrum.
Whether developing computer vision systems for medical diagnostics, recom-
mendation engines processing millions of user interactions, or natural language
models analyzing diverse text corpora, this structured approach helps teams
anticipate and plan for their data needs.
This brings us to data pipelines, the foundational infrastructure that trans-
forms raw data into ML-ready formats, while maintaining quality and reliabil-
ity throughout the process. These pipelines implement our carefully defined
requirements in production systems, handling everything from initial data
ingestion to final feature generation.
Figure: A typical ML data pipeline, spanning data ingestion, a processing layer, data labeling, feature creation and engineering, a storage layer, and ML training, with data governance applied across all stages.
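As a minimal illustration of this idea (the stage names and helper functions below are hypothetical placeholders rather than any specific framework's API), a pipeline can be modeled as an ordered list of stages that each transform a batch of records:

from dataclasses import dataclass
from typing import Callable, List

@dataclass
class PipelineStage:
    name: str
    run: Callable[[list], list]

def ingest(records: list) -> list:
    # Pull raw records from a source (stubbed here with a pass-through).
    return records

def validate(records: list) -> list:
    # Drop records missing required fields to avoid downstream issues.
    return [r for r in records if "value" in r]

def featurize(records: list) -> list:
    # Derive a simple feature from each raw value.
    return [{**r, "feature": r["value"] * 2} for r in records]

pipeline: List[PipelineStage] = [
    PipelineStage("ingestion", ingest),
    PipelineStage("validation", validate),
    PipelineStage("feature_engineering", featurize),
]

data = [{"value": 1}, {"bad": True}, {"value": 3}]
for stage in pipeline:
    data = stage.run(data)
print(data)  # [{'value': 1, 'feature': 2}, {'value': 3, 'feature': 6}]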
Figure 6.6: Training different models on the same dataset propagates shared limitations, including propagated biases, dataset blind spots, common weaknesses, and systemic issues.
product images for recognition systems or social media platforms for computer
vision applications. MIT's LabelMe project demonstrated this approach's
potential early on, scraping Flickr to create a diverse dataset of over 63,000
annotated images.
The impact of web scraping extends well beyond computer vision systems. In
natural language processing, web-scraped data has enabled the development of
increasingly sophisticated ML systems. Large language models, such as Chat-
GPT and Claude, rely on vast amounts of text scraped from the public internet
and media to learn language patterns and generate responses (Groeneveld et
al. 2024). Similarly, specialized ML systems like GitHub’s Copilot demonstrate
how targeted web scraping, in this case of code repositories, can create powerful
domain-specific assistants (M. Chen et al. 2021).
Production ML systems often require continuous data collection to main-
tain relevance and performance. Web scraping facilitates this by gathering
structured data like stock prices, weather patterns, or product information for
analytical applications. However, this continuous collection introduces unique
challenges for ML systems. Data consistency becomes crucial, as variations
in website structure or content formatting can disrupt the data pipeline and
affect model performance. Proper data management through databases or
warehouses becomes essential not just for storage, but for maintaining data
quality and enabling model updates.
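A minimal polite collection loop might look like the sketch below; the URLs and CSS selector are placeholders, and the example assumes the widely used requests and BeautifulSoup libraries. Production collectors would add robots.txt checks, retries, schema validation, and provenance logging.

import time
import requests
from bs4 import BeautifulSoup

PRODUCT_PAGES = [
    "https://example.com/products?page=1",   # placeholder URLs
    "https://example.com/products?page=2",
]

records = []
for url in PRODUCT_PAGES:
    resp = requests.get(url, timeout=10,
                        headers={"User-Agent": "dataset-builder/0.1"})
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")
    for item in soup.select(".product"):      # assumed CSS class on the page
        records.append({
            "title": item.get_text(strip=True),
            "source_url": url,                # keep provenance for documentation
        })
    time.sleep(1.0)  # rate-limit politely to avoid overloading the host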
Despite its utility, web scraping presents several challenges that ML system
developers must carefully consider. Legal and ethical constraints can limit data
collection, as not all websites permit scraping, and violating these restrictions
can have serious consequences. When building ML systems with scraped
data, teams must carefully document data sources and ensure compliance
with terms of service and copyright laws. Privacy considerations become
particularly critical when dealing with user-generated content, often requiring
robust anonymization procedures.
Technical limitations also affect the reliability of web-scraped training data.
Rate limiting by websites can slow data collection, while the dynamic nature
of web content can introduce inconsistencies that impact model training. As
shown in Figure 6.7, web scraping can yield unexpected or irrelevant data, for
example, historical images appearing in contemporary image searches, that
can pollute training datasets and degrade model performance. These issues
highlight the importance of thorough data validation and cleaning processes.
Generalization reduces the precision of shared data: instead of exact values such as a specific age or
address, the data is aggregated into broader categories (e.g., age ranges, zip
code prefixes). For example, a user's exact age of 37 might be generalized
to an age range of 30-39, while their exact address might be bucketed to a
city-level granularity. This technique clearly reduces the risk of identifying an
individual when data is shared in aggregated form; however, it can also reduce
analytical and predictive value. Furthermore, if the granularity is not chosen correctly,
individuals may still be identifiable under certain conditions.
Pseudonymization is the process of replacing direct identifiers (like names,
Social Security numbers, or email addresses) with artificial identifiers, or
“pseudonyms.” These pseudonyms must not reveal, or be easily traceable
to, the original data subject. This is commonly used in health records or in any
situation where datasets need personal identities removed, but maintain unique
entries. This approach allow maintaining individual-level data for analysis
(since records can be traced through pseudonyms), while reducing the risk of
direct identification. However, if the “key” linking the pseudonym to the real
identifier is compromised, re-identification becomes possible.
𝑘-anonymity ensures that each record in a dataset is indistinguishable from
at least 𝑘 − 1 other records. This is achieved by suppressing or generalizing
quasi-identifiers, or attributes that, in combination, could be used to re-identify
an individual (e.g., zip code, age, gender). For example, if 𝑘 = 5, every record
in the dataset must share the same combination of quasi-identifiers with at
least four other records. Thus, an attacker cannot pinpoint a single individual
simply by looking at these attributes. This approach provides a formal privacy
guarantee that helps reduce the chance of individual re-identification. However,
achieving it can require significant manual effort and data distortion,
and it does not protect against homogeneity or background-knowledge
attacks.
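As a quick illustration, k-anonymity can be checked by counting how many records share each quasi-identifier combination. This is only a sketch that assumes a pandas DataFrame with the column names shown below.

import pandas as pd

def is_k_anonymous(df: pd.DataFrame, quasi_identifiers: list, k: int) -> bool:
    # True if every quasi-identifier combination appears at least k times.
    group_sizes = df.groupby(quasi_identifiers).size()
    return bool((group_sizes >= k).all())

df = pd.DataFrame({
    "zip_prefix": ["021", "021", "021", "945", "945"],
    "age_range":  ["30-39", "30-39", "30-39", "40-49", "40-49"],
    "diagnosis":  ["A", "B", "A", "C", "A"],   # sensitive attribute, not a quasi-identifier
})
print(is_k_anonymous(df, ["zip_prefix", "age_range"], k=2))  # True
print(is_k_anonymous(df, ["zip_prefix", "age_range"], k=3))  # False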
Differential privacy (DP) adds carefully calibrated “noise” or randomized
data perturbations to query results or datasets. The goal is to ensure that the
inclusion or exclusion of any single individual’s data does not significantly
affect the output, thereby concealing their presence. Introduced noise is con-
trolled by the 𝜖 parameter in 𝜖-Differential Privacy, balancing data utility and
privacy guarantees. The clear advantages this approach provides are strong
mathematical guarantees of privacy, and DP is widely used in academic and
industrial settings (e.g., large-scale data analysis). However, the added noise
can affect data accuracy and subsequent model performance; proper parameter
tuning is crucial to ensure both privacy and usefulness.
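A minimal sketch of the Laplace mechanism shows the core idea: noise is drawn with scale proportional to the query's sensitivity divided by ε. Real deployments additionally track cumulative privacy budgets and clip contributions, which this toy example omits.

import numpy as np

def laplace_mechanism(true_value: float, sensitivity: float, epsilon: float) -> float:
    # Return a differentially private version of a numeric query result.
    scale = sensitivity / epsilon          # larger epsilon -> less noise, weaker privacy
    return true_value + np.random.laplace(loc=0.0, scale=scale)

# Example: a counting query (sensitivity 1) over a private dataset.
exact_count = 412
for eps in (0.1, 1.0, 10.0):
    print(eps, round(laplace_mechanism(exact_count, sensitivity=1.0, epsilon=eps), 1))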
In summary, effective data anonymization is a balancing act between pri-
vacy and utility. Techniques such as masking, generalization, pseudonymiza-
tion, k-anonymity, and differential privacy each target different aspects of re-
identification risk. By carefully selecting and combining these methods, organi-
zations can responsibly derive value from sensitive datasets while respecting
the privacy rights and expectations of the individuals represented within them.
mains as well, where synthetic data can fill gaps in underrepresented scenarios
or edge cases.
In addition to expanding datasets, synthetic data addresses critical ethical
and privacy concerns. Unlike real-world data, synthetic data is designed not to
tie back to specific individuals or entities. This makes it especially useful in
sensitive domains such as finance, healthcare, or human resources, where data
confidentiality is paramount. The ability to preserve statistical properties while
removing identifying information allows researchers to maintain high ethical
standards without compromising the quality of their models. In healthcare,
privacy regulations such as GDPR⁷ and HIPAA⁸ limit the sharing of sensitive
patient information. Synthetic data generation enables the creation of realistic
yet anonymized datasets that can be used for training diagnostic models without
compromising patient privacy.
7. GDPR: General Data Protection Regulation, a legal framework that sets guidelines for the collection and processing of personal information in the EU.
8. HIPAA: Health Insurance Portability and Accountability Act, U.S. legislation that provides data privacy and security provisions for safeguarding medical information.
Poorly generated data can misrepresent underlying real-world distributions,
introducing biases or inaccuracies that degrade model performance. Validating
synthetic data against real-world benchmarks is essential to ensure its reliability.
Additionally, models trained primarily on synthetic data must be rigorously
tested in real-world scenarios to confirm their ability to generalize effectively.
Another challenge is the potential amplification of biases present in the original
datasets used to inform synthetic data generation. If these biases are not carefully
addressed, they may be inadvertently reinforced in the resulting models.
A critical consideration is maintaining a proper balance between synthetic and
real-world data during training: if models are overly trained on synthetic data,
their outputs may become nonsensical and model performance may collapse.
Synthetic data has revolutionized the way machine learning systems are
trained, providing flexibility, diversity, and scalability in data preparation.
However, as its adoption grows, practitioners must remain vigilant about its
limitations and ethical implications. By combining synthetic data with rigor-
ous validation and thoughtful application, machine learning researchers and
engineers can unlock its full potential while ensuring reliability and fairness in
their systems.
Figure 6.9: Key differences between Extract, Transform, Load (ETL) and Extract, Load, Transform (ELT). In ETL (E→T→L), data from multiple sources is transformed in staging tables before loading into the target; in ELT (E→L→T), raw data is loaded directly into the target (e.g., an MPP database) and transformed there into final tables.
This makes ELT a more flexible approach when addressing evolving analytical needs in ML systems.
By deferring transformations, ELT can accommodate varying uses of the
same dataset, which is particularly useful in exploratory data analysis phases
of ML projects or when multiple models with different data requirements are
being developed simultaneously. However, it's important to note that ELT
places greater demands on storage systems and query engines, which must
handle large amounts of unprocessed information.
In this model, data structure is defined at access time, not during ingestion, enabling versatile use of raw data (a pattern often described as schema-on-read).
In practice, many ML systems employ a hybrid approach, selecting ETL or
ELT on a case-by-case basis depending on the specific requirements of each data
source or ML model. For example, a system might use ETL for structured data
from relational databases where schemas are well-defined and stable, while
employing ELT for unstructured data like text or images where transformation
requirements may evolve as the ML models are refined.
store failed recognition attempts for analysis, helping identify patterns in false
negatives or system failures. Data validation becomes particularly important for
maintaining wake word detection accuracy—incoming audio must be checked
for quality issues like clipping, noise levels, and appropriate sampling rates.
For example, consider a smart home device processing the wake word “Alexa.”
The ingestion pipeline must validate:
• Audio quality metrics (signal-to-noise ratio, sample rate, bit depth)
• Recording duration (typically 1-2 seconds for wake words)
• Background noise levels
• Speaker proximity indicators
Invalid samples are routed to dead letter queues for analysis, while valid
samples are processed in real-time for wake word detection.
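A simplified version of such a validation step might look like the following sketch; the thresholds and metadata field names are illustrative assumptions rather than values from a production system.

def validate_wake_word_sample(sample: dict) -> list:
    # Return a list of validation failures for one audio sample's metadata.
    failures = []
    if sample["snr_db"] < 10:                      # assumed minimum signal-to-noise ratio
        failures.append("low_snr")
    if sample["sample_rate_hz"] < 16_000:
        failures.append("sample_rate_too_low")
    if not 1.0 <= sample["duration_s"] <= 2.0:     # typical wake-word duration window
        failures.append("bad_duration")
    if sample["clipping_ratio"] > 0.01:
        failures.append("clipped_audio")
    return failures

sample = {"snr_db": 8.2, "sample_rate_hz": 16_000,
          "duration_s": 1.3, "clipping_ratio": 0.0}
failures = validate_wake_word_sample(sample)
if failures:
    print("route to dead letter queue:", failures)   # ['low_snr']
else:
    print("forward to real-time wake word detection")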
This case study illustrates how real-world ML systems must carefully balance
different ingestion patterns, handle multiple data sources, and maintain robust
error handling—all while meeting strict latency and reliability requirements.
The lessons from KWS systems apply broadly to other ML applications requiring
real-time processing capabilities alongside continuous model improvement.
These processes ensure that data is clean, relevant, and optimally formatted for
machine learning algorithms.
Figure: ML pipeline components, including ExampleGen, Tuner, Trainer, and Pusher.
Quality assessment for KWS audio examines multiple characteristics, including signal-to-noise
ratio (SNR), audio clarity scores, and speaking rate consistency. For instance, a KWS quality assessment pipeline might automatically
flag recordings where background noise exceeds acceptable thresholds
or where the wake word is spoken too quickly or unclearly, ensuring only
high-quality samples are used for model training.
These quality metrics must be carefully calibrated to reflect real-world operat-
ing conditions. A robust training dataset incorporates both pristine recordings
and samples containing controlled levels of environmental variations. For in-
stance, while recordings with signal-masking interference are excluded, the
dataset should include samples with measured background acoustics, variable
speaker distances, and concurrent speech or other forms of audio signals. This
approach to data diversity ensures the model maintains wake word detection
reliability across the full spectrum of deployment environments and acoustic
conditions.
Once quality is assured, transforming audio data for KWS involves converting
raw waveforms into formats suitable for ML models. The typical transformation
pipeline converts audio signals into spectrograms¹² or mel-frequency cepstral
coefficients (MFCCs)¹³, standardizing the representation across different recording
conditions. This transformation must be consistently applied across both
training and inference, often with additional considerations for real-time
processing on edge devices.
12. Spectrogram: A visual representation of the spectrum of frequencies in a signal as it varies over time, commonly used in audio processing.
13. Mel-Frequency Cepstral Coefficients (MFCCs): Features extracted from audio signals that represent the short-term power spectrum, widely used in speech and audio analysis.
Figure 6.11 illustrates this transformation process. The top panel is a raw
waveform of a simulated audio signal, which consists of a sine wave mixed
with noise. This time-domain representation highlights the challenges posed
by real-world recordings, where noise and variability must be addressed. The
middle panel shows the spectrogram of the signal, which maps its frequency
content over time. The spectrogram provides a detailed view of how energy is
distributed across frequencies, making it easier to analyze patterns that could
influence wake word recognition, such as the presence of background noise
or signal distortions. The bottom panel shows the MFCCs, derived from the
spectrogram. These coefficients compress the audio information into a format
that emphasizes speech-related characteristics, making them well-suited for
KWS tasks.
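In practice, this transformation is often only a few lines with an audio library such as librosa; the sketch below assumes librosa is installed and uses a hypothetical recording file name.

import librosa

# Load roughly 1 second of audio at 16 kHz, a rate typical for KWS pipelines.
y, sr = librosa.load("wake_word.wav", sr=16_000, duration=1.0)

# Mel spectrogram: energy across mel-frequency bands over time.
mel_spec = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=40)

# MFCCs: a compact representation emphasizing speech-related structure.
mfccs = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)

print(mel_spec.shape, mfccs.shape)  # e.g., (40, frames) and (13, frames)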
With transformed data in hand, feature engineering for KWS focuses on
extracting characteristics that help distinguish wake words from background
speech. Engineers might create features capturing tonal variations, speech
energy patterns, or temporal characteristics. For the wake word “Alexa,” fea-
tures might include energy distribution across frequency bands, pitch con-
tours, and duration patterns that characterize typical pronunciations. While
hand-engineered speech features have seen much success, learned features
(Zeghidour et al. 2021) are increasingly common.
In practice, bringing all these elements together, KWS processing pipelines
must handle both batch processing for training and real-time processing for in-
ference. The pipeline typically includes stages for audio preprocessing, feature
extraction, and quality filtering. Importantly, these pipelines must be designed
to operate efficiently on edge devices while maintaining consistent processing
steps between training and deployment.
Modern machine learning systems must efficiently handle the creation, storage,
and management of labels across their data pipeline. The systems architecture
must support various labeling workflows while maintaining data
consistency, ensuring quality, and managing computational resources effectively.
These requirements compound when dealing with large-scale datasets.
The systematic challenges extend beyond just storing and managing labels.
Deep learning models rely on large numbers of parameters to capture complex data patterns; a general rule of thumb in ML is to have a training dataset that is at least 10 times larger than the model's parameter count to avoid overfitting.
6.7.4 AI in Annotation
As machine learning systems grow in scale and complexity, organizations in-
creasingly leverage AI to accelerate and enhance their labeling pipelines. This
approach introduces new system design considerations around model deploy-
ment, resource management, and human-AI collaboration. The fundamental
challenge stems from data volume. Manual annotation alone cannot keep pace
with modern ML systems’ data needs. As illustrated in Figure 6.14, AI assis-
tance offers several paths to scale labeling operations, each requiring careful
system design to balance speed, quality, and resource usage.
costs, rate limiting, and output validation. Many organizations adopt a tiered
approach, using smaller specialized models for routine cases while reserving
larger LLMs for complex scenarios.
Methods such as active learning¹⁵ complement these approaches by intelligently
prioritizing which examples need human attention (Coleman et al. 2022).
These systems continuously analyze model uncertainty to identify valuable
labeling candidates for humans to label. The infrastructure must efficiently
compute uncertainty metrics, maintain task queues, and adapt prioritization
strategies based on incoming labels. Consider a medical imaging system: active
learning might identify unusual pathologies for expert review while handling
routine cases automatically.
15. A machine learning approach where the model selects the most informative data points for labeling to improve learning efficiency.
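A minimal uncertainty-sampling loop captures the core mechanism: score unlabeled examples by predictive entropy and send the most uncertain ones to human annotators. The model outputs below are simulated, so this is a sketch of the selection logic rather than a full active learning system.

import numpy as np

def entropy(probs: np.ndarray) -> np.ndarray:
    # Per-example predictive entropy; higher means more uncertain.
    return -np.sum(probs * np.log(probs + 1e-12), axis=1)

def select_for_labeling(probs: np.ndarray, budget: int) -> np.ndarray:
    # Indices of the `budget` most uncertain unlabeled examples.
    return np.argsort(-entropy(probs))[:budget]

# Simulated model outputs for an unlabeled pool of 5 examples, 3 classes.
pool_probs = np.array([[0.98, 0.01, 0.01],
                       [0.40, 0.35, 0.25],
                       [0.70, 0.20, 0.10],
                       [0.34, 0.33, 0.33],
                       [0.90, 0.05, 0.05]])
print(select_for_labeling(pool_probs, budget=2))  # nearly uniform predictions come first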
Quality control becomes increasingly crucial as these AI components interact.
The system must monitor both AI and human performance, detect potential
errors, and maintain clear label provenance. This requires dedicated infrastruc-
ture tracking metrics like model confidence and human-AI agreement rates.
In safety-critical domains like self-driving cars, these systems must maintain
particularly rigorous standards while processing massive streams of sensor
data.
Real-world deployments demonstrate these principles at scale. Medical imag-
ing systems (Krishnan, Rajpurkar, and Topol 2022) combine pre-annotation for
common conditions with active learning for unusual cases, all while maintain-
ing strict patient privacy.
Self-driving vehicle systems coordinate multiple AI models to label diverse
sensor data in real-time. Social media platforms process millions of items
hourly, using tiered approaches where simpler models handle clear cases while
complex content routes to more sophisticated models or human reviewers.
While AI assistance offers clear benefits, it also introduces new failure modes.
Systems must guard against bias amplification, where AI models trained on
biased data perpetuate those biases in new labels. The infrastructure needs
robust monitoring to detect such issues and mechanisms to break problematic
feedback loops. Human oversight remains essential, requiring careful interface
design to help annotators effectively supervise and correct AI output.
acoustic signal that conveys more than just the words themselves. This signal
encapsulates subtle transitions between words, variations in pronunciation,
and the natural rhythm of speech. The primary challenge lies in accurately
pinpointing the exact location of each word within this continuous audio stream.
This is where automated forced alignment proves useful. Tools such as the
Montreal Forced Aligner (McAuliffe et al. 2017) analyze both the audio and
its transcription, mapping the timing relationship between written words and
spoken sounds, and attempt to mark the boundaries of when each word begins
and ends in a speech recording at millisecond-level precision. For high-resource
languages such as English, high-quality automated alignments are available
“out-of-box” while alignments for low-resource languages must be bootstrapped
on the speech data and transcriptions themselves, which can negatively impact
timing quality.
With these precise timestamps, the extraction system can generate clean,
one-second samples of individual keywords. However, this process requires
careful engineering decisions. Background noise might interfere with detecting
word boundaries. Speakers may stretch, compress, or mispronounce words
in unexpected ways. Longer words may not fit within the default 1-second
boundary. In order to aid ML practitioners in filtering out lower-quality samples
in an automated fashion, MSWC provides a self-supervised anomaly detection
algorithm, using acoustic embeddings to identify potential issues based on
embedding distances to k-means clusters. This automated validation becomes
particularly crucial given the scale of the dataset, which includes over 23 million
samples across more than 340,000 words in 50+ languages. Traditional manual
review could not maintain consistent standards across such volume without
significant expense.
Modern voice assistant developers often build upon this type of labeling
foundation. An automated corpus like MSWC may not contain the specific key-
words an application developer wishes to use for their envisioned KWS system,
but the corpus can provide a starting point for KWS prototyping in many under-
served languages spoken around the world. While MSWC provides automated
labeling at scale, production systems may add targeted human recording and
verification for challenging cases, rare words, or difficult acoustic environments.
Table 6.1: Comparative overview of the database, data warehouse, and data lake.

Attribute | Conventional Database | Data Warehouse | Data Lake
Purpose | Operational and transactional | Analytical and reporting | Storage for raw and diverse data for future processing
Data type | Structured | Structured | Structured, semi-structured, and unstructured
Scale | Small to medium volumes | Medium to large volumes | Large volumes of diverse data
Performance Optimization | Optimized for transactional queries (OLTP) | Optimized for analytical queries (OLAP) | Optimized for scalable storage and retrieval
Examples | MySQL, PostgreSQL, Oracle DB | Google BigQuery, Amazon Redshift, Microsoft Azure Synapse | Google Cloud Storage, AWS S3, Azure Data Lake Storage
A critical aspect of this phase is managing data drift, where the characteristics
of incoming data change over time. Storage systems must efficiently capture and
store incoming data along with prediction results, enabling ongoing analysis
to detect and address shifts in data distributions. This ensures that models
remain accurate and aligned with their intended use cases.
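One lightweight way to act on such stored data is a statistical drift check that compares a recent window of serving inputs against the training distribution; the sketch below uses a two-sample Kolmogorov-Smirnov test with illustrative, synthetic data.

import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
training_feature = rng.normal(loc=0.0, scale=1.0, size=5_000)   # logged at training time
serving_feature = rng.normal(loc=0.4, scale=1.0, size=5_000)    # recent production inputs

stat, p_value = ks_2samp(training_feature, serving_feature)
if p_value < 0.01:
    print(f"Possible data drift detected (KS statistic = {stat:.3f})")
else:
    print("No significant drift detected")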
The sheer volume of logging and monitoring data generated by high-traffic
ML services introduces questions of data retention and accessibility. Organiza-
tions must balance the need to retain historical data for analysis against the cost
and complexity of storing it. Strategies such as tiered storage and compression
can help manage costs while ensuring that critical data remains accessible when
needed.
Regulated industries often require immutable storage to support auditing
and compliance efforts. Storage systems designed for this purpose guarantee
data integrity and non-repudiability, ensuring that stored data cannot be altered
or deleted. Blockchain-inspired solutions and write-once-read-many (WORM)
technologies are commonly employed to meet these stringent requirements.
Figure 6.16: High-level overview of how feature stores interact with data, users, and model training and deployment. Batch and streaming data are ingested and transformed; features are defined, registered in a registry, and stored for serving. Model training fetches training datasets, model serving fetches feature vectors, data scientists search for and discover features, and monitoring tracks the store's behavior.
Figure: Data governance within an organization, encompassing data security policies, data operations, data catalogs, and data sourcing.
Techniques such as differential privacy protect individuals by adding carefully calibrated noise to the data. This ensures that individual
identities are protected while preserving the statistical patterns necessary for
model training. These techniques allow ML systems to benefit from data-driven
insights without compromising ethical considerations (Dwork, n.d.), which we
will learn more about in the Responsible AI chapter.
Regulatory compliance is a critical area where data governance plays a cen-
tral role. Laws such as the GDPR in Europe and the HIPAA in the United
States impose strict requirements on data handling. Compliance with these
regulations often involves implementing features like the ability to delete data
upon request or providing individuals with copies of their data, and a “right
to explanation” on decisions made by algorithms (Wachter, Mittelstadt, and
Russell 2017). These measures not only protect individuals but also ensure
organizations avoid legal and reputational risks.
Documentation and metadata management, which are often less discussed,
are just as important for transparency and reproducibility in ML systems. Clear
records of data lineage, including how data flows and transforms throughout
the ML pipeline, are essential for accountability. Standardized documenta-
tion frameworks, such as Data Cards proposed by Pushkarna, Zaldivar, and
Kjartansson (2022), offer a structured way to document the characteristics, limi-
tations, and potential biases of datasets. For example, as shown in Figure 6.18,
the Open Images Extended, More Inclusively Annotated People (MIAP) dataset,
uses a data card to provide detailed information about its motivations, intended
use cases, and known risks. This type of documentation enables developers to
evaluate datasets effectively and promotes responsible use.
6.10 Conclusion
Data engineering is the backbone of any successful ML system. By thoughtfully
defining problems, designing robust pipelines, and practicing rigorous data
governance, teams establish a foundation that directly influences model per-
formance, reliability, and ethical standing. Effective data acquisition strategies,
whether by utilizing existing datasets, employing web scraping techniques, or
engaging in crowdsourcing, must balance the realities of domain constraints,
privacy obligations, and labeling complexities. Likewise, decisions around data
ingestion (batch or streaming) and transformation (ETL or ELT) affect both cost
and throughput, with monitoring and observability essential to detect shifting
data quality.
Throughout this chapter, we saw how critical it is to prepare data well in
advance of modeling. Data labeling emerges as a particularly delicate phase:
it involves human effort, requires strong quality control practices, and has
ethical ramifications. Storage choices, such as relational databases, data ware-
houses, data lakes, or specialized systems, must align with both the volume
and velocity of ML workloads. Feature stores and caching strategies support
efficient retrieval across training and serving pipelines, while good data gover-
nance ensures adherence to legal regulations, protects privacy, and maintains
stakeholder trust.
All these elements interlock to create an ecosystem that reliably supplies ML
models with the high-quality data they need. When done well, data engineering
empowers teams to iterate faster, confidently deploy new features, and build
systems capable of adapting to real-world complexity. The next chapters will
build on these foundations, exploring how optimized training, robust model
operations, and security considerations together form a holistic approach to
delivering AI solutions that perform reliably and responsibly at scale.
6.11 Resources
Slides
Videos
• Coming soon.
Exercises
• Coming soon.
Chapter 7
AI Frameworks
Purpose
How do AI frameworks bridge the gap between theoretical design and practical im-
plementation, and what role do they play in enabling scalable and efficient machine
learning systems?
AI frameworks are the middleware software layer that transforms abstract
model specifications into executable implementations. The evolution of these
frameworks reveals fundamental patterns for translating high-level designs into
efficient computational workflows and system execution. Their architecture
shines light on the essential trade-offs between abstraction, performance, and
portability, providing systematic approaches to managing complexity in ma-
chine learning systems. Understanding framework capabilities and constraints
offers insights into the engineering decisions that shape system scalability,
enabling the development of robust, deployable solutions across diverse com-
puting environments.
Learning Objectives
7.1 Overview
Modern machine learning development relies fundamentally on machine learn-
ing frameworks, which are comprehensive software libraries or platforms de-
signed to simplify the development, training, and deployment of machine
learning models. These frameworks play multiple roles in ML systems, much
like operating systems are the foundation of computing systems. Just as operat-
ing systems abstract away the complexity of hardware resources and provide
standardized interfaces for applications, ML frameworks abstract the intricacies
of mathematical operations and hardware acceleration, providing standardized
APIs for ML development.
The capabilities of ML frameworks are diverse and continuously evolving.
They provide efficient implementations of mathematical operations, automatic
differentiation capabilities, and tools for managing model development, hard-
ware acceleration, and memory utilization. For production systems, they offer
standardized approaches to model deployment, versioning, and optimization.
However, due to their diversity, there is no universally agreed-upon definition
of an ML framework. To establish clarity for this chapter, we adopt the following
definition:
Framework Definition
Figure: Timeline of framework evolution (e.g., Theano introduces computational graphs).
Figure: How framework fundamentals interact: the developer interface shapes execution behavior, computational graphs structure data flow and optimize execution, data handling feeds data into core operations, and the execution and abstraction layer coordinates the whole.
Figure 7.4: Basic example of a computational graph, in which operation nodes such as f(x, y) → z are connected by data flow edges and interact with memory management.
transform its inputs into outputs. However, layers add an extra dimension:
they maintain and update internal parameters during training. For example, a
convolutional layer not only specifies how to perform convolution operations
but also learns and stores the optimal convolution filters for a given task.
Frameworks like TensorFlow and PyTorch leverage this abstraction to simplify
model implementation. When a developer writes tf.keras.layers.Conv2D,
the framework constructs the necessary graph nodes for convolution opera-
tions, parameter management, and data flow. This high-level interface shields
developers from the complexities of implementing convolution operations,
managing memory, or handling parameter updates during training.
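For example, a developer might stack a few such layer abstractions as follows; this is a small illustrative Keras model (assuming TensorFlow is installed), not an excerpt from the chapter's running example.

import tensorflow as tf

# Each layer call builds graph nodes for the operation, its parameters,
# and the data flow between them; the framework manages the rest.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(28, 28, 1)),
    tf.keras.layers.Conv2D(32, kernel_size=3, activation="relu"),
    tf.keras.layers.MaxPooling2D(pool_size=2),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.summary()  # parameters are created and tracked automatically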
Figure 7.6: The two-phase execution model of static computation graphs: first define operations and declare variables to build the graph; then, at runtime, load data, run the graph, and get results.
from math import sin

def f(x):
    a = x * x     # Square
    b = sin(x)    # Sine
    return a * b  # Product

# Forward-mode trace at x = 2.0, seeding dx = 1.0:
# Step 1: x²
a = 4.0      # (2.0)²
da = 4.0     # 2 * 2.0
# Step 2: sin(x)
b = 0.909    # sin(2.0)
db = -0.416  # cos(2.0)
# Final result
result = 3.637   # 4.0 * 0.909
dresult = 1.972  # 4.0 * (-0.416) + 0.909 * 4.0
The structure of such computations can be seen again in Listing 7.4, where each intermediate
operation is made explicit.
from math import sin

def f(x):
    a = x * x
    b = sin(x)
    return a * b
Use Cases. While forward mode automatic differentiation isn’t the primary
choice for training full neural networks, it plays several important roles in
modern machine learning frameworks. Its strength lies in scenarios where we
need to understand how small changes in inputs affect a network’s behavior.
Consider a data scientist trying to understand why their model makes certain
predictions. They might want to analyze how changing a single pixel in an
image or a specific feature in their data affects the model’s output, as illustrated
in Listing 7.6.
As the computation moves through each layer, forward mode carries both
values and derivatives, making it straightforward to see how input perturba-
tions ripple through to the final prediction. For each operation, we can track
exactly how small changes propagate forward.
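A compact dual-number sketch conveys how forward mode carries a value and its derivative together through each operation; it mirrors the f(x) = x² · sin(x) example used in this section rather than reproducing the chapter's listing verbatim.

import math
from dataclasses import dataclass

@dataclass
class Dual:
    val: float   # the value
    dot: float   # its derivative with respect to the chosen input

    def __mul__(self, other):
        # Product rule, applied as the multiplication happens.
        return Dual(self.val * other.val,
                    self.dot * other.val + self.val * other.dot)

def sin(x: Dual) -> Dual:
    return Dual(math.sin(x.val), math.cos(x.val) * x.dot)

def f(x: Dual) -> Dual:
    return (x * x) * sin(x)

out = f(Dual(2.0, 1.0))    # seed the input derivative with 1.0
print(out.val, out.dot)    # ~3.637 and ~1.972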
from math import sin

def f(x):
    a = x * x    # First operation: square x
    b = sin(x)   # Second operation: sine of x
    c = a * b    # Third operation: multiply results
    return c
In this function shown in Listing 7.8, we have three operations that create
a computational chain. Notice how x influences the final result c through
two different paths: once through squaring (a = x²) and once through sine (b =
sin(x)). We’ll need to account for both paths when computing derivatives.
First, the forward pass computes and stores values, as illustrated in Listing 7.9.
Listing 7.9: Forward pass: computing and storing each intermediate value
Then comes the backward pass. This is where reverse mode shows its ele-
gance. This process is demonstrated in Listing 7.10, where we compute the
gradient starting from the output.
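Since the listings themselves are not reproduced here, the following hand-written sketch (our own, with the derivative rules spelled out) captures the idea: the forward pass stores every intermediate value, and the backward pass reuses them while accumulating x's contribution from both paths.

from math import sin, cos

x = 2.0

# Forward pass: compute and store every intermediate value
a = x * x        # stored
b = sin(x)       # stored
c = a * b        # output

# Backward pass: start from dc/dc = 1 and walk the graph in reverse
dc = 1.0
da = dc * b                      # c = a * b  ->  dc/da = b
db = dc * a                      # c = a * b  ->  dc/db = a
dx = da * 2 * x + db * cos(x)    # both paths into x are accumulated
print(c, dx)                     # ~3.637 and ~1.97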
The power of reverse mode becomes clear when we consider what would
happen if we added more operations that depend on x. Forward mode would
need to track derivatives through each new path, but reverse mode efficiently
handles all paths in a single backward pass. This is exactly the scenario in
neural networks, where each weight can affect the final loss through multiple
paths in the network.
Implementation Structure. The implementation of reverse mode in machine learn-
ing frameworks requires careful orchestration of computation and memory.
While forward mode simply augments each computation, reverse mode needs
to maintain a record of the forward computation to enable the backward pass.
Modern frameworks accomplish this through computational graphs and auto-
matic gradient accumulation.
Let’s extend our previous example to a small neural network computation —
see Listing 7.11 for the code structure.
During the forward pass, the framework doesn’t just compute values — it
builds a graph of operations while tracking intermediate results, as illustrated
in Listing 7.12.
x = 1.0
w1 = 2.0
w2 = 3.0
def deep_network(input_tensor):
# A typical deep network computation
layer1 = large_dense_layer(input_tensor)
activation1 = relu(layer1)
layer2 = large_dense_layer(activation1)
activation2 = relu(layer2)
# ... many more layers
output = final_layer(activation_n)
return output
remember what happened during the forward pass. This seemingly simple
requirement creates interesting challenges for machine learning frameworks.
Unlike traditional programs that can discard intermediate results as soon as
they’re used, AD systems must carefully preserve computational history.
This necessity is illustrated in Listing 7.20, which shows what happens during
a neural network’s forward pass.
def neural_network(x):
    # Each operation creates values we need to remember
    a = layer1(x)   # Must store for backward pass
    b = relu(a)     # Must store input to relu
    c = layer2(b)   # Must store for backward pass
    return c
When this network processes data, each operation creates not just its output,
but also a memory obligation. The multiplication in layer1 needs to remem-
ber its inputs because computing its gradient later will require them. Even
the seemingly simple relu function must track which inputs were negative
to correctly propagate gradients. As networks grow deeper, these memory
requirements accumulate — as seen in Listing 7.21.
This memory challenge becomes particularly interesting with deep neural
networks.
Each layer’s computation adds to our memory burden. The framework must
keep hidden1 in memory until we’ve computed gradients through hidden2, but
after that, we can safely discard it. This creates a wave of memory usage that
peaks when we start the backward pass and gradually recedes as we compute
gradients.
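A toy sketch of this bookkeeping (our own illustration, not how frameworks actually implement it): activations are pushed onto a stack during the forward pass and popped, then freed, as the backward pass consumes them.

saved = []   # activations kept alive for the backward pass

def forward(x, layers):
    for layer in layers:
        saved.append(x)          # remember this layer's input
        x = layer(x)
    return x

def backward(grad_out, layer_grads):
    for layer_grad in reversed(layer_grads):
        x_in = saved.pop()       # memory peaks just before the first pop
        grad_out = layer_grad(x_in, grad_out)
    return grad_out

# Toy usage: two "layers" that double and then square a scalar
layers      = [lambda v: 2 * v,      lambda v: v * v]
layer_grads = [lambda v, g: 2 * g,   lambda v, g: 2 * v * g]
out  = forward(3.0, layers)          # 36.0
grad = backward(1.0, layer_grads)    # d/dx of (2x)^2 = 8x = 24.0
print(out, grad)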
Modern frameworks handle this memory choreography automatically. They
track the lifetime of each intermediate value - how long it must remain in mem-
ory for gradient computation. When training large models, this careful memory
management becomes as crucial as the numerical computations themselves.
The framework frees memory as soon as it’s no longer needed for gradient com-
putation, ensuring that our memory usage, while necessarily large, remains as
efficient as possible.
This simple loop masks complex system interactions. The AD system must
coordinate with multiple framework components: the memory allocator, the
device manager, the operation scheduler, and the optimizer. Each gradient
computation potentially triggers data movement between devices, memory
allocation, and kernel launches on accelerators.
The scheduling of AD operations on modern hardware accelerators is illus-
trated in Listing 7.23.
def parallel_network(x):
    # These operations could run concurrently
    branch1 = conv_layer1(x)
    branch2 = conv_layer2(x)
The AD system must track dependencies not just for correct gradient compu-
tation, but also for efficient hardware utilization. It needs to determine which
gradient computations can run in parallel and which must wait for others to
complete. This dependency tracking extends across both forward and backward
passes, creating a complex scheduling problem.
Modern frameworks handle these system-level concerns while maintaining a
simple interface for users. Behind the scenes, they make sophisticated decisions
about operation scheduling, memory allocation, and data movement, all while
ensuring correct gradient computation through the computational graph.
7.3.2.5 Summary
Automatic differentiation systems represent an important computational ab-
straction in machine learning frameworks, transforming the mathematical
concept of derivatives into efficient implementations. Through our examination
of both forward and reverse modes, we've seen how frameworks balance math-
ematical precision with computational efficiency to enable training of modern
neural networks.
The implementation of AD systems reveals key design patterns in machine
learning frameworks. One such pattern is shown in Listing 7.24.
def neural_network(x):
    hidden = w1 * x             # What exactly is x?
    activated = relu(hidden)    # How is hidden stored?
    output = w2 * activated     # What type of multiplication?
    return output
7.3.3.1 Tensors
Machine learning frameworks process and store numerical data as tensors.
Every computation in a neural network, from processing input data to updating
model weights, operates on tensors. Training batches of images, activation maps
in convolutional networks, and parameter gradients during backpropagation
all take the form of tensors. This unified representation allows frameworks to
implement consistent interfaces for data manipulation and optimize operations
across different hardware architectures.
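As a small illustration (the shapes below are our own choices), the same tensor abstraction covers scalars, single images, and whole training batches:

import torch

scalar = torch.tensor(3.14)                  # rank-0 tensor
image  = torch.rand(3, 224, 224)             # channels x height x width
batch  = torch.rand(32, 3, 224, 224)         # a training batch of 32 images
print(scalar.shape, image.shape, batch.shape)
print(batch.dtype, batch.device)             # dtype and placement are part of the tensor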
Figure 7.8: Visualization of a tensor data structure.
Figure 7.9: Visualization of a colored image (3 color channels, 3×3 pixels) that can easily be stored as a 3D tensor. Credit: Niklas Lang
leverage parallelism across CPU cores, disk I/O, and memory transfers to feed
accelerators at full capacity.
In large, multi-system distributed training scenarios, dataset structures also
handle coordination between nodes, ensuring that each worker processes a
distinct subset of data while maintaining consistency in operations like shuffling.
This coordination prevents redundant computation and supports scalability
across multiple devices and machines.
Parameter Structures. Parameter structures store the numerical values that
define a machine learning model. These include the weights and biases of
neural network layers, along with auxiliary data such as batch normalization
statistics and optimizer state. Unlike datasets, which are transient, parameters
persist throughout the lifecycle of model training and inference.
The design of parameter structures must balance efficient storage with rapid
access during computation. For example, convolutional neural networks re-
quire parameters for filters, fully connected layers, and normalization layers,
each with unique shapes and memory alignment requirements. Frameworks
organize these parameters into compact representations that minimize memory
consumption while enabling fast read and write operations.
A key challenge for parameter structures is managing memory efficiently
across multiple devices (0003 et al. 2014). During distributed training, frame-
works may replicate parameters across GPUs for parallel computation while
keeping a synchronized master copy on the CPU. This strategy ensures consis-
tency while reducing the latency of gradient updates. Additionally, parameter
structures often leverage memory sharing techniques to minimize duplication,
such as storing gradients and optimizer states in place to conserve memory.
Parameter structures must also adapt to various precision requirements.
While training typically uses 32-bit floating-point precision for stability, reduced
precision such as 16-bit floating-point or even 8-bit integers is increasingly used
for inference and large-scale training. Frameworks implement type casting and
mixed-precision management to enable these optimizations without compro-
mising numerical accuracy.
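A small sketch of this kind of precision management in PyTorch (the layer and the casts are illustrative assumptions): full-precision master parameters are kept, while reduced-precision copies can be derived where they are sufficient.

import torch
import torch.nn as nn

layer = nn.Linear(256, 128)                  # parameters stored in FP32 by default
w = layer.weight
print(w.dtype, w.shape)                      # torch.float32, torch.Size([128, 256])

w_fp16 = w.detach().to(torch.float16)        # half-precision copy (2 bytes per value)
w_bf16 = w.detach().to(torch.bfloat16)       # bfloat16 copy, wider dynamic range
print(w.element_size(), w_fp16.element_size(), w_bf16.element_size())   # 4, 2, 2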
Execution Structures. Execution structures coordinate how computations are
performed on hardware, ensuring that operations execute efficiently while
respecting device constraints. These structures work closely with computational
graphs, determining how data flows through the system and how memory is
allocated for intermediate results.
One of the primary roles of execution structures is memory management.
During training or inference, intermediate computations such as activation
maps or gradients can consume significant memory. Execution structures
dynamically allocate and deallocate memory buffers to avoid fragmentation
and maximize hardware utilization. For example, a deep neural network might
reuse memory allocated for activation maps across layers, reducing the overall
memory footprint.
These structures also handle operation scheduling, ensuring that computa-
tions are performed in the correct order and with optimal hardware utilization.
On GPUs, for instance, execution structures can overlap computation and data
transfer operations, hiding latency and improving throughput. When running
The immediate execution model is intuitive and aligns with common pro-
gramming practices, making it easier to use. Errors can be detected and resolved
immediately during execution, simplifying debugging. Dynamic graphs allow
for adjustments on-the-fly, making them ideal for tasks requiring variable graph structures.
These hybrid approaches help bridge the gap between the flexibility of imperative programming and the efficiency of symbolic programming, enabling developers to navigate the trade-offs effectively.
import tensorflow as tf
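As a minimal sketch of the eager computation described next, continuing from the import above (the tensor values are illustrative assumptions):

x = tf.constant([[1.0, 2.0], [3.0, 4.0]])
y = tf.constant([[5.0, 6.0], [7.0, 8.0]])
z = tf.matmul(x, y)   # runs immediately under eager execution
print(z)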
In this code snippet, each line is executed sequentially. When we create the
tensors x and y, they are immediately instantiated in memory. The matrix mul-
tiplication tf.matmul(x, y) is computed right away, and the result is stored
in z. When we print z, we see the output of the computation immediately.
Eager execution offers several advantages. It provides immediate feedback, al-
lowing developers to inspect intermediate values easily. This makes debugging
more straightforward and intuitive. It also allows for more dynamic and flexible
code structures, as the computation graph can change with each execution.
However, eager execution has its trade-offs. Since operations are executed
immediately, the framework has less opportunity to optimize the overall compu-
tation graph. This can lead to lower performance compared to more optimized
execution paradigms, especially for complex models or when dealing with large
datasets.
Eager execution is particularly well-suited for research, interactive devel-
opment, and rapid prototyping. It allows data scientists and researchers to
quickly iterate on their ideas and see results immediately. Many modern ML frameworks, including PyTorch and TensorFlow 2.x, use eager execution as their default mode.
import tensorflow.compat.v1 as tf
tf.disable_eager_execution()
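As a sketch of the graph-mode computation described next, continuing from the two lines above (shapes and values are illustrative assumptions):

x = tf.placeholder(tf.float32, shape=(2, 2))
y = tf.placeholder(tf.float32, shape=(2, 2))
z = tf.matmul(x, y)          # only adds a node to the graph; nothing runs yet

with tf.Session() as sess:
    result = sess.run(z, feed_dict={x: [[1.0, 2.0], [3.0, 4.0]],
                                    y: [[5.0, 6.0], [7.0, 8.0]]})
    print(result)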
In this code snippet, we first define the structure of our computation. The
placeholder operations create nodes in the graph for input data, while tf.matmul
creates a node representing matrix multiplication. Importantly, no actual com-
putation occurs during this definition phase.
The execution of the graph happens when we create a session and call
sess.run(). At this point, we provide the actual input data through the feed_dict parameter. The framework then has the complete graph and can perform
optimizations before running the computation.
Graph execution offers several advantages. It allows the framework to see
the entire computation ahead of time, enabling global optimizations that can
improve performance, especially for complex models. Once defined, the graph
can be easily saved and deployed across different environments, enhancing
portability. It’s particularly efÏcient for scenarios where the same computation
is repeated many times with different data inputs.
However, graph execution also has its trade-offs. It requires developers to
think in terms of building a graph rather than writing sequential operations,
which can be less intuitive. Debugging can be more challenging because errors may only surface when the graph is executed, far from where the offending operation was defined.
import torch

@torch.jit.script
def compute(x, y):
    return torch.matmul(x, y)

x = torch.randn(2, 2)
y = torch.randn(2, 2)
result = compute(x, y)   # runs the scripted (compiled) version of the function
eagerly. At the same time, it can deliver performance improvements for critical
parts of the computation. JIT compilation can also adapt to the specific data
types and shapes being used, potentially resulting in more efficient code than
static graph compilation.
However, JIT compilation also has some considerations. The first execution
of a compiled function may be slower due to the overhead of the compilation
process. Additionally, some complex Python constructs may not be easily JIT-
compiled, requiring developers to be aware of what can be optimized effectively.
JIT compilation is particularly useful in scenarios where you need both the
flexibility of eager execution for development and prototyping, and the per-
formance benefits of compilation for production or large-scale training. It’s
commonly used in research settings where rapid iteration is necessary but
performance is still a concern.
Many modern ML frameworks incorporate JIT compilation to provide devel-
opers with a balance of ease-of-use and performance optimization, as shown in
Table 7.2. This balance manifests across multiple dimensions, from the learning
curve that gradually introduces optimization concepts to the runtime behavior
that combines immediate feedback with performance enhancements. The table
highlights how JIT compilation bridges the gap between eager execution’s pro-
gramming simplicity and graph execution’s performance benefits, particularly
in areas like memory usage and optimization scope.
Figure 7.11: Data parallelism: the same neural network is replicated on each GPU, with each replica processing a different portion of the input data.
Model Parallelism. While data parallelism is effective for many machine learn-
ing workloads, some models are too large to fit within the memory of a single
device. Model parallelism addresses this limitation by partitioning the model
itself across multiple devices, allowing each to process a different portion of
the computation. Unlike data parallelism, where the entire model is replicated
on each device, model parallelism divides layers, tensors, or specific operations
among available hardware resources, as shown in Figure 8.16. This approach
enables training of large-scale models that would otherwise be constrained by
single-device memory limits.
Figure: Model parallelism, with the network's hidden layers partitioned across GPU 0 and GPU 1 while processing the full dataset.
pipeline parallelism, and expert parallelism, among others. The specific trade-
offs and applications of these techniques will be explored in later chapters, and
Figure 7.13 shows some initial intuition in comparing parallelism strategies.
Regardless of the exact approach, AI frameworks play an important role in
managing workload partitioning, scheduling computations efficiently, and
minimizing communication overhead—ensuring that even the largest models
can be trained at scale.
Figure 7.13: An example depiction of tensor parallelism versus pipeline parallelism across 2 GPUs. Note how the first case shards each Linear layer across GPUs, while the second assigns the first Linear layer to the first GPU and the second to the second GPU.
Figure 7.14: Hierarchical structure of operations in machine learning frameworks (compute kernel scheduling and management, memory management and abstraction, resource optimization, and execution control over GEMM, BLAS, and element-wise operations).
import torch
import torch.nn as nn

class SimpleNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 64, kernel_size=3)
        self.fc = nn.Linear(64, 10)
from tensorflow import keras

model = keras.Sequential([
    keras.layers.Conv2D(64, 3, activation='relu', input_shape=(32, 32, 3)),
    keras.layers.Flatten(),
    keras.layers.Dense(10)
])
At the heart of every machine learning framework lies a set of core libraries,
forming the foundation upon which all other components are built. These
libraries provide the essential building blocks for machine learning operations,
implementing fundamental tensor operations that serve as the backbone of
numerical computations. Heavily optimized for performance, these operations
often leverage low-level programming languages and hardware-specific op-
timizations to ensure efficient execution of tasks like matrix multiplication, a
cornerstone of neural network computations.
Alongside these basic operations, core libraries implement automatic dif-
ferentiation capabilities, enabling the efficient computation of gradients for
complex functions. This feature is crucial for the backpropagation algorithm
that powers most neural network training. The implementation often involves
intricate graph manipulation and symbolic computation techniques, abstracting
away the complexities of gradient calculation from the end-user.
Building upon these fundamental operations, core libraries typically provide
pre-implemented neural network layers such as convolutional, recurrent, and
attention mechanisms. These ready-to-use components save developers from
reinventing the wheel for common model architectures, allowing them to focus
on higher-level model design rather than low-level implementation details.
Similarly, optimization algorithms like various flavors of gradient descent are
provided out-of-the-box, further streamlining the model development process.
A simplified example of how these components might be used in practice is
shown in Listing 7.35.
import torch
import torch.nn as nn
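As a hedged sketch of how these pre-built pieces typically fit together, continuing from the imports above (the model, data, and hyperparameters are illustrative assumptions):

model = nn.Sequential(              # pre-implemented layers
    nn.Linear(784, 128),
    nn.ReLU(),
    nn.Linear(128, 10),
)
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)   # out-of-the-box optimizer

x = torch.randn(32, 784)
y = torch.randint(0, 10, (32,))
loss = loss_fn(model(x), y)
loss.backward()                     # gradients via automatic differentiation
optimizer.step()                    # one gradient descent update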
and logging prediction inputs and outputs for auditing purposes. Tools like
Prometheus and Grafana are often integrated with ML serving systems to
provide comprehensive monitoring solutions.
Figure 7.15: Diagram showing how a data engineer might interact with Airflow, an example orchestration service, in scheduling tasks, executing them across distributed workers, and visualizing the results.
unique strengths and ecosystem, but few have remained as industry standards.
Here we examine the mature and major players in the field, starting with a
comprehensive look at TensorFlow, followed by PyTorch, JAX, and other notable
frameworks.
Figure: TensorFlow's training and deployment workflow, spanning distribution strategies during training and deployment via SavedModel to targets such as TensorFlow.js, browsers, Node, and servers.
7.6.2 PyTorch
PyTorch, developed by Facebook’s AI Research lab, has gained significant
traction in the machine learning community, particularly among researchers
and academics. Its design philosophy emphasizes ease of use, flexibility, and
dynamic computation, which aligns well with the iterative nature of research
and experimentation.
At the core of PyTorch's architecture lies its dynamic computational graph system. Unlike
the static graphs used in earlier versions of TensorFlow, PyTorch builds the
computational graph on-the-fly during execution. This approach, often re-
ferred to as “define-by-run,” allows for more intuitive model design and easier
debugging, as we discussed earlier. Moreover, developers can use standard
Python control flow statements within their models, and the graph structure
can change from iteration to iteration. This flexibility is particularly advanta-
geous when working with variable-length inputs or complex, dynamic neural
network architectures.
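A short sketch of this define-by-run behavior (the function and shapes are our own illustration): ordinary Python control flow decides, at run time, which operations enter the graph.

import torch

def dynamic_step(x, w):
    # The graph is built as this Python code executes; the branch taken can
    # differ from one call to the next depending on the data.
    if x.sum() > 0:
        return torch.relu(x @ w)
    return torch.tanh(x @ w)

x = torch.randn(4, 8)
w = torch.randn(8, 2, requires_grad=True)
out = dynamic_step(x, w).sum()
out.backward()        # gradients flow through whichever branch actually ran
print(w.grad.shape)   # torch.Size([8, 2])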
PyTorch’s eager execution mode is tightly coupled with its dynamic graph
approach. Operations are executed immediately as they are called, rather than
being deferred for later execution in a static graph. This immediate execu-
tion facilitates easier debugging and allows for more natural integration with
Python’s native debugging tools. The eager execution model aligns closely with
PyTorch’s imperative programming style, which many developers find more
intuitive and Pythonic.
PyTorch’s fundamental data structure is the tensor, similar to TensorFlow and
other frameworks discussed in earlier sections. PyTorch tensors are conceptually similar to multidimensional arrays such as NumPy's, with added support for GPU acceleration and automatic differentiation.
7.6.3 JAX
JAX, developed by Google Research, is a newer entrant in the field of machine
learning frameworks. Unlike TensorFlow and PyTorch, which were primar-
ily designed for deep learning, JAX focuses on high-performance numerical
computing and advanced machine learning research. Its design philosophy
centers around functional programming principles and composition of trans-
formations, offering a fresh perspective on building and optimizing machine
learning systems.
JAX is built as a NumPy-like library with added capabilities for automatic
differentiation and just-in-time compilation. This foundation makes JAX feel
familiar to researchers accustomed to scientific computing in Python, while
providing powerful tools for optimization and acceleration. Where TensorFlow
uses static computational graphs and PyTorch employs dynamic ones, JAX takes
a different approach altogether, as it is a system for transforming numerical
functions.
One of JAX’s key features is its powerful automatic differentiation system.
Unlike TensorFlow’s static graph approach or PyTorch’s dynamic computation,
JAX can differentiate native Python and NumPy functions, including those with
loops, branches, and recursion. This capability extends beyond simple scalar-
to-scalar functions, allowing for complex transformations like vectorization
and JIT compilation. This flexibility is particularly valuable for researchers
exploring novel machine learning techniques and architectures.
JAX leverages XLA (Accelerated Linear Algebra) for just-in-time compilation,
similar to TensorFlow but with a more central role in its operation. This allows
JAX to optimize and compile Python code for various hardware accelerators,
including GPUs and TPUs. In contrast to PyTorch’s eager execution and Tensor-
Flow’s graph optimization, JAX’s approach can lead to significant performance
improvements, especially for complex computational patterns.
Where TensorFlow and PyTorch primarily use object-oriented and imperative
programming models, JAX embraces functional programming. This approach
encourages the use of pure functions and immutable data, which can lead to
more predictable and easier-to-optimize code. It’s a significant departure from
the stateful models common in other frameworks and can require a shift in
thinking for developers accustomed to TensorFlow or PyTorch.
JAX introduces a set of composable function transformations that set it apart
from both TensorFlow and PyTorch. These include automatic differentiation
(grad), just-in-time compilation, automatic vectorization (vmap), and paral-
lel execution across multiple devices (pmap). These transformations can be
composed, allowing for powerful and flexible operations that are not as straight-
forward in other frameworks.
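A brief sketch of composing these transformations (the loss function and data are illustrative assumptions):

import jax
import jax.numpy as jnp

def loss(w, x, y):
    pred = jnp.dot(x, w)
    return jnp.mean((pred - y) ** 2)

grad_loss = jax.jit(jax.grad(loss))     # differentiate w.r.t. w, then JIT-compile
w = jnp.ones(3)
x = jnp.arange(12.0).reshape(4, 3)
y = jnp.ones(4)
print(grad_loss(w, x, y))               # gradient of the loss with respect to w

batched_dot = jax.vmap(jnp.dot, in_axes=(0, None))   # vectorize over rows of x
print(batched_dot(x, w))                # per-example predictions without a Python loop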
Mobile ML frameworks also often include features for model updating and
versioning, allowing for the deployment of improved models without requir-
ing full app updates. Some frameworks support limited on-device learning,
enabling models to adapt to user behavior or environmental changes without
compromising data privacy.
The specializations of mobile ML frameworks collectively enable the de-
ployment of sophisticated ML models on resource-constrained mobile devices.
This expands the potential applications of AI in mobile environments, ranging
from real-time image and speech recognition to personalized user experiences.
However, effectively utilizing these frameworks requires careful consideration
of the target device capabilities, user experience requirements, and privacy
implications, necessitating a balance between model performance and resource
utilization.
| Category | Feature | TensorFlow | TensorFlow Lite | TensorFlow Lite Micro |
|---|---|---|---|---|
| Model | Training | Yes | No | No |
| Model | Inference | Yes (but inefficient on edge) | Yes (and efficient) | Yes (and even more efficient) |
| Model | How Many Ops | ~1400 | ~130 | ~50 |
| Model | Native Quantization Tooling + Support | No | Yes | Yes |
| Software | Needs an OS | Yes | Yes | No |
| Software | Memory Mapping of Models | No | Yes | Yes |
| Software | Delegation to accelerators | Yes | Yes | No |
| Hardware | Base Binary Size | 3 MB+ | 100 KB | ~10 KB |
| Hardware | Base Memory Footprint | ~5 MB | 300 KB | 20 KB |
| Hardware | Optimized Architectures | x86, TPUs, GPUs | Arm Cortex-A, x86 | Arm Cortex-M, DSPs, MCUs |
7.9 Conclusion
AI frameworks have evolved from basic numerical libraries into sophisticated
software systems that shape how we develop and deploy machine learning
applications. The progression from early numerical computing to modern deep
learning frameworks demonstrates the field’s rapid technological advancement.
Modern frameworks like TensorFlow, PyTorch, and JAX implement distinct
approaches to common challenges in machine learning development. Each
framework offers varying tradeoffs between ease of use, performance, and
flexibility. TensorFlow emphasizes production deployment, PyTorch focuses on
research and experimentation, while JAX prioritizes functional programming
patterns.
The specialization of frameworks into cloud, edge, mobile, and tiny ML imple-
mentations reflects the diverse requirements of machine learning applications.
Cloud frameworks optimize for scalability and distributed computing. Edge
and mobile frameworks prioritize model efficiency and reduced resource con-
sumption. TinyML frameworks target constrained environments with minimal
computing resources.
Understanding framework architecture, from tensor operations to execution
models, enables developers to select appropriate tools for specific use cases,
optimize application performance, debug complex computational graphs, and
deploy models across different computing environments.
7.10 Resources
Slides
• Coming soon.
Videos
• Coming soon.
Exercises
• Coming soon.
Chapter 8
AI Training
How do machine learning training workloads manifest as systems challenges, and what
architectural principles guide their efficient implementation?
Machine learning training is a unique class of computational workload that
demands careful orchestration of computation, memory, and data movement.
The process of transforming training algorithms into efficient system implemen-
tations requires understanding how mathematical operations map to hardware
resources, how data flows through memory hierarchies, and how system ar-
chitectures influence training performance. Investigating these system-level
considerations helps establish core principles for designing and optimizing
training infrastructure. By understanding and addressing these challenges,
we can develop more efficient and scalable solutions to meet the demands of
modern machine learning workloads.
Learning Objectives
8.1 Overview
Machine learning has revolutionized modern computing by enabling systems
to learn patterns from data, with training being its cornerstone. This compu-
tationally intensive process involves adjusting millions, and even billions, of
parameters to minimize errors on training examples while ensuring the model
generalizes effectively to unseen data. The success of machine learning models
hinges on this training phase.
The training process brings together algorithms, data, and computational re-
sources into an integrated workflow. Models, particularly deep neural networks
used in domains such as computer vision and natural language processing,
require significant computational effort due to their complexity and scale. Even
resource-constrained models, such as those used in Mobile ML or Tiny ML
applications, require careful tuning to achieve an optimal balance between
accuracy, computational efficiency, and generalization.
As models have grown in size and complexity, the systems that enable efficient training have become increasingly sophisticated. Training systems must coordinate computation across memory hierarchies, manage data movement, and optimize resource utilization, all while maintaining numerical stability and convergence properties. This intersection of mathematical optimization with systems engineering creates unique challenges in maximizing training throughput.
Model sizes have grown exponentially since AlexNet (60M parameters) in 2012, with modern large language models like GPT-4 estimated to have over 1 trillion parameters, an increase of over 16,000x in just over a decade.
This chapter examines the key components and architecture of machine learn-
ing training systems. We discuss the design of training pipelines, memory and
computation systems, data management strategies, and advanced optimization
techniques. Additionally, we explore distributed training frameworks and their
role in scaling training processes. Real-world examples and case studies are
provided to connect theoretical principles to practical implementations, offer-
ing insight into the development of efficient, scalable, and effective training
systems.
Figure 8.2: Timeline of major advancements in computing systems for machine learning, showing the evolution from mainframes (ENIAC, IBM System/360) through high-performance computing (CDC 6600, Cray-1) and warehouse-scale computing (Google data centers, AWS) to the AI hypercomputing era (NVIDIA GPUs, Google TPUs).
Electronic computation began with the mainframe era. ENIAC (1945) estab-
lished the viability of electronic computation at scale, while the IBM System/360
(1964) introduced architectural principles of standardized instruction sets and
memory hierarchies. These fundamental concepts laid the groundwork for all
subsequent computing systems.
High-performance computing (HPC) systems (Thornton 1965) built upon
these foundations while specializing for scientific computation. The CDC 6600
and later systems like the CM-5 (T. M. Corporation 1992) optimized for dense
matrix operations and floating-point calculations.
These systems implemented specific architectural features for scientific
workloads: high-bandwidth memory systems for array operations, vector pro-
cessing units for mathematical computations, and specialized interconnects for
collective communication patterns. Scientific computing demanded emphasis
on numerical precision and stability, with processors and memory systems
designed for regular, predictable access patterns. The interconnects supported
tightly synchronized parallel execution, enabling efficient collective operations
across computing nodes.
Warehouse-scale computing marked the next evolutionary step. Google’s
data center implementations (Barroso and Hölzle 2007a) introduced new opti-
mizations for internet-scale data processing. Unlike HPC systems focused on
$$A^{(l)} = f\left(W^{(l)} A^{(l-1)} + b^{(l)}\right)$$
Where:
• $A^{(l-1)}$ represents the activations from the previous layer (or the input layer for the first layer),
• $W^{(l)}$ is the weight matrix at layer $l$, which contains the parameters learned by the network,
• $b^{(l)}$ is the bias vector for layer $l$,
• $f(\cdot)$ is the activation function applied element-wise (e.g., ReLU, sigmoid) to introduce non-linearity.
For example, tanh is widely used in recurrent neural networks (RNNs), where
its bounded and symmetric properties help stabilize learning dynamics over
time. While tanh has largely been replaced by ReLU in many modern architec-
tures due to its computational inefficiencies and vanishing gradient issues, it
remains a viable choice in scenarios where its range and symmetry are benefi-
cial.
ReLU. The Rectified Linear Unit (ReLU) is one of the most widely used activa-
tion functions in modern neural networks. Its simplicity and effectiveness have
made it the default choice for most machine learning architectures. The ReLU
function is defined as:
ReLU(𝑥) = max(0, 𝑥)
This function outputs the input value if it is positive and zero otherwise.
Unlike sigmoid and tanh, which produce smooth, bounded outputs, ReLU
introduces sparsity in the network by setting all negative inputs to zero. This
sparsity can help reduce overfitting and improve computation efficiency in
many scenarios.
ReLU is particularly effective in avoiding the vanishing gradient problem,
as it maintains a constant gradient for positive inputs. However, it introduces
another issue known as the dying ReLU problem, where neurons can become
permanently inactive if they consistently output zero. This occurs when the
weights cause the input to remain in the negative range. In such cases, the
neuron no longer contributes to learning.
ReLU is commonly used in the hidden layers of neural networks, particularly
in convolutional neural networks (CNNs) and machine learning models for
image and speech recognition tasks. Its computational simplicity and ability to
prevent vanishing gradients make it ideal for training deep architectures.
Softmax. The softmax function is a widely used activation function, primarily
applied in the output layer of classification models. It transforms raw scores
into a probability distribution, ensuring that the outputs sum to 1. This makes
it particularly suitable for multi-class classification tasks, where each output
represents the probability of the input belonging to a specific class.
The mathematical definition of the softmax function for a vector of inputs $\mathbf{z} = [z_1, z_2, \ldots, z_K]$ is:
$$\sigma(z_i) = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}, \quad i = 1, 2, \ldots, K$$
Here, $K$ is the number of classes, $z_i$ represents the raw score (logit) for the $i$-th class, and $\sigma(z_i)$ is the probability of the input belonging to that class.
Softmax has several desirable properties that make it essential for classifi-
cation tasks. It converts arbitrary real-valued inputs into probabilities, with
each output value in the range (0, 1) and the sum of all outputs equal to 1. The
function is differentiable, which allows it to be used with gradient-based opti-
mization methods. Additionally, the probabilistic interpretation of its output is
crucial for tasks where confidence levels are needed, such as object detection or
language modeling.
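A small sketch of the computation (subtracting the maximum logit is a standard numerical-stability trick and an assumption on our part, not something stated above):

import numpy as np

def softmax(z):
    shifted = z - np.max(z)      # stabilize the exponentials; the result is unchanged
    exp_z = np.exp(shifted)
    return exp_z / np.sum(exp_z)

logits = np.array([2.0, 1.0, 0.1])
probs = softmax(logits)
print(probs, probs.sum())        # probabilities in (0, 1) that sum to 1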
Figure 8.4: Activation function performance: execution time in seconds for Sigmoid, Tanh, ReLU, and Softmax.
While these benchmark results provide valuable insights, they represent CPU-
only performance without hardware acceleration. In production environments,
modern hardware accelerators like GPUs can substantially alter the relative
performance characteristics of activation functions. System architects must
therefore consider their specific hardware environment and deployment context
when evaluating computational efficiency.
The selection of activation functions requires careful balancing of compu-
tational considerations against mathematical properties. Key factors include
the function’s ability to mitigate vanishing gradients and introduce beneficial
sparsity in neural activations. Each major activation function presents distinct
advantages and challenges:
Sigmoid. The sigmoid function has smooth gradients and a bounded output in
the range (0, 1), making it useful in probabilistic settings. However, the compu-
tation of the sigmoid involves an exponential function, which becomes a key
consideration in both software and hardware implementations. In software, this
computation is expensive and inefficient, particularly for deep networks or large
datasets. Additionally, sigmoid suffers from vanishing gradients, especially for
large input values, which can hinder the learning process in deep architectures.
Its non-zero-centered output can also slow optimization, requiring more epochs
to converge.
These computational challenges are addressed differently in hardware. Mod-
ern accelerators like GPUs and TPUs typically avoid direct computation of the
exponential function, instead using lookup tables (LUTs) or piece-wise linear
approximations to balance accuracy with speed. While these hardware opti-
mizations help, the multiple memory lookups and interpolation calculations
still make sigmoid more resource-intensive than simpler functions like ReLU,
even on highly parallel architectures.
Tanh. The tanh function outputs values in the range (−1, 1), making it zero-
centered and helping to stabilize gradient-based optimization algorithms. This
zero-centered output helps reduce biases in weight updates, an advantage
over sigmoid. Like sigmoid, however, tanh involves exponential computations
that impact both software and hardware implementations. In software, this
computational overhead can slow training, particularly when working with
large datasets or deep models. While tanh helps prevent some of the saturation
issues associated with sigmoid, it still suffers from vanishing gradients for large
inputs, especially in deep networks.
In hardware, tanh leverages its mathematical relationship with sigmoid
(being essentially a scaled and shifted version) to optimize implementation.
Modern hardware often implements tanh using a hybrid approach: lookup tables for common input ranges combined with piece-wise approximations for edge cases. This approach helps balance accuracy with computational efficiency,
though tanh remains more resource-intensive than simpler functions. Despite
these challenges, tanh remains common in RNNs and LSTMs where balanced
gradients are crucial.
ReLU. The ReLU function stands out for its mathematical simplicity: it passes
positive values unchanged and sets negative values to zero. This straightfor-
ward behavior has profound implications for both software and hardware
implementations. In software, ReLU’s simple thresholding operation results in
faster computation compared to sigmoid or tanh. It also helps prevent vanish-
ing gradients and introduces beneficial sparsity in activations, as many neurons
output zero. However, ReLU can suffer from the “dying ReLU” problem in
deep networks, where neurons become permanently inactive and never update
their weights.
The hardware implementation of ReLU showcases why it has become the
dominant activation function in modern neural networks. Its simple max(0, 𝑥)
operation requires just a single comparison and conditional set, translating to
minimal circuit complexity. Modern GPUs and TPUs can implement ReLU
using a simple multiplexer that checks the input's sign bit, allowing for extremely fast execution.
Softmax. The softmax function transforms raw logits into a probability distri-
bution, ensuring outputs sum to 1, making it essential for classification tasks.
Its computation involves exponentiating each input value and normalizing by
their sum, a process that becomes increasingly complex with larger output
spaces. In software, this creates significant computational overhead for tasks
like natural language processing, where vocabulary sizes can reach hundreds
of thousands of terms. However, this is typically not a significant issue, since
it is often only used in the final layer. The function also requires keeping all
values in memory during computation, as each output probability depends on
the entire input.
At the hardware level, softmax faces unique challenges because it can’t pro-
cess each value independently like other activation functions. Unlike ReLU’s
simple threshold or even sigmoid’s per-value computation, softmax needs
access to all values to perform normalization. This becomes particularly de-
manding in modern transformer architectures, where softmax computations in
attention mechanisms process thousands of values simultaneously. To manage
these demands, hardware implementations often use approximation techniques
or simplified versions of softmax, especially when dealing with large vocabu-
laries or attention mechanisms.
Table 8.2 summarizes the trade-offs of these commonly used activation func-
tions and highlights how these choices affect system performance.
Table 8.2: Comparison of commonly used activation functions, their advantages and disadvantages, and their system implications.

| Function | Key Advantages | Key Disadvantages | System Implications |
|---|---|---|---|
| Sigmoid | Smooth gradients; bounded output in (0, 1). | Vanishing gradients; non-zero-centered output. | Exponential computation adds overhead; limited scalability for deep networks on modern accelerators. |
| Tanh | Zero-centered output in (−1, 1); stabilizes gradients. | Vanishing gradients for large inputs. | More expensive than ReLU; effective for RNNs/LSTMs but less common in CNNs and Transformers. |
| ReLU | Computationally efficient; avoids vanishing gradients; introduces sparsity. | Dying neurons; unbounded output. | Simple operations optimize well on GPUs/TPUs; sparse activations reduce memory and computation needs. |
| Softmax | Converts logits into probabilities; sums to 1. | Computationally expensive for large outputs. | High cost for large vocabularies; hierarchical or sampled softmax needed for scalability in NLP tasks. |
Softmax, despite its computational cost, remains indispensable for producing classification probabilities. Ultimately, the ideal activation function depends on the specific task, network architecture, and hardware environment.
memory proportional to the depth of the network and the number of examples
being processed.
Traditional gradient descent processes the entire dataset in each iteration. For
a training set with 1 million examples, computing gradients requires evaluating
and storing results for each example before performing a parameter update.
This approach poses significant system challenges:
$$\theta_{t+1} = \theta_t - \alpha \nabla L(\theta_t; x_i, y_i)$$
$$\theta_{t+1} = \theta_t - \alpha \frac{1}{B} \sum_{i=1}^{B} \nabla L(\theta_t; x_i, y_i)$$
A common momentum formulation maintains a velocity vector of accumulated gradients:
$$v_t = \beta v_{t-1} + \nabla L(\theta_t), \qquad \theta_{t+1} = \theta_t - \alpha v_t$$
where $\beta$ is the momentum coefficient, typically set between 0.9 and 0.99. From a
systems perspective, momentum introduces additional memory requirements.
The training system must maintain a velocity vector with the same dimen-
sionality as the parameter vector, effectively doubling the memory needed for
optimization state.
Adaptive Learning Rate Methods. RMSprop modifies the basic gradient de-
scent update by maintaining a moving average of squared gradients for each
parameter:
$$s_t = \gamma s_{t-1} + (1 - \gamma)\left(\nabla L(\theta_t)\right)^2$$
$$\theta_{t+1} = \theta_t - \alpha \frac{\nabla L(\theta_t)}{\sqrt{s_t + \epsilon}}$$
Adam combines momentum with RMSprop-style adaptive scaling, maintaining estimates of both the first and second moments of the gradient:
$$m_t = \beta_1 m_{t-1} + (1 - \beta_1)\nabla L(\theta_t)$$
$$v_t = \beta_2 v_{t-1} + (1 - \beta_2)\left(\nabla L(\theta_t)\right)^2$$
$$\theta_{t+1} = \theta_t - \alpha \frac{m_t}{\sqrt{v_t + \epsilon}}$$
The system implications of Adam are more substantial than previous meth-
ods. The optimizer must store two additional vectors (𝑚𝑡 and 𝑣𝑡 ) for each
parameter, tripling the memory required for optimization state. For a model
with 100 million parameters using 32-bit floating-point numbers, the additional
memory requirement is approximately 800 MB.
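This overhead can be observed directly; a small sketch (the model size is an illustrative assumption) inspects the state tensors that PyTorch's Adam keeps per parameter:

import torch
import torch.nn as nn

model = nn.Linear(1000, 1000)            # ~1M parameters (illustrative)
optimizer = torch.optim.Adam(model.parameters())

# One step so Adam materializes its per-parameter state (exp_avg, exp_avg_sq)
loss = model(torch.randn(32, 1000)).sum()
loss.backward()
optimizer.step()

param_bytes = sum(p.numel() * p.element_size() for p in model.parameters())
state_bytes = sum(t.numel() * t.element_size()
                  for state in optimizer.state.values()
                  for t in state.values() if torch.is_tensor(t))
print(param_bytes, state_bytes)          # the Adam state is roughly 2x the parameter size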
$$\text{Memory}_{\text{SGD}} = \text{Size}_{\text{params}}$$
$$\text{Memory}_{\text{Momentum}} = 2 \times \text{Size}_{\text{params}}$$
$$\text{Memory}_{\text{Adam}} = 3 \times \text{Size}_{\text{params}}$$
$$\text{Bandwidth}_{\text{separate}} = 5 \times \text{Size}_{\text{params}}$$
$$\text{Bandwidth}_{\text{fused}} = 2 \times \text{Size}_{\text{params}}$$
The forward pass computes and stores the intermediate values that are later reused in the backward pass:
$$z^{(l)} = W^{(l)} a^{(l-1)} + b^{(l)}, \qquad a^{(l)} = f\left(z^{(l)}\right)$$
where $z^{(l)}$ represents the pre-activation values and $a^{(l)}$ represents the activations
at layer 𝑙. The storage of these intermediate values creates specific memory
requirements that scale with network depth and batch size.
The backward pass computes gradients by applying the chain rule, starting
from the network's output and moving toward the input:
$$\frac{\partial L}{\partial z^{(l)}} = \frac{\partial L}{\partial a^{(l)}} \odot f'\left(z^{(l)}\right)$$
$$\frac{\partial L}{\partial W^{(l)}} = \frac{\partial L}{\partial z^{(l)}} \left(a^{(l-1)}\right)^{T}$$
Each gradient computation requires access to stored activations from the
forward pass, creating a specific pattern of memory access and computation
that training systems must manage efficiently.
Both 𝑧 (𝑙) and 𝑎(𝑙) must be cached for the backward pass. This creates a
multiplicative effect on memory usage: each layer’s memory requirement is
multiplied by the batch size, and the optimizer’s memory overhead (discussed
in the previous section) applies to each parameter.
The total memory needed scales with:
• Network depth (number of layers)
• Layer widths (number of parameters per layer)
• Batch size (number of examples processed together)
• Optimizer state (additional memory for algorithms like Adam)
This creates a complex set of trade-offs. Larger batch sizes enable more
efficient computation and better gradient estimates for optimization, but re-
quire proportionally more memory for storing activations. More sophisticated
optimizers like Adam can achieve faster convergence but require additional
memory per parameter.
Figure 8.5: Training pipeline showing the three main components: the data pipeline (ingestion, preprocessing, batching), the training loop (forward pass, loss calculation, backward pass), and the evaluation pipeline (validation and metrics computation).
These components collectively process raw data, train the model, and assess its
performance, ensuring that the training process is efficient and effective.
The data pipeline initiates the process by ingesting raw data and transforming
it into a format suitable for the model. This data is passed to the training loop,
where the model performs its core computations to learn from the inputs.
Periodically, the evaluation pipeline assesses the model’s performance using
a separate validation dataset. This modular structure ensures that each stage
operates efÏciently while contributing to the overall workflow.
This error signal is then propagated backward through the network us-
ing backpropagation, which applies the chain rule of differentiation to
compute gradients for each layer’s parameters. These gradients indicate
the necessary adjustments required to minimize the loss.
3. Step 3 – Update Parameters: The computed gradients are passed to an
optimizer, which updates the model’s parameters to minimize the loss.
Different optimization algorithms, such as SGD or Adam, influence how
the parameters are adjusted. The choice of optimizer impacts convergence
speed and stability.
This process repeats iteratively across multiple batches and epochs, gradually
refining the model to improve its predictive accuracy.
Figure 8.7: Data pipeline architecture illustrating the flow of data from raw storage through CPU preprocessing stages to GPU training units.
8.4.2.2 Preprocessing
As the data becomes available, data preprocessing transforms raw input data
into a format suitable for model training. This process, traditionally imple-
mented through Extract-Transform-Load (ETL) or Extract-Load-Transform
(ELT) pipelines, is a critical determinant of training system performance. The
throughput of preprocessing operations can be expressed mathematically as:
$$T_{\text{preprocessing}} = \frac{N_{\text{workers}}}{t_{\text{transform}}}$$
where:
• 𝐵GPU_transfer represents GPU memory bandwidth
• 𝐵GPU_compute represents GPU computational throughput
This relationship illustrates a fundamental principle in training system de-
sign: the system’s overall performance is limited by its slowest component.
Whether preprocessing speed, data transfer rates, or computational capacity,
the bottleneck stage determines the effective training throughput of the en-
tire system. Understanding these relationships enables system architects to
design balanced training pipelines where preprocessing capacity aligns with
computational resources, ensuring optimal resource utilization.
hardware capabilities. The Pipeline Delivery Rate (𝑅pipeline ) is the rate at which
the data pipeline can deliver preprocessed images to the GPU.
In this case, at a high level, the system’s effective training speed is governed
by the lower of these two rates. When 𝑅pipeline is less than 𝑅GPU , the system
experiences underutilization of GPU resources. The degree of GPU utilization
can be expressed as:
$$\text{GPU Utilization} = \frac{R_{\text{pipeline}}}{R_{\text{GPU}}} \times 100\%$$
Let us consider an example. A ResNet-50 model implemented on modern
GPU hardware might achieve a processing rate of 1000 images per second.
However, if the data pipeline can only deliver 200 images per second, the
GPU utilization would be merely 20%, meaning the GPU remains idle 80%
of the time. This results in significantly reduced training efficiency. Importantly, this inefficiency persists even with more powerful GPU hardware, as the
pipeline throughput becomes the limiting factor in system performance. This
demonstrates why balanced system design, where pipeline and computational
capabilities are well-matched, is crucial for optimal training performance.
This equation captures three components: storage read time ($t_{\text{fetch}}$), preprocessing time ($t_{\text{process}}$), and accelerator transfer time ($t_{\text{transfer}}$).
Modern training architectures optimize performance by overlapping these
operations. When one batch undergoes preprocessing, the system simulta-
neously fetches the next batch from storage while transferring the previously
processed batch to accelerator memory.
This coordinated movement requires precise management of system re-
sources, particularly memory buffers and processing units. The memory hier-
archy must account for bandwidth disparities while maintaining continuous
data flow. Effective pipelining minimizes idle time and maximizes resource
utilization through careful buffer sizing and memory allocation strategies. The
successful orchestration of these components enables efficient training across
the memory hierarchy while managing the inherent bandwidth constraints of
each tier.
where 𝐵 represents the batch size, 𝐿 is the number of layers, and 𝐴𝑙 repre-
sents the activation size at layer 𝑙. This simple equation masks considerable
complexity in practice.
Consider ResNet-50 processing images at 224 × 224 resolution with a batch
size of 32. The initial convolutional layer produces activation maps of dimension
112 × 112 × 64. Using single-precision floating-point format (4 bytes per value),
this single layer’s activation storage requires approximately 98 MB. As the
network progresses through its 50 layers, the dimensions of these activation
maps change, typically decreasing in spatial dimensions while increasing in
channel depth, creating a cumulative memory demand that can reach several
gigabytes.
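The arithmetic behind that first-layer figure is easy to reproduce (a rough estimate that ignores padding and framework overhead):

# Activation memory for ResNet-50's first conv output at batch size 32 (FP32)
batch, height, width, channels = 32, 112, 112, 64
bytes_per_value = 4
layer_bytes = batch * height * width * channels * bytes_per_value
print(layer_bytes / 2**20)   # ~98 MiB for a single layer's activations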
Modern GPUs typically provide between 16 and 24 GB of memory, which
must accommodate not just these activations but also model parameters, gradi-
ents, and optimization states. This constraint has motivated several memory
management strategies:
Activation checkpointing trades computational cost for memory efficiency by
strategically discarding and recomputing activations during the backward pass.
Rather than storing all intermediate values, the system maintains checkpoints
at selected layers. During backpropagation, it regenerates necessary activations
from these checkpoints. While this approach can reduce memory usage by 50%
or more, it typically increases computation time by 20-30%.
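A hedged sketch of this technique in recent PyTorch versions (the checkpointed block is an illustrative assumption): torch.utils.checkpoint recomputes the block's activations during the backward pass instead of storing them.

import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

block = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU(),
                      nn.Linear(1024, 1024), nn.ReLU())

x = torch.randn(64, 1024, requires_grad=True)
y = checkpoint(block, x, use_reentrant=False)  # activations inside `block` are not stored
loss = y.sum()
loss.backward()                                # the block runs forward again here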
Mixed precision training offers another approach to memory efficiency. By
storing activations in half-precision (FP16) format instead of single-precision
(FP32), memory requirements are immediately halved. Modern hardware ar-
chitectures provide specialized support for these reduced-precision operations,
often maintaining computational throughput while saving memory.
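A minimal sketch of mixed-precision training in PyTorch (the model, data, and optimizer are illustrative assumptions, and a CUDA device is assumed): autocast runs eligible operations in FP16 while the gradient scaler guards against underflow.

import torch
import torch.nn as nn

model = nn.Linear(1024, 10).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
scaler = torch.cuda.amp.GradScaler()

x = torch.randn(32, 1024, device="cuda")
y = torch.randint(0, 10, (32,), device="cuda")

with torch.autocast(device_type="cuda", dtype=torch.float16):
    loss = nn.functional.cross_entropy(model(x), y)   # activations held in FP16

scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()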
The relationship between batch size and memory usage creates practical trade-
offs in training regimes. While larger batch sizes can improve computational
stored gradients, and write the modified parameters back to memory. Different
optimizers vary in their memory requirements and computational patterns,
directly affecting system performance and resource utilization.
Figure 8.8: Example memory footprint breakdown (weights, weight gradients, optimizer state, and activations, in GB) for the Llama-7B model under different training schemes (BF16, Adafactor, 8-bit Adam, 8-bit GaLore), relative to an RTX 4090's memory limit. Note how, in the unoptimized bfloat16 case, optimizer state and weight gradients combined can take up more than double the footprint of the model weights.
This increase in memory usage directly affects the parameter update process,
as it determines how much data is available for computing gradients in each
iteration.
Larger batches tend to improve hardware utilization, particularly on GPUs
and TPUs optimized for parallel processing. This can lead to more efficient
parameter updates and faster training times, provided sufÏcient memory is
available.
However, there’s a trade-off to consider. While larger batches can improve
computational efficiency by allowing more parallel computations during gradi-
ent calculation and parameter updates, they also require more memory. On
systems with limited memory, this might necessitate reducing the batch size,
potentially slowing down training or leading to less stable parameter updates.
The choice of batch size interacts with various aspects of the optimization pro-
cess. For instance, it affects the frequency of parameter updates: larger batches
result in less frequent but potentially more impactful updates. Additionally,
batch size influences the behavior of adaptive optimization algorithms, which
may need to be tuned differently depending on the batch size. In distributed
training, which we discuss later, batch size often determines the degree of data
parallelism, impacting how gradient computations and parameter updates are
distributed across devices.
Determining the optimal batch size involves balancing these factors within
hardware constraints. It often requires experimentation to find the sweet spot
that maximizes both learning efficiency and hardware utilization while ensuring
effective parameter updates.
8.5.1.1 Mechanics
Prefetching and overlapping optimize the training pipeline by enabling different
stages of data processing and computation to operate concurrently rather than
sequentially. These techniques maximize resource utilization by addressing
bottlenecks in data transfer and preprocessing.
As you recall, training data undergoes three main stages: retrieval from
storage, transformation into a suitable format, and utilization in model training.
An unoptimized pipeline executes these stages sequentially. The GPU remains
idle during data fetching and preprocessing, waiting for data preparation to
complete. This sequential execution creates significant inefficiencies in the
training process.
Prefetching eliminates waiting time by loading data asynchronously during
model computation. Data loaders operate as separate threads or processes,
preparing the next batch while the current batch trains. This ensures immediate
data availability for the GPU when the current batch completes.
Overlapping extends this efÏciency by coordinating all three pipeline stages
simultaneously. As the GPU processes one batch, preprocessing begins on the
next batch, while data fetching starts for the subsequent batch. This coordina-
tion maintains constant activity across all pipeline stages.
In PyTorch, for example, prefetching is configured directly on the data loader:

from torch.utils.data import DataLoader

loader = DataLoader(dataset,
                    batch_size=32,
                    num_workers=4,      # background worker processes prepare batches
                    prefetch_factor=2)  # batches each worker keeps ready in advance
8.5.1.2 Benefits
Prefetching and overlapping are powerful techniques that significantly enhance
the efficiency of training pipelines by addressing key bottlenecks in data han-
dling and computation. To illustrate the impact of these benefits, Table 8.4
presents the following comparison:
Without these optimizations, the GPU frequently sits idle while waiting for data to be fetched and preprocessed. This idle time creates inefficiencies, especially in workflows where data augmentation or
preprocessing involves complex transformations. By introducing asynchronous
data loading and overlapping, these techniques ensure that the GPU consistently
has data ready to process, eliminating unnecessary delays.
Another important benefit is the reduction in overall training time. Prefetch-
ing and overlapping allow the computational pipeline to operate continuously,
with multiple stages working simultaneously rather than sequentially. For
example, while the GPU processes the current batch, the data loader fetches
and preprocesses the next batch, ensuring a steady flow of data through the
system. This parallelism minimizes latency between training iterations, allow-
ing for faster completion of training cycles, particularly in scenarios involving
large-scale datasets.
Additionally, these techniques are highly scalable and adaptable to various
hardware configurations. Prefetching buffers and overlapping mechanisms
can be tuned to match the specific requirements of a system, whether the
bottleneck lies in slow storage, limited network bandwidth, or computational
constraints. By aligning the data pipeline with the capabilities of the underlying
hardware, prefetching and overlapping maximize resource utilization, making
them invaluable for large-scale machine learning workflows.
Overall, prefetching and overlapping directly address some of the most
common inefÏciencies in training pipelines. By optimizing data flow and
computation, these methods not only improve hardware efÏciency but also
enable the training of more complex models within shorter timeframes.
Consider natural language processing workloads, where raw text must be
tokenized, converted to numerical IDs, and padded into fixed-length tensors
before training. In a BERT model training pipeline, these steps might process
thousands of sentences per batch. Prefetching allows this text processing to
happen concurrently with model training, while overlapping coordinates the
associated data transfers with computation. This is especially useful in
transformer-based models like BERT or GPT, which require consistent throughput
to maintain efficiency given their high computational demand.
Distributed training systems, which we will discuss next, involve multiple
GPUs or nodes, present another critical application for prefetching and over-
lapping. In distributed setups, network latency and data transfer rates often
become the primary bottleneck. Prefetching mitigates these issues by ensur-
ing that data is ready and available before it is required by any specific GPU.
Overlapping further optimizes distributed training pipelines by coordinating
the data preprocessing on individual nodes while the central computation
continues, thus reducing overall synchronization delays.
Beyond these domains, prefetching and overlapping are particularly valuable
in workflows involving large-scale datasets stored on remote or cloud-based
systems. When training on cloud platforms, the data may need to be fetched
over a network or from distributed storage, which introduces additional latency.
Using prefetching and overlapping in such cases helps minimize the impact
of these delays, ensuring that training proceeds smoothly despite slower data
access speeds.
These use cases illustrate how prefetching and overlapping address inefÏcien-
cies in various machine learning pipelines. By optimizing the flow of data and
computation, these techniques enable faster, more reliable training workflows
across a wide range of applications.
These techniques do, however, introduce trade-offs. Adding more data-loading
workers can improve throughput up to a point, but beyond that, it may lead to
contention for CPU resources or even degrade performance due to excessive context switching.
Determining the optimal configuration often requires empirical testing, which
can be time-consuming. A common starting point is to set num_workers to the
number of CPU cores available. However, on a 16-core system processing large
images, using all cores for data loading might leave insufÏcient CPU resources
for other essential operations, potentially slowing down the entire pipeline.
Debugging also becomes more complex in pipelines that employ prefetching
and overlapping. Asynchronous data loading and multithreading or multi-
processing introduce potential race conditions, deadlocks, or synchronization
issues. Diagnosing errors in such systems can be challenging because the execu-
tion flow is no longer straightforward. Developers may need to invest additional
effort into monitoring, logging, and debugging tools to ensure that the pipeline
operates reliably.
Moreover, there are scenarios where prefetching and overlapping may offer
minimal benefits. For instance, in systems where storage access or network
bandwidth is significantly faster than the computation itself, these techniques
might not noticeably improve throughput. In such cases, the additional com-
plexity and memory overhead introduced by prefetching may not justify its
use.
Finally, prefetching and overlapping require careful coordination across dif-
ferent components of the training pipeline, such as storage, CPUs, and GPUs.
Poorly designed pipelines can lead to imbalances where one stage becomes a
bottleneck, negating the advantages of these techniques. For example, if the
data loading process is too slow to keep up with the GPU’s processing speed,
the benefits of overlapping will be limited.
Despite these challenges, prefetching and overlapping remain essential tools
for optimizing training pipelines when used appropriately. By understanding
and addressing their trade-offs, practitioners can implement these techniques
effectively, ensuring smoother and more efÏcient machine learning workflows.
The bfloat16 format preserves the same dynamic range as FP32, because it keeps
the 8-bit exponent, but with reduced precision (3-4 decimal digits). This range
preservation makes bfloat16 particularly suited for deep learning training, as it
handles large and small gradients more effectively than FP16.
The hybrid approach proceeds in three main phases, as illustrated in Fig-
ure 8.11. During the forward pass, input data converts to reduced precision
(FP16 or bfloat16), and matrix multiplications execute in this format, including
activation function computations. In the gradient computation phase, the back-
ward pass calculates gradients in reduced precision, but results are stored in
FP32 master weights. Finally, during weight updates, the optimizer updates
the main weights in FP32, and these updated weights convert back to reduced
precision for the next forward pass.
Figure 8.11: Mixed precision training flow. The forward pass and backpropagation run in FP16 with loss scaling; the scaled gradients are copied to FP32, unscaled (and optionally clipped), applied to the FP32 master weights, and the updated weights are copied back to FP16 for the next iteration.
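A minimal sketch of this three-phase flow, assuming PyTorch's torch.cuda.amp utilities and a placeholder model and batch: autocast runs the forward math in reduced precision, while GradScaler applies loss scaling and keeps the FP32 master weights stable.

import torch
from torch import nn

model = nn.Linear(512, 512).to("cuda")                       # placeholder model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()
inputs = torch.randn(64, 512, device="cuda")
targets = torch.randn(64, 512, device="cuda")

with torch.cuda.amp.autocast():                              # forward pass in reduced precision
    loss = nn.functional.mse_loss(model(inputs), targets)
scaler.scale(loss).backward()                                # scaled loss, backward pass
scaler.step(optimizer)                                       # unscale gradients, FP32 update
scaler.update()                                              # adjust the loss scale for the next step
optimizer.zero_grad(set_to_none=True)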
8.5.2.4 Benefits
Mixed-precision training offers several significant advantages that make it an
essential optimization technique for modern machine learning workflows. By
reducing memory usage and computational load, it enables practitioners to
train larger models, process bigger batches, and achieve faster results, all while
maintaining model accuracy and convergence.
One of the most prominent benefits of mixed-precision training is its sub-
stantial reduction in memory consumption. FP16 computations require only
half the memory of FP32 computations, which directly reduces the storage
required for activations, weights, and gradients during training. For instance,
a transformer model with 1 billion parameters requires 4 GB of memory for
weights in FP32, but only 2 GB in FP16. This memory efficiency allows for
larger batch sizes, which can lead to more stable gradient estimates and faster
convergence. Additionally, with less memory consumed per operation, practitioners
can train deeper and more complex models on the same hardware,
unlocking capabilities that were previously limited by memory constraints.10

10 Transformers are neural networks that use attention mechanisms to dynamically
capture relationships between elements in sequential data. Unlike traditional
architectures, transformers can process all sequence elements in parallel through
multi-head attention, where each head learns different relationship patterns. This
parallelization enables efficient processing of long sequences, making transformers
particularly effective for tasks like language modeling and sequence translation.

Another key advantage is the acceleration of computations. Modern GPUs,
such as those equipped with Tensor Cores, are specifically optimized for FP16
operations. These cores enable hardware to process more operations per cycle
compared to FP32, resulting in faster training times. For matrix multiplication
operations, which constitute 80-90% of training computation time in large
models, FP16 can achieve 2-3× speedup compared to FP32. This computational
speedup becomes particularly noticeable in large-scale models.
In distributed training, reduced precision also shrinks the size of activations, weights, and gradients exchanged between devices. For
example, in a distributed training setup with 8 GPUs, reducing tensor sizes
from FP32 to FP16 can halve the communication bandwidth requirements from
320 GB/s to 160 GB/s. This optimization is especially beneficial in cloud-based
environments, where resource allocation and cost efÏciency are paramount.
Additionally, mixed-precision training is increasingly used in areas such as
speech processing, generative modeling, and scientific simulations. Models in
these fields often have large data and parameter requirements that can push
the limits of traditional FP32 workflows. By optimizing memory usage and
leveraging the speedups provided by Tensor Cores, practitioners can train
state-of-the-art models faster and more cost-effectively.
The adaptability of mixed-precision training to diverse tasks and domains
underscores its importance in modern machine learning. Whether applied
to large-scale natural language models, computationally intensive vision ar-
chitectures, or distributed training environments, this technique empowers
researchers and engineers to push the boundaries of what is computationally
feasible.
8.5.3.1 Mechanics
Gradient accumulation and activation checkpointing operate on distinct princi-
ples, but both aim to optimize memory usage during training by modifying
how forward and backward computations are handled.
Gradient Accumulation. Gradient accumulation simulates larger batch sizes by
splitting a single effective batch into smaller “micro-batches.” As illustrated in
Figure 8.12, during each forward and backward pass, the gradients for a micro-
batch are computed and added to an accumulated gradient buffer. Instead
of immediately applying the gradients to update the model parameters, this
process repeats for several micro-batches. Once the gradients from all micro-
batches in the effective batch are accumulated, the parameters are updated
using the combined gradients.
This process allows models to achieve the benefits of training with larger
batch sizes, such as improved gradient estimates and convergence stability,
without requiring the memory to store an entire batch at once. For instance, in
PyTorch, this can be implemented by adjusting the learning rate proportionally
to the number of accumulated micro-batches and calling optimizer.step()
only after processing the entire effective batch.
The key steps in gradient accumulation are: (1) compute the forward and backward pass for a micro-batch, (2) add the resulting gradients to an accumulation buffer without updating parameters, (3) repeat for all micro-batches in the effective batch, and (4) apply the accumulated gradients in a single optimizer step, as sketched in the code after Figure 8.12.
Figure 8.12: Gradient accumulation. Losses L1, L2, and L3 from three micro-batches produce gradients δ1, δ2, and δ3 during the backward pass, which are summed (δ1 + δ2 + δ3) before the parameter update.
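The following is a minimal PyTorch sketch of the loop described above, assuming a placeholder model and synthetic micro-batches: gradients from several micro-batches accumulate in the parameters' .grad buffers, and the optimizer steps only once per effective batch. Dividing each loss by the number of accumulation steps makes the accumulated gradient an average rather than a sum.

import torch
from torch import nn

accum_steps = 4                                         # micro-batches per effective batch
model = nn.Linear(32, 1).to("cuda")                     # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
micro_batches = [(torch.randn(8, 32, device="cuda"),
                  torch.randn(8, 1, device="cuda")) for _ in range(8)]   # synthetic data

optimizer.zero_grad(set_to_none=True)
for step, (x, y) in enumerate(micro_batches):
    loss = nn.functional.mse_loss(model(x), y)
    (loss / accum_steps).backward()                     # gradients accumulate in .grad
    if (step + 1) % accum_steps == 0:
        optimizer.step()                                # update after a full effective batch
        optimizer.zero_grad(set_to_none=True)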
8.5.3.2 Benefits
Gradient accumulation and activation checkpointing provide solutions to the
memory limitations often encountered in training large-scale machine learning
models. By optimizing how memory is used during training, these techniques
enable the development and deployment of complex architectures, even on
hardware with constrained resources.
One of the primary benefits of gradient accumulation is its ability to simulate
larger batch sizes without increasing the memory requirements for storing the
full batch. Larger batch sizes are known to improve gradient estimates, leading
to more stable convergence and faster training. With gradient accumulation,
practitioners can achieve these benefits while working with smaller micro-
batches that fit within the GPU’s memory. This flexibility is useful when training
models on high-resolution data, such as large images or 3D volumetric data,
where even a single batch may exceed available memory.
Activation checkpointing, on the other hand, significantly reduces the mem-
ory footprint of intermediate activations during the forward pass. This allows
for the training of deeper models, which would otherwise be infeasible due to
memory constraints. By discarding and recomputing activations as needed,
checkpointing frees up memory that can be used for larger models, additional
layers, or higher resolution data. This is especially important in state-of-the-art
architectures, such as transformers or dense convolutional networks, which
require substantial memory to store intermediate computations.
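As a hedged illustration, PyTorch exposes this recompute-instead-of-store behavior through torch.utils.checkpoint; the sketch below checkpoints a deep stack of placeholder layers in four segments, so only segment-boundary activations are kept and the rest are recomputed during the backward pass.

import torch
from torch import nn
from torch.utils.checkpoint import checkpoint_sequential

# Placeholder deep network: 16 blocks of Linear + ReLU.
layers = nn.Sequential(*[nn.Sequential(nn.Linear(1024, 1024), nn.ReLU())
                         for _ in range(16)]).to("cuda")
x = torch.randn(32, 1024, device="cuda", requires_grad=True)

out = checkpoint_sequential(layers, 4, x)   # 4 segments: store less, recompute more
out.sum().backward()                        # recomputation happens during this call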
Both techniques enhance the scalability of machine learning workflows. In
resource-constrained environments, such as cloud-based platforms or edge
devices, these methods provide a means to train models efficiently without exceeding the limits of the available hardware.
8.5.4 Comparison
As summarized in Table 8.5, these techniques vary in their implementation
complexity, hardware requirements, and impact on computation speed and
memory usage. The selection of an appropriate optimization strategy depends
on factors such as the specific use case, available hardware resources, and the
nature of performance bottlenecks in the training process.
In standard minibatch stochastic gradient descent, the average gradient over a batch of size $B$ is

$$g = \frac{1}{B} \sum_{i=1}^{B} \nabla_{\theta} L(\theta, x_i)$$
In data parallelism with 𝑁 devices, each device 𝑘 computes gradients on its
own minibatch 𝐵𝑘 :
$$g_k = \frac{1}{|B_k|} \sum_{x_i \in B_k} \nabla_{\theta} L(\theta, x_i)$$

and these per-device gradients are then averaged across all $N$ devices:

$$g_{\text{global}} = \frac{1}{N} \sum_{k=1}^{N} g_k$$
This averaging is mathematically equivalent to computing the gradient on the combined batch $B_{\text{total}} = \bigcup_{k=1}^{N} B_k$:

$$g_{\text{global}} = \frac{1}{|B_{\text{total}}|} \sum_{x_i \in B_{\text{total}}} \nabla_{\theta} L(\theta, x_i)$$
This equivalence shows why data parallelism maintains the statistical properties of SGD training: the approach distributes distinct data subsets across devices while producing the same averaged gradient as single-device training.
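The equivalence is easy to verify numerically. The short sketch below, using an arbitrary squared-error loss and two equal-sized shards, checks that averaging per-shard gradients matches the full-batch gradient; the tensors and shapes are illustrative.

import torch

torch.manual_seed(0)
theta = torch.randn(5, requires_grad=True)                    # shared model parameters
X, y = torch.randn(8, 5), torch.randn(8)                      # one "global" batch of 8 examples

def loss_fn(X, y):
    return ((X @ theta - y) ** 2).mean()

full_grad = torch.autograd.grad(loss_fn(X, y), theta)[0]      # single-device gradient
g1 = torch.autograd.grad(loss_fn(X[:4], y[:4]), theta)[0]     # "device 1" shard
g2 = torch.autograd.grad(loss_fn(X[4:], y[4:]), theta)[0]     # "device 2" shard
print(torch.allclose(full_grad, (g1 + g2) / 2, atol=1e-6))    # True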
8.6.1.1 Mechanics
The process of data parallelism can be broken into a series of distinct steps,
each with its role in ensuring the system operates efÏciently. These steps are
illustrated in Figure 8.15.
Figure 8.15: Data parallelism workflow. Input data is split across GPUs, each GPU computes gradients on its shard, the gradients are synchronized and aggregated, and the aggregated gradients are used to update the parameters.
Dataset Splitting. The first step in data parallelism involves dividing the
dataset into smaller, non-overlapping subsets. This ensures that each device
processes a unique portion of the data, avoiding redundancy and enabling efÏ-
cient utilization of available hardware. For instance, with a dataset of 100,000
training examples and 4 GPUs, each GPU would be assigned 25,000 examples.
Modern frameworks like PyTorch’s DistributedSampler handle this distribution
automatically, implementing prefetching and caching mechanisms to ensure
data is readily available for processing.
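A hedged sketch of this sharding step, assuming the process group has already been initialized (for example via torchrun) and that dataset is defined elsewhere: DistributedSampler gives each rank a distinct shard, and set_epoch reshuffles the shards every epoch.

from torch.utils.data import DataLoader, DistributedSampler

sampler = DistributedSampler(dataset, shuffle=True)     # one unique shard per rank
loader = DataLoader(dataset, batch_size=32,
                    sampler=sampler, num_workers=4)

for epoch in range(10):                                 # illustrative epoch count
    sampler.set_epoch(epoch)                            # vary the shuffle each epoch
    for inputs, targets in loader:
        pass                                            # forward/backward/step as usual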
Device Forward Pass. Once the data subsets are distributed, each device per-
forms the forward pass independently. During this stage, the model processes
its assigned batch of data, generating predictions and calculating the loss. For
example, in a ResNet-50 model, each GPU would independently compute the
convolutions, activations, and final loss for its batch. The forward pass is com-
putationally intensive and benefits from hardware accelerators like NVIDIA
V100 GPUs or Google TPUs, which are optimized for matrix operations.
Backward Pass and Calculation. Following the forward pass, each device
computes the gradients of the loss with respect to the model’s parameters
during the backward pass. Modern frameworks like PyTorch and TensorFlow
handle this automatically through their autograd systems. For instance, if
a model has 50 million parameters, each device calculates gradients for all
parameters but based only on its local data subset.
Gradient Synchronization. To maintain consistency across the distributed
system, the gradients computed by each device must be synchronized. This step
typically uses the ring all-reduce algorithm, where each GPU communicates
only with its neighbors, reducing communication overhead. For example, with
8 GPUs, each sharing gradients for a 100MB model, ring all-reduce requires only
7 communication steps instead of the 56 steps needed for naive peer-to-peer
synchronization.
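For intuition, the sketch below shows what this synchronization amounts to at the framework level: after each rank's backward pass, gradients are summed across ranks with an all-reduce and divided by the world size. It assumes torch.distributed has been initialized; in practice DistributedDataParallel performs this step automatically and overlaps it with the backward pass.

import torch.distributed as dist

def average_gradients(model):
    """Average gradients across all ranks after loss.backward()."""
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is not None:
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)   # ring all-reduce under NCCL
            param.grad /= world_size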
Parameter Updating. After gradient aggregation,11 each device independently
updates model parameters using the chosen optimization algorithm, such as
SGD with momentum or Adam. This decentralized update strategy, implemented
in frameworks like PyTorch's DistributedDataParallel (DDP), enables
efficient parameter updates without requiring a central coordination server.
Since all devices have identical gradient values after synchronization, they
perform mathematically equivalent updates to maintain model consistency
across the distributed system.

For example, in a system with 8 GPUs training a ResNet model, each GPU
computes local gradients based on its data subset. After gradient averaging via
ring all-reduce, every GPU has the same global gradient values. Each device
then independently applies these gradients using the optimizer's update rule. If
using SGD with learning rate 0.1, the update would be weights = weights -
0.1 * gradients. This process maintains mathematical equivalence to single-device
training while enabling distributed computation.

11 The choice between summing or averaging gradients impacts model training
dynamics. Gradient summation requires scaling the learning rate by the number
of workers to maintain consistent update magnitudes. While gradient averaging
provides more stable updates with reduced variance, it requires a central
coordination node that can become a bottleneck as the number of workers
increases. The decision depends on the specific distributed training setup and
optimization goals.

This process, which involves splitting data, performing computations, syn-
chronizing results, and updating parameters, repeats for each batch of data.
Modern frameworks automate this cycle, allowing developers to focus on model
architecture and hyperparameter tuning rather than distributed computing
logistics.
8.6.1.2 Benefits
Data parallelism offers several key benefits that make it the predominant ap-
proach for distributed training. By splitting the dataset across multiple devices
and allowing each device to train an identical copy of the model, this approach
effectively addresses the core challenges in modern AI training systems.
The primary advantage of data parallelism is its linear scaling capability with
large datasets. As datasets grow into the terabyte range, processing them on a
single machine becomes prohibitively time-consuming. For example, training
a vision transformer on ImageNet (1.2 million images) might take weeks on a
single GPU, but only days when distributed across 8 GPUs. This scalability is
particularly valuable in domains like language modeling, where datasets can
exceed billions of tokens.
Hardware utilization efficiency represents another crucial benefit. Data parallelism
maintains high GPU utilization rates, typically above 85%, by ensur-
ing each device actively processes its data portion. Modern implementations
achieve this through asynchronous data loading and gradient computation
overlapping with communication. For instance, while one batch computes
gradients, the next batch’s data is already being loaded and preprocessed.
Implementation simplicity sets data parallelism apart from other distribution
strategies. Modern frameworks have reduced complex distributed training
to just a few lines of code. For example, converting a PyTorch model to use
data parallelism often requires only wrapping it in DistributedDataParallel
and initializing a distributed environment. This accessibility has contributed
significantly to its widespread adoption in both research and industry.
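A minimal sketch of that conversion, assuming the script is launched with torchrun so that the rank-related environment variables are set and each process drives one GPU; the model and optimizer are placeholders.

import os
import torch
import torch.distributed as dist
from torch import nn
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="nccl")                 # one process per GPU
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = nn.Linear(128, 10).to(local_rank)               # placeholder model
model = DDP(model, device_ids=[local_rank])             # gradients sync automatically
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

From here, the training loop is unchanged; DDP hooks into the backward pass to average gradients across ranks.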
The approach also offers remarkable flexibility across model architectures.
Whether training a ResNet (vision), BERT (language), or Graph Neural Network
(graph data), the same data parallelism principles apply without modification.
This universality makes it particularly valuable as a default choice for dis-
tributed training.
Training time reduction is perhaps the most immediate benefit. Given proper
implementation, data parallelism can achieve near-linear speedup with addi-
tional devices. Training that takes 100 hours on a single GPU might complete
in roughly 13 hours on 8 GPUs, assuming efÏcient gradient synchronization
and minimal communication overhead.
While these benefits make data parallelism compelling, it’s important to
note that achieving these advantages requires careful system design. The next
section examines the challenges that must be addressed to fully realize these
benefits.
8.6.1.3 Challenges
While data parallelism is a powerful approach for distributed training, it in-
troduces several challenges that must be addressed to achieve efÏcient and
scalable training systems. These challenges stem from the inherent trade-offs
between computation and communication, as well as the limitations imposed
by hardware and network infrastructures.
Communication overhead represents the most significant bottleneck in data
parallelism. During gradient synchronization, each device must exchange
gradient updates, often hundreds of megabytes per step for large models.
With 8 GPUs training a 1-billion-parameter model, each synchronization step
might require transferring several gigabytes of data across the network. While
high-speed interconnects like NVLink (300 GB/s) or InfiniBand (200 Gb/s) help,
the overhead remains substantial. NCCL's ring all-reduce12 algorithm reduces
this overhead by limiting each GPU to communication with its immediate neighbors.

12 A communication strategy that minimizes data transfer overhead by organizing
devices in a ring topology, first introduced for distributed machine learning in
Horovod.
8.6.2.1 Mechanics
Model parallelism divides neural networks across multiple computing devices,
with each device computing a distinct portion of the model’s operations. This
division allows training of models whose parameter counts exceed single-device
memory capacity. The technique encompasses device coordination, data flow
management, and gradient computation across distributed model segments.
The mechanics of model parallelism are illustrated in Figure 8.16. These steps
are described next:
Forward Pass Intermediate Data Intermediate Data Output
Model Partitioning. The first step in model parallelism is dividing the model
into smaller segments. For instance, in a deep neural network, layers are often
divided among devices. In a system with two GPUs, the first half of the layers
might reside on GPU 1, while the second half resides on GPU 2. Another
approach is to split computations within a single layer, such as dividing matrix
multiplications in transformer models across devices.
Model Forward Pass. During the forward pass, data flows sequentially through
the partitions. For example, data processed by the first set of layers on GPU
1 is sent to GPU 2 for processing by the next set of layers. This sequential
flow ensures that the entire model is used, even though it is distributed across
multiple devices. EfÏcient inter-device communication is crucial to minimize
delays during this step (Research 2021).
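A hedged sketch of this layer-wise split across two GPUs, with illustrative layer sizes: the first stage lives on cuda:0, the second on cuda:1, and the intermediate activations are moved between devices inside forward(). Autograd routes the backward pass across the same device boundary.

import torch
from torch import nn

class TwoStageModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.stage1 = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU()).to("cuda:0")
        self.stage2 = nn.Sequential(nn.Linear(4096, 1024), nn.ReLU()).to("cuda:1")

    def forward(self, x):
        x = self.stage1(x.to("cuda:0"))
        x = self.stage2(x.to("cuda:1"))     # intermediate activations cross devices here
        return x

model = TwoStageModel()
out = model(torch.randn(32, 1024))
out.sum().backward()                        # gradients flow back across both GPUs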
Backward Pass and Calculation. The backward pass computes gradients
through the distributed model segments in reverse order. Each device cal-
culates local gradients for its parameters and propagates necessary gradient
information to previous devices. In transformer models, this means backprop-
agating through attention computations and feed-forward networks across
device boundaries.
For example, in a two-device setup with attention mechanisms split between
devices, the backward computation works as follows: The second device com-
putes gradients for the final feed-forward layers and attention heads. It then
sends the gradient tensors for the attention output to the first device. The
first device uses these received gradients to compute updates for its attention
parameters and earlier layer weights.
Parameter Updates. Parameter updates occur independently on each device
using the computed gradients and an optimization algorithm. A device holding
attention layer parameters applies updates using only the gradients computed
for those specific parameters. This localized update approach differs from data
parallelism, which requires gradient averaging across devices.
The optimization step proceeds as follows: Each device applies its chosen
optimizer (such as Adam or AdaFactor) to update its portion of the model
parameters. A device holding the first six transformer layers updates only
those layers’ weights and biases. This local parameter update eliminates the
need for cross-device synchronization during the optimization step, reducing
communication overhead.
Iterative Process. Like other training strategies, model parallelism repeats
these steps for every batch of data. As the dataset is processed over multiple
iterations, the distributed model converges toward optimal performance.
Parallelism Variations. Model parallelism can be implemented through dif-
ferent strategies for dividing the model across devices. The three primary ap-
proaches are layer-wise partitioning, operator-level partitioning, and pipeline
parallelism, each suited to different model structures and computational needs.
Layer-wise Partitioning. Layer-wise partitioning assigns distinct model layers
to separate computing devices. In transformer architectures, this translates to
specific devices managing defined sets of attention and feed-forward blocks.
As illustrated in Figure 8.17, a 24-layer transformer model distributed across
four devices assigns six consecutive transformer blocks to each device. Device
1 processes blocks 1-6, device 2 handles blocks 7-12, and so forth.
Figure 8.17: Example of pipeline parallelism, with blocks 1-6, 7-12, 13-18, and 19-24 assigned to four successive devices.
This sequential processing introduces device idle time, as each device must
wait for the previous device to complete its computation before beginning work.
For example, while device 1 processes the initial blocks, devices 2, 3, and 4
remain inactive. Similarly, when device 2 begins its computation, device 1
sits idle. This pattern of waiting and idle time reduces hardware utilization
efÏciency compared to other parallelization strategies.
Operator-level partitioning, by contrast, splits the computation within a single layer across devices; for example, one device computes the first half of a layer's output features while device 2 computes the remaining 8192 features. This division reduces peak memory usage while maintaining mathematical equivalence to the original computation.
Summary. Each of these partitioning methods addresses specific challenges in
training large models, and their applicability depends on the model architecture
and the resources available. By selecting the appropriate strategy, practitioners
can train models that exceed the limits of individual devices, enabling the
development of cutting-edge machine learning systems.
8.6.2.2 Benefits
Model parallelism offers several significant benefits, making it an essential
strategy for training large-scale models that exceed the capacity of individual
devices. These advantages stem from its ability to partition the workload across
multiple devices, enabling the training of more complex and resource-intensive
architectures.
Memory scaling represents the primary advantage of model parallelism. Cur-
rent transformer architectures contain up to hundreds of billions of parameters.
A 175 billion parameter model with 32-bit floating point precision requires 700
GB of memory just to store its parameters. When accounting for activations,
optimizer states, and gradients during training, the memory requirement multi-
plies several fold. Model parallelism makes training such architectures feasible
by distributing these memory requirements across devices.
Another key advantage is the efÏcient utilization of device memory and com-
pute power. Since each device only needs to store and process a portion of the
model, memory usage is distributed across the system. This allows practition-
ers to work with larger batch sizes or more complex layers without exceeding
memory limits, which can also improve training stability and convergence.
Model parallelism also provides flexibility for different model architectures.
Whether the model is sequential, as in many natural language processing tasks,
or composed of computationally intensive operations, as in attention-based
models or convolutional networks, there is a partitioning strategy that fits the
architecture. This adaptability makes model parallelism applicable to a wide
variety of tasks and domains.
Finally, model parallelism is a natural complement to other distributed train-
ing strategies, such as data parallelism and pipeline parallelism. By combining
these approaches, it becomes possible to train models that are both large in
size and require extensive data. This hybrid flexibility is especially valuable in
cutting-edge research and production environments, where scaling models and
datasets simultaneously is critical for achieving state-of-the-art performance.
While model parallelism introduces these benefits, its effectiveness depends
on the careful design and implementation of the partitioning strategy. In the
next section, we will discuss the challenges associated with model parallelism
and the trade-offs involved in its use.
8.6.2.3 Challenges
While model parallelism provides a powerful approach for training large-scale models, it also introduces unique challenges, which arise largely from the need to coordinate computation and communication across device boundaries.
8.6.3.1 Mechanics
Hybrid parallelism operates by combining the processes of model partition-
ing and dataset splitting, ensuring efÏcient utilization of both memory and
computation across devices. This integration allows large-scale machine learn-
ing systems to overcome the constraints imposed by individual parallelism
strategies.
Model and Data Partitioning. Hybrid parallelism divides both model archi-
tecture and training data across devices. The model divides through layer-wise
or operator-level partitioning, where GPUs process distinct neural network
segments. Simultaneously, the dataset splits into subsets, allowing each device
group to train on different batches. A transformer model might distribute its
attention layers across four GPUs, while each GPU group processes a unique
1,000-example batch. This dual partitioning distributes memory requirements
and computational workload.
Forward Pass. During the forward pass, input data flows through the dis-
tributed model. Each device processes its assigned portion of the model using
the data subset it holds. For example, in a hybrid system with four devices,
two devices might handle different layers of the model (model parallelism)
while simultaneously processing distinct data batches (data parallelism). Com-
munication between devices ensures that intermediate outputs from model
partitions are passed seamlessly to subsequent partitions.
Backward Pass and Gradient Calculation. During the backward pass, gradi-
ents are calculated for the model partitions stored on each device. Data-parallel
devices that process the same subset of the model but different data batches
aggregate their gradients, ensuring that updates reflect contributions from
the entire dataset. For model-parallel devices, gradients are computed locally
and passed to the next layer in reverse order. In a two-device model-parallel
configuration, for example, the first device computes gradients for layers 1-3,
then transmits these to the second device for layers 4-6. This combination of
gradient synchronization and inter-device communication ensures consistency
across the distributed system.
Parameter Updates. After gradient synchronization, model parameters are
updated using the chosen optimization algorithm. Devices working in data
parallelism update their shared model partitions consistently, while model-
parallel devices apply updates to their local segments. EfÏcient communication
is critical in this step to minimize delays and ensure that updates are correctly
propagated across all devices.
8.6.3.2 Benefits
The adoption of hybrid parallelism in machine learning systems addresses
some of the most significant challenges posed by the ever-growing scale of
models and datasets. By blending the strengths of model parallelism and data
parallelism, this approach provides a comprehensive solution to scaling modern
machine learning workloads.
One of the most prominent benefits of hybrid parallelism is its ability to scale
seamlessly across both the model and the dataset. Modern neural networks,
particularly transformers used in natural language processing and vision ap-
plications, often contain billions of parameters. These models, paired with
massive datasets, make training on a single device impractical or even impos-
sible. Hybrid parallelism enables the division of the model across multiple
devices to manage memory constraints while simultaneously distributing the
dataset to process vast amounts of data efÏciently. This dual capability ensures
that training systems can handle the computational and memory demands of
the largest models and datasets without compromise.
Another critical advantage lies in hardware utilization. In many distributed
training systems, inefÏciencies can arise when devices sit idle during different
stages of computation or synchronization. Hybrid parallelism mitigates this
issue by ensuring that all devices are actively engaged. Whether a device is
computing forward passes through its portion of the model or processing data
batches, hybrid strategies maximize resource usage, leading to faster training
times and improved throughput.
Flexibility is another hallmark of hybrid parallelism. Machine learning mod-
els vary widely in architecture and computational demands. For instance,
convolutional neural networks prioritize spatial data processing, while trans-
formers require intensive operations like matrix multiplications in attention
mechanisms. Hybrid parallelism adapts to these diverse needs by allowing
practitioners to apply model and data parallelism selectively. This adaptability
ensures that hybrid approaches can be tailored to the specific requirements of
a given model, making it a versatile solution for diverse training scenarios.
Moreover, hybrid parallelism reduces communication bottlenecks, a common
issue in distributed systems. By striking a balance between distributing model
computations and spreading data processing, hybrid strategies minimize the
amount of inter-device communication required during training. This efÏcient
coordination not only speeds up the training process but also enables the
effective use of large-scale distributed systems where network latency might
otherwise limit performance.
Finally, hybrid parallelism supports the ambitious scale of modern AI re-
search and development. It provides a framework for leveraging cutting-edge
hardware infrastructures, including clusters of GPUs or TPUs, to train models
that push the boundaries of what’s possible. Without hybrid parallelism, many
of the breakthroughs in AI, including large language models and advanced
vision systems, would remain unattainable due to resource limitations.
8.6.3.3 Challenges
While hybrid parallelism provides a robust framework for scaling machine learn-
ing training, it also introduces complexities that require careful consideration.
These challenges stem from the intricate coordination needed to integrate both
model and data parallelism effectively. Understanding these obstacles is crucial
for designing efÏcient hybrid systems and avoiding potential bottlenecks.
One of the primary challenges of hybrid parallelism is communication over-
head. Both model and data parallelism involve significant inter-device commu-
nication. In model parallelism, devices must exchange intermediate outputs and
gradients to maintain the sequential flow of computation. In data parallelism,
gradients computed on separate data subsets must be synchronized across
devices. Hybrid parallelism compounds these demands, as it requires efÏcient
communication for both processes simultaneously. If not managed properly,
the resulting overhead can negate the benefits of parallelization, particularly in
large-scale systems with slower interconnects or high network latency.
Another critical challenge is the complexity of implementation. Hybrid par-
allelism demands a nuanced understanding of both model and data parallelism
techniques, as well as the underlying hardware and software infrastructure.
Designing efÏcient hybrid strategies involves making decisions about how to
partition the model, how to distribute data, and how to synchronize computa-
tions across devices. This process often requires extensive experimentation and
optimization, particularly for custom architectures or non-standard hardware
setups. While modern frameworks like PyTorch and TensorFlow provide tools
for distributed training, implementing hybrid parallelism at scale still requires
significant engineering expertise.
Workload balancing also presents a challenge in hybrid parallelism. In a dis-
tributed system, not all devices may have equal computational capacity. Some
devices may process data or compute gradients faster than others, leading
to inefÏciencies as faster devices wait for slower ones to complete their tasks.
Additionally, certain model layers or operations may require more resources
than others, creating imbalances in computational load. Managing this dispar-
ity requires careful tuning of partitioning strategies and the use of dynamic
workload distribution techniques.
Memory constraints remain a concern, even in hybrid setups. While model
parallelism addresses the issue of fitting large models into device memory, the
additional memory requirements for data parallelism, such as storing multiple
data batches and gradient buffers, can still exceed available capacity. This is
especially true for models with extremely large intermediate computations,
such as transformers with high-dimensional attention mechanisms. Balancing
memory usage across devices is essential to prevent resource exhaustion during
training.
Lastly, hybrid parallelism poses challenges related to fault tolerance and de-
bugging. Distributed systems are inherently more prone to hardware failures
and synchronization errors. Debugging issues in hybrid setups can be signif-
icantly more complex than in standalone model or data parallelism systems,
as errors may arise from interactions between the two approaches. Ensuring
robust fault-tolerance mechanisms and designing tools for monitoring and
debugging distributed systems are essential for maintaining reliability.
Despite these challenges, hybrid parallelism remains an indispensable strat-
egy for training state-of-the-art machine learning models. By addressing these
obstacles through optimized communication protocols, intelligent partition-
ing strategies, and robust fault-tolerance systems, practitioners can unlock the
full potential of hybrid parallelism and drive innovation in AI research and
applications.
8.6.4 Comparison
The features of data parallelism, model parallelism, and hybrid parallelism are
summarized in Table 8.6. This comparison highlights their respective focuses,
memory requirements, communication overheads, scalability, implementation
complexity, and ideal use cases. By examining these factors, practitioners can
determine the most suitable approach for their training needs.
Table 8.6: Comparison of data parallelism, model parallelism, and hybrid par-
allelism across key aspects.
Aspect Data Parallelism Model Parallelism Hybrid Parallelism
Focus Distributes dataset across Distributes the model Combines model and
devices, each with a full across devices, each data parallelism for
model copy handling a portion of the balanced scalability
model
Memory Requirement High (entire model on Low (model split across Moderate (splits model
per Device each device) devices) and dataset across
devices)
Communication Moderate to High High (communication for Very High (requires
Overhead (gradient intermediate activations synchronization for both
synchronization across and gradients) model and data)
devices)
Scalability Good for large datasets Good for very large Excellent for extremely
with moderate model models with smaller large models and
sizes datasets datasets
Implementation Low to Moderate Moderate to High High (complex
Complexity (relatively (requires careful integration of model and
straightforward with partitioning and data parallelism)
existing tools) coordination)
Ideal Use Case Large datasets where Extremely large models Training massive models
model fits within a single that exceed single-device on vast datasets in
device memory limits large-scale systems
Such guidance is best viewed as a foundational tool for understanding the trade-offs and de-
cision points in parallelism strategy selection. Practitioners should consider
this guideline as a starting point and adapt it to the specific requirements and
constraints of their systems to achieve optimal performance.
8.8.1 GPUs
Machine learning training systems demand immense computational power
to process large datasets, perform gradient computations, and update model
parameters efÏciently. GPUs have emerged as a critical technology to meet
these requirements (Figure 8.21), primarily due to their highly parallelized
architecture and ability to execute the dense linear algebra operations central
to neural network training (Dally, Keckler, and Kirk 2021).
Despite their advantages, GPUs are not without challenges. Effective utiliza-
tion of GPUs demands careful attention to workload balancing and inter-device
communication. Training systems must also consider the cost implications, as
GPUs are resource-intensive and require optimized data centers to operate at
scale. However, with innovations like NVLink and CUDA-X libraries, these
challenges are continually being addressed.
In conclusion, GPUs are indispensable for modern machine learning training
systems due to their versatility, scalability, and integration with advanced
software frameworks. By addressing key bottlenecks in computation, memory,
and distribution, GPUs play a foundational role in enabling the large-scale
training pipelines discussed throughout this chapter.
8.8.2 TPUs
Tensor Processing Units (TPUs) and other custom accelerators have been purpose-
built to address the unique challenges of large-scale machine learning training.
Unlike GPUs, which are versatile and serve a wide range of applications, TPUs
are specifically optimized for the computational patterns found in deep learning,
such as matrix multiplications and convolutional operations (Jouppi, Young, et
al. 2017c). These devices mitigate training bottlenecks by offering high through-
put, specialized memory handling, and tight integration with machine learning
frameworks.
As illustrated in Figure 8.22, TPUs have undergone significant architectural
evolution, with each generation introducing enhancements tailored for increas-
ingly demanding AI workloads. The first-generation TPU, introduced in 2015,
was designed for internal inference acceleration. Subsequent iterations have fo-
cused on large-scale distributed training, memory optimizations, and efÏciency
improvements, culminating in the most recent Trillium architecture. These
advancements illustrate how domain-specific accelerators continue to push the
boundaries of AI performance and efÏciency.
8.8.3 FPGAs
Field-Programmable Gate Arrays (FPGAs) are versatile hardware solutions
that allow developers to tailor their architecture for specific machine learning
workloads. Unlike GPUs or TPUs, which are designed with fixed architectures,
FPGAs can be reconfigured dynamically, offering a unique level of flexibility.
This adaptability makes them particularly valuable for applications that require
customized optimizations, low-latency processing, or experimentation with
novel algorithms.
Microsoft had been exploring the use of FPGAs for a while, as seen in Fig-
ure 8.23, with one prominent example being Project Brainwave. This initiative
leverages FPGAs to accelerate machine learning workloads in the Azure cloud.
Microsoft chose FPGAs for their ability to provide low-latency inference (not
training) while maintaining high throughput. This approach is especially bene-
ficial in scenarios where real-time predictions are critical, such as search engine
queries or language translation services. By integrating FPGAs directly into
their data center network, Microsoft has achieved significant performance gains
while minimizing power consumption.
8.8.4 ASICs
Application-Specific Integrated Circuits (ASICs) represent a class of hardware
designed for specific tasks, offering unparalleled efÏciency and performance by
eschewing the general-purpose flexibility of GPUs or FPGAs. Among the most
innovative examples of ASICs for machine learning training is the Cerebras
Wafer-Scale Engine (WSE), as shown in Figure 8.24, which stands apart for its
unique approach to addressing the computational and memory challenges of
training massive machine learning models.
Rather than being diced into many smaller chips, the WSE uses an entire silicon wafer as a single processor. This architecture enables Cerebras to pack 2.6 trillion transis-
tors and 850,000 cores onto a single device. These cores are connected via a
high-bandwidth, low-latency interconnect, allowing data to move across the
chip without the bottlenecks associated with external communication between
discrete GPUs or TPUs (Feldman et al. 2020).
From a machine learning training perspective, the WSE addresses several
critical bottlenecks:
1. Data Movement: In traditional distributed systems, significant time is
spent transferring data between devices. The WSE eliminates this by keep-
ing all computations and memory on a single wafer, drastically reducing
communication overhead.
2. Memory Bandwidth: The WSE integrates 40 GB of high-speed on-chip
memory directly adjacent to its processing cores. This proximity allows
for near-instantaneous access to data, overcoming the latency challenges
that GPUs often face when accessing off-chip memory.
3. Scalability: While traditional distributed systems rely on complex soft-
ware frameworks to manage multiple devices, the WSE simplifies scaling
by consolidating all resources into one massive chip. This design is par-
ticularly well-suited for training large language models and other deep
learning architectures that require significant parallelism.
8.9 Conclusion
AI training systems are built upon a foundation of mathematical principles,
computational strategies, and architectural considerations. The exploration
of neural network computation has shown how core operations, activation
functions, and optimization algorithms come together to enable efÏcient model
training, while also emphasizing the trade-offs that must be balanced between
memory, computation, and performance.
The design of training pipelines incorporates key components such as data
flows, forward and backward passes, and memory management. Understand-
ing these elements in conjunction with hardware execution patterns is essential
for achieving efÏcient and scalable training processes. Strategies like parameter
updates, prefetching, and gradient accumulation further enhance the effective-
ness of training by optimizing resource utilization and reducing computational
bottlenecks.
Distributed training systems, including data parallelism, model parallelism,
and hybrid approaches, are topics that we examined as solutions for scaling
AI training to larger datasets and models. Each approach comes with its own
benefits and challenges, highlighting the need for careful consideration of
system requirements and resource constraints.
Altogether, the combination of theoretical foundations and practical imple-
mentations forms a cohesive framework for addressing the complexities of
AI training. By leveraging this knowledge, it is possible to design robust, efÏ-
cient systems capable of meeting the demands of modern machine learning
applications.
8.10 Resources
Slides
• Coming soon.
Videos
• Coming soon.
Exercises
• Coming soon.
Chapter 9
Efficient AI
Purpose
What principles guide the efficient design of machine learning systems, and why is
understanding the interdependence of key resources essential?
Machine learning systems are shaped by the complex interplay among data,
models, and computing resources. Decisions on efÏciency in one dimension
often have ripple effects in the others, presenting both opportunities for synergy
and inevitable trade-offs. Understanding these individual components and
their interdependencies exposes not only how systems can be optimized but
also why these optimizations are crucial for achieving scalability, sustainabil-
ity, and real-world applicability. The relationship between data, model, and
computing efÏciency forms the basis for designing machine learning systems
that maximize capabilities while working within resource limitations. Each
efficiency decision represents a balance between performance and practicality.
Learning Objectives
9.1 Overview
Machine learning systems have become ubiquitous, permeating nearly every
aspect of modern life. As these systems grow in complexity and scale, they
must operate effectively across a wide range of deployments and scenarios.
This necessitates careful consideration of factors such as processing speed,
memory usage, and power consumption to ensure that models can handle large
workloads, operate on energy-constrained devices, and remain cost-effective.
Achieving this balance involves navigating trade-offs. For instance, in au-
tonomous vehicles, reducing a model’s size to fit the low-power constraints of
an edge device in a car might slightly decrease accuracy, but it ensures real-time
processing and decision-making. Conversely, a cloud-based system can afford
higher model complexity for improved accuracy, though this often comes at
the cost of increased latency and energy consumption. In the medical field, de-
ploying machine learning models on portable devices for diagnostics requires
efÏcient models that can operate with limited computational resources and
power, ensuring accessibility in remote or resource-constrained areas. Con-
versely, hospital-based systems can leverage more powerful hardware to run
complex models for detailed analysis, albeit with higher energy demands.
Understanding and managing these trade-offs is crucial for designing ma-
chine learning systems that meet diverse application needs within real-world
constraints. The implications of these design choices extend beyond perfor-
mance and cost. EfÏcient systems can be deployed across diverse environments,
from cloud infrastructures to edge devices, enhancing accessibility and adop-
tion. Additionally, they help reduce the environmental impact of machine
learning workloads by lowering energy consumption and carbon emissions,
aligning technological progress with ethical and ecological responsibilities.
This chapter focuses on the ‘why’ and ‘how’ of efÏciency in machine learning
systems. By establishing the foundational principles of efÏcient AI and explor-
ing strategies to achieve it, this chapter sets the stage for deeper discussions
on topics such as scaling, optimization, deployment, and sustainability in later
chapters.
Scaling laws are commonly expressed as a power law of the form

$$\mathcal{L}(N) = A N^{-\alpha} + B$$

where $\mathcal{L}(N)$ represents the loss achieved with resource quantity $N$, $A$ is a task-dependent constant, $B$ is the irreducible error that remains even with unlimited resources, and $\alpha$ is the scaling exponent that characterizes the
rate of performance improvement. A larger value of 𝛼 signifies that performance
improvements are more efÏcient with respect to scaling. This formulation
also encapsulates the principle of diminishing returns: incremental gains in
performance decrease as 𝑁 increases.
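A small numerical illustration of diminishing returns, using hypothetical constants (A = 10, B = 1, α = 0.35) chosen purely for demonstration: each tenfold increase in N yields a smaller absolute reduction in loss.

A, B, alpha = 10.0, 1.0, 0.35            # illustrative constants only

for N in (1e3, 1e4, 1e5, 1e6):
    loss = A * N ** (-alpha) + B         # L(N) = A * N^(-alpha) + B
    print(f"N = {N:.0e}  loss = {loss:.3f}")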
Empirical evidence for scaling laws is most prominently observed in large
language models. In a seminal study, Kaplan et al. (2020) demonstrated that
the cross-entropy loss of transformer-based language models scales predictably
with three pivotal factors: the number of model parameters, the volume of
the training dataset (measured in tokens), and the total computational budget
(measured in floating-point operations).2

2 This study significantly altered the machine learning community's understanding
of the impact of scale on model performance through comprehensive empirical
validation of scaling laws. Its findings directly influenced training methodologies
for large language models such as GPT-3, establishing a quantitative framework
for predicting performance improvements based on compute, data, and model size.

When these factors are augmented proportionally, models exhibit consistent
performance improvements without necessitating architectural modifications
or task-specific tuning. This behavior underlies contemporary training strategies
for large-scale language models and has significantly influenced design
decisions in both research and production environments.

These empirical patterns are illustrated in Figure 9.5, which presents test
loss curves for models spanning a range of sizes, from $10^3$ to $10^9$ parameters.
The figure reveals two key insights. First, larger models demonstrate superior
sample efficiency, achieving target performance levels with fewer training
tokens. Second, as computational resources increase, the optimal model size
correspondingly grows, with loss decreasing predictably as compute is scaled up.
Figure 9.7: The three scaling regimes: pre-training, post-training, and test-time scaling ("long thinking"). Each regime exhibits different compute-performance characteristics, plotted as intelligence versus compute.
Efficiency in machine learning systems rests on three pillars: algorithmic efficiency, compute
efficiency, and data efficiency. These pillars represent critical domains that have
profoundly influenced how we navigate the trade-offs revealed by scaling laws.
Algorithmic efÏciency pertains to the design and optimization of algorithms
to maximize performance within given resource constraints. As scaling laws
indicate that larger models generally perform better, algorithmic efÏciency
becomes crucial for making these models practical and deployable. Contempo-
rary research focuses on techniques such as model compression, architectural
optimization, and algorithmic refinement, all aimed at preserving the benefits
of scale while minimizing resource consumption.
Compute efÏciency addresses the optimization of computational resources,
including hardware and energy utilization. Scaling laws have shown that
training compute requirements are growing at an exponential rate, making
compute efÏciency increasingly critical. The advent of specialized hardware
accelerators, such as GPUs and TPUs, has enabled the development of large-
scale models. However, the energy demands associated with training and
deploying these models have raised concerns regarding sustainability. Compute
efÏciency, therefore, encompasses strategies for optimizing hardware utilization,
reducing energy footprint, and exploring alternative computing paradigms
that can support continued scaling.
Data efÏciency focuses on maximizing the information gained from available
data while minimizing the required data volume. Scaling laws demonstrate
that model performance improves with larger datasets, but they also reveal
diminishing returns and practical limits to data collection. This pillar becomes
especially important as we approach the boundaries of available high-quality
data in domains like language modeling. Methods such as data augmentation,
active learning, and efÏcient data representation aim to achieve the benefits
predicted by scaling laws with reduced data requirements.
These three pillars are not mutually exclusive; rather, they are deeply inter-
twined and often mutually reinforcing. Improvements in one pillar can lead
to gains in others, and trade-offs between them are frequently necessary. As
we examine the historical evolution of these dimensions, as depicted in Fig-
ure 9.8, we will elucidate the dynamic interplay between algorithmic, compute,
and data efÏciency, providing a foundation for understanding how to achieve
efÏcient scaling in contemporary machine learning.
One of the earliest of these techniques was pruning, which removes unnecessary weights or connections from a neural network, reducing both the model's parameters and its computational overhead (Yann LeCun, Denker, and Solla 1989). Quantization focused on lowering the precision of numerical representations, enabling models to run faster and with less memory (Jacob et al. 2018a). Knowledge distillation allowed large, resource-intensive models (referred to as “teachers”) to transfer their knowledge to smaller, more efficient models (referred to as “students”), achieving comparable performance with reduced complexity (Hinton, Vinyals, and Dean 2015a).
At the same time, new architectures specifically designed for efficiency began to emerge. Models such as MobileNet (A. G. Howard et al. 2017a), EfficientNet (Tan and Le 2019b), and SqueezeNet (Iandola et al. 2016) demonstrated that compact designs could deliver high performance, enabling their deployment on devices with limited computational power, such as smartphones and IoT devices.⁶

⁶ MobileNet/EfficientNet/SqueezeNet: Compact neural network architectures designed for efficiency, balancing high performance with reduced computational demands. MobileNet introduced depthwise separable convolutions (2017), EfficientNet applied compound scaling (2019), and SqueezeNet focused on reducing parameters using 1x1 convolutions (2016).

9.3.1.3 Modern Efficiency

As machine learning systems continue to grow in scale and complexity, the focus on algorithmic efficiency has expanded to address sustainability and scalability. Today's challenges require balancing performance with resource efficiency, particularly as models like GPT-4 and beyond are applied to increasingly diverse tasks and environments. One emerging approach involves sparsity, where only the most critical parameters of a model are retained, significantly reducing computational and memory demands. Hardware-aware design has also become a priority, as researchers optimize models to take full advantage of specific accelerators, such as GPUs, TPUs, and edge processors. Another important trend is parameter-efficient fine-tuning, where large pre-trained models can be adapted to new tasks by updating only a small subset of parameters. Low-Rank Adaptation (LoRA)⁷ and prompt-tuning exemplify this approach, allowing systems to achieve task-specific performance while maintaining the efficiency advantages of smaller models.

⁷ Low-Rank Adaptation (LoRA): A technique that adapts large pre-trained models to new tasks by updating only a small subset of parameters, significantly reducing computational and memory requirements.

As shown in Figure 9.9, model training compute requirements have been growing at an accelerating rate, especially in the deep learning era. This trend underscores the necessity for algorithmic innovations that enhance efficiency without compromising performance.

These advancements reflect a broader shift in focus: from scaling models indiscriminately to creating architectures that are purpose-built for efficiency. This modern era emphasizes not only technical excellence but also the practicality and sustainability of machine learning systems.
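To make the LoRA idea concrete, the sketch below wraps a frozen PyTorch linear layer with a trainable low-rank update. The class name, rank, and scaling choices are illustrative assumptions, not the reference implementation.

```python
# Minimal LoRA-style adapter sketch (illustrative, not the reference implementation).
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wraps a frozen linear layer with a trainable low-rank update: y = Wx + (alpha/r) * B A x."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # freeze the pre-trained weights
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scale = alpha / r

    def forward(self, x):
        # Base projection plus the low-rank correction; only A and B receive gradients.
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(768, 768), r=8)
out = layer(torch.randn(2, 768))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(f"Trainable parameters: {trainable}")   # 2 * 8 * 768 = 12,288 instead of 768 * 768
```

Only the two small factor matrices are updated during fine-tuning, which is what gives parameter-efficient adaptation its memory and compute savings.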
[Figure 9.9: model training compute requirements under best, expected, and worst-case scenarios.]
Compute efficiency also interacts closely with model and data efficiency. For example, compact models reduce computational requirements, while efficient data pipelines streamline hardware usage.

The evolution of compute efficiency highlights its essential role in addressing the growing demands of modern machine learning systems. From early reliance on CPUs to the emergence of specialized accelerators and sustainable computing practices, this dimension remains central to building scalable, accessible, and environmentally responsible machine learning systems.
By shifting emphasis away from sheer data quantity and developing sophisticated techniques for data selection and processing, the field is moving toward more sustainable and effective approaches to model training and deployment.
Compact models not only consume fewer resources but are also easier to deploy
across diverse environments, such as resource-constrained edge devices or
energy-intensive cloud infrastructure.
Moreover, efÏcient models often require less data for training, as they avoid
over-parameterization and focus on capturing essential patterns within the
data. This results in shorter training times and reduced dependency on massive
datasets, which can be expensive and time-consuming to curate. As a result, op-
timizing algorithmic efÏciency creates a ripple effect, enhancing both compute
and data efÏciency.
However, overly simplifying a model can reduce its accuracy, especially for
complex tasks. To make up for this loss, additional computational resources
may be required during training to fine-tune the model or during deployment
to apply more sophisticated inference algorithms. Thus, while algorithmic
efÏciency can reduce computational costs, achieving this often places additional
strain on compute efÏciency.
9.5.1.4 Summary
The interdependencies between model, compute, and data efÏciency are the
foundation of a well-designed machine learning system. While these dimen-
sions can reinforce one another, building a system that achieves this synergy
often requires navigating difÏcult trade-offs. These trade-offs highlight the
complexity of designing machine learning systems that balance performance,
scalability, and resource constraints.
Constraining the dataset too aggressively, however, risks missing key edge cases, which could degrade the model's performance in diverse environments.
Conversely, in cloud-based systems, where compute resources are more abun-
dant, training on massive datasets can still pose challenges. Managing data
redundancy, ensuring high-quality labeling, and handling the time and cost
associated with large-scale data pipelines often require significant computa-
tional infrastructure. This trade-off highlights how the need to balance dataset
size and model generalization depends heavily on the deployment context and
available resources.
9.5.2.4 Summary
The interplay between model complexity, compute resources, energy efÏciency,
real-time performance, and dataset size illustrates the inherent trade-offs in
machine learning system design. These trade-offs are rarely one-dimensional;
decisions to optimize one aspect of a system often ripple through the others,
requiring careful consideration of the specific goals and constraints of the
application.
Designers must weigh the advantages and limitations of each trade-off in the
context of the deployment environment. For instance, a cloud-based system
might prioritize scalability and throughput over energy efÏciency, while an
edge system must balance real-time performance with strict power constraints.
Similarly, resource-limited Tiny ML deployments require exceptional data and
algorithmic efÏciency to operate within severe hardware restrictions.
By understanding these common trade-offs, we can begin to identify strate-
gies for navigating them effectively. The next section will explore practical
approaches to managing these tensions, focusing on techniques and design
principles that enable system efÏciency while addressing the complexities of
real-world applications.
9.6.3 Co-Design
EfÏcient machine learning systems are rarely the product of isolated optimiza-
tions. Achieving balance across model, compute, and data efÏciency requires
an end-to-end perspective, where each component of the system is designed in
tandem with the others. This holistic approach, often referred to as co-design,
involves aligning model architectures, hardware platforms, and data pipelines
to work seamlessly together.
One of the key benefits of co-design is its ability to mitigate trade-offs by
tailoring each component to the specific requirements of the system. For in-
stance, consider a speech recognition system deployed on a mobile device. The
model must be compact enough to fit within the device’s tiny ML memory con-
straints while still delivering real-time performance. By designing the model
architecture to leverage the capabilities of hardware accelerators, such as NPUs,
it becomes possible to achieve low-latency inference without excessive energy consumption. Similarly, careful preprocessing and augmentation of the training data can ensure robust performance, even with a smaller, streamlined model.

Co-design becomes essential in resource-constrained environments like Edge ML and Tiny ML deployments. Models must align precisely with hardware capabilities. For example, 8-bit models¹³ require hardware support for efficient integer operations, while pruned models benefit from sparse tensor operations. Similarly, edge accelerators often optimize specific operations like convolutions.

¹³ 8-bit models: ML models use 8-bit integer representations for weights and activations instead of the standard 32-bit floating-point format, reducing memory usage and computational requirements for faster, more energy-efficient inference on compatible hardware.
9.6.4 Automation
Navigating the trade-offs between model, compute, and data efÏciency is a
complex task that often involves numerous iterations and expert judgment.
Automation and optimization tools have emerged as powerful solutions for
managing these challenges, streamlining the process of balancing efÏciency
dimensions while reducing the time and expertise required.
One widely used approach is automated machine learning (AutoML), which
enables the exploration of different model architectures, hyperparameter con-
figurations, and feature engineering techniques. By automating these aspects
of the design process, AutoML can identify models that achieve an optimal
balance between performance and efÏciency. For instance, an AutoML pipeline
might search for a lightweight model architecture that delivers high accuracy
while fitting within the resource constraints of an edge device (F. Hutter, Kot-
thoff, and Vanschoren 2019a). This approach reduces the need for manual
trial-and-error, making optimization faster and more accessible.
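To illustrate the idea, the toy sketch below runs a random search over a hypothetical configuration space subject to a parameter budget. The search space, budget, parameter estimate, and scoring function are all invented placeholders standing in for real training and validation, not any particular AutoML library.

```python
# Toy random search over model configurations under a parameter budget
# (illustrative of the AutoML idea, not a production AutoML system).
import random

SEARCH_SPACE = {
    "depth": [2, 4, 6, 8],
    "width": [64, 128, 256, 512],
    "dropout": [0.0, 0.1, 0.3],
}
PARAM_BUDGET = 2_000_000  # hypothetical edge-device constraint

def estimate_params(cfg):
    # Rough count for a stack of equal-width dense layers (illustrative only).
    return cfg["depth"] * cfg["width"] * cfg["width"]

def evaluate(cfg):
    # Placeholder for training + validation; here a synthetic score that
    # rewards capacity but penalizes excess depth and dropout.
    return cfg["width"] * 0.001 + cfg["depth"] * 0.01 - cfg["dropout"] * 0.02

best_cfg, best_score = None, float("-inf")
for _ in range(50):
    cfg = {k: random.choice(v) for k, v in SEARCH_SPACE.items()}
    if estimate_params(cfg) > PARAM_BUDGET:
        continue                      # reject configs that violate the budget
    score = evaluate(cfg)
    if score > best_score:
        best_cfg, best_score = cfg, score

print("Best configuration within budget:", best_cfg)
```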
Neural architecture search (NAS) takes automation a step further by design-
ing model architectures tailored to specific hardware or deployment scenarios.
NAS algorithms evaluate a wide range of architectural possibilities, selecting
those that maximize performance while minimizing computational demands.
For example, NAS can design models that leverage quantization or sparsity
techniques, ensuring compatibility with energy-efÏcient accelerators like TPUs
or microcontrollers (Elsken, Metzen, and Hutter 2019a). This automated co-
design of models and hardware helps mitigate trade-offs by aligning efÏciency
goals across dimensions.
Data efÏciency, too, benefits from automation. Tools that automate dataset
curation, augmentation, and active learning reduce the size of training datasets
without sacrificing model performance. These tools prioritize high-value data
points, ensuring that models are trained on the most informative examples.
This not only speeds up training but also reduces computational overhead,
reinforcing both compute and algorithmic efÏciency (Settles 2012b).
While automation tools are not a panacea, they play a critical role in address-
ing the complexity of trade-offs. By leveraging these tools, system designers
can achieve efÏcient solutions more quickly and at lower cost, freeing them to
focus on broader design challenges and deployment considerations.
9.6.5 Summary
Designing efÏcient machine learning systems requires a deliberate approach
to managing trade-offs between model, compute, and data efÏciency. These
trade-offs are influenced by the context of the deployment, the constraints
of the hardware, and the goals of the application. By prioritizing efÏciency
dimensions based on the specific needs of the system, embracing end-to-end
co-design, and leveraging automation tools, it becomes possible to navigate
these challenges effectively.
The strategies explored illustrate how thoughtful design can transform trade-
offs into opportunities for synergy. For example, aligning model architectures
with hardware capabilities can mitigate energy constraints, while automation
tools like AutoML and NAS streamline the process of optimizing efÏciency
dimensions. These approaches underscore the importance of treating system
efÏciency as a holistic endeavor, where components are designed to complement
and reinforce one another.
For example, data collection and preprocessing are often the starting points
of the pipeline. The quality and diversity of the data directly impact model
performance and efÏciency. Curating smaller, high-quality datasets can reduce
computational costs during training while simplifying the model’s design.
However, insufÏcient data diversity may affect generalization, necessitating
compensatory measures in model architecture or training procedures. By
aligning the data strategy with the model and deployment context, designers
can avoid inefÏciencies downstream.
Model training is another critical stage. The choice of architecture, optimiza-
tion techniques, and hyperparameters must consider the constraints of the
deployment hardware. A model designed for high-performance cloud systems
may emphasize accuracy and scalability, leveraging large datasets and com-
pute resources. Conversely, a model intended for edge devices must balance
accuracy with size and energy efÏciency, often requiring compact architectures
and quantization techniques tailored to specific hardware.
Deployment and inference demand precise hardware alignment. Each plat-
form offers distinct capabilities. GPUs excel at parallel matrix operations,
TPUs optimize specific neural network computations, and microcontrollers
provide energy-efÏcient scalar processing. For example, a smartphone speech
recognition system might leverage an NPU’s dedicated convolution units for
5-millisecond inference times at 1-watt power draw, while an autonomous
vehicle’s FPGA-based accelerator processes multiple sensor streams with 50-
microsecond latency. This hardware-software integration determines real-
world efÏciency.
An end-to-end perspective ensures that trade-offs are addressed holistically,
rather than shifting inefÏciencies from one stage of the pipeline to another. By
treating the system as an integrated whole, machine learning practitioners can
design solutions that are not only efÏcient but also robust and scalable across
diverse deployment scenarios.
9.7.2 Scenarios
The efÏciency needs of machine learning systems differ significantly depending
on the lifecycle stage and deployment environment. From research prototypes to
production systems, and from high-performance cloud applications to resource-
constrained edge deployments, each scenario presents unique challenges and
trade-offs. Understanding these differences is crucial for designing systems
that meet their operational requirements effectively.
Edge deployments, spanning embedded systems, mobile devices, and IoT sensors, impose strict limitations on compute power, memory, and energy consumption. Transitioning from a research prototype to a
production-ready system often involves significant optimization, such as model
pruning, quantization, or retraining on targeted datasets. This shift highlights
the need to balance performance and efÏciency as systems move from concept
to deployment.
9.7.3 Summary
Designing machine learning systems with efÏciency in mind requires a holistic
approach that considers the specific needs and constraints of the deployment
context. From research prototypes to production systems, and across environ-
ments as varied as cloud data centers, mobile devices, and Tiny ML applications,
the priorities for efÏciency differ significantly. Each stage of the machine learn-
ing pipeline, including data collection, model design, training, deployment,
and inference, presents unique trade-offs that must be navigated thoughtfully.
9.8 Broader Challenges
[Figure: number of components per integrated circuit over time, illustrating Moore's Law (with data points from 1965 and a 1970 projection).]
Even when data is available, the ability to process and curate it efÏciently
depends on computational and human resources. Large organizations rou-
tinely employ data engineering teams and automated pipelines for curation
and augmentation, enabling them to optimize data efÏciency and improve
downstream performance. In contrast, smaller groups often lack access to the
tools or expertise needed for such tasks, leaving them at a disadvantage in both
research and practical applications.
Democratizing data efÏciency requires more open sharing of pre-trained
models and datasets. Initiatives like Hugging Face’s open access to transform-
ers or multilingual models by organizations like Meta’s No Language Left
Behind aim to make state-of-the-art NLP models available to researchers and
practitioners worldwide. These efforts help reduce the barriers to entry for
data-scarce regions, enabling more equitable access to AI capabilities.
EfÏciency often favors established techniques and systems that have already
been proven to work well. For instance, optimizing neural networks through
pruning, quantization, or distillation typically involves refining existing archi-
tectures rather than developing entirely new ones. While these approaches
provide incremental improvements, they may come at the cost of exploring
novel designs or paradigms that could yield transformative breakthroughs.
Consider the shift from traditional machine learning methods to deep learn-
ing. Early neural network research in the 1990s and 2000s required significant
computational resources and often failed to outperform simpler methods on
practical tasks. Despite this, researchers continued to push the boundaries of
what was possible, eventually leading to the breakthroughs in deep learning
that define modern AI. If the field had focused exclusively on efÏciency during
that period, these innovations might never have emerged.
9.9 Conclusion
EfÏciency in machine learning systems is essential not just for achieving techni-
cal goals but for addressing broader questions about scalability, sustainability,
and inclusivity. This chapter has focused on the why and how of efÏciency—
why it is critical to modern machine learning and how to achieve it through a
balanced focus on model, compute, and data dimensions. We began by explor-
ing the empirical foundations of scaling laws, revealing how model performance
scales with resources and highlighting the critical importance of efÏcient re-
source utilization as models grow in complexity. The trade-offs and challenges
inherent in scaling, as well as the potential for scaling breakdowns, underscore
the necessity of a holistic approach to system design.
By understanding the interdependencies and trade-offs inherent in the algo-
rithmic, compute, and data dimensions of efÏciency, we can build systems that
align with their operational contexts and long-term objectives. The challenges
discussed in this chapter, from the limits of optimization to equity concerns
and the tension between efÏciency and innovation, highlight the need for a
9.10 Resources

Slides

• Coming soon.

Videos

• Coming soon.

Exercises

• Coming soon.
Chapter 10

Model Optimizations

Learning Objectives
10.1 Overview
As machine learning models evolve in complexity and become increasingly
ubiquitous, the focus shifts from solely enhancing accuracy to ensuring that
models are practical, scalable, and efÏcient. The substantial computational
requirements for training and deploying state-of-the-art models frequently
surpass the limitations imposed by real-world environments, whether in ex-
pansive data centers or on resource-constrained mobile devices. Additionally,
considerations such as memory constraints, energy consumption, and inference
latency critically influence the effective deployment of these models. Model
optimization, therefore, serves as the framework that reconciles advanced mod-
eling techniques with practical system limitations, ensuring that enhanced
performance is achieved without compromising operational viability.
The necessity for model optimization arises from the inherent limitations of
modern computational systems. Machine learning models function within a broader system context defined by the hardware on which the model runs and the operational requirements of the application. Understanding these constraints is essential for developing effective optimization
strategies that balance accuracy, efÏciency, and feasibility. The primary system
constraints that drive model optimization include:
Computational Cost: Training and inference require significant compute
resources, especially for large-scale models. The computational complexity of a
model affects the feasibility of training on large datasets and deploying real-
time inference workloads. Optimization techniques that reduce computation,
including pruning, quantization, and efÏcient architectures, can significantly
lower costs.
Memory and Storage Limitations: Models must fit within the memory constraints of the target system. This includes RAM³ limitations during execution and storage constraints for model persistence. Large models with billions of parameters may exceed the capacity of edge devices or embedded systems, necessitating optimizations that reduce memory footprint without compromising performance.

Latency and Throughput: Many applications impose real-time constraints, requiring models to produce predictions within strict latency budgets. In autonomous systems, healthcare diagnostics, and interactive AI applications, slow inference times can render a model unusable. Optimizing model execution, by employing reduced precision arithmetic, optimizing data movement, or utilizing parallel computation, can help meet real-time constraints.

Energy Efficiency and Power Consumption: Power constraints are critical in mobile, edge, and embedded AI systems. High energy consumption impacts battery-powered devices and increases operational costs in large-scale cloud deployments. Techniques such as model sparsity,⁴ adaptive computation,⁵ and hardware-aware optimization contribute to energy-efficient AI.

Scalability and Hardware Compatibility: Model optimizations must align with the capabilities of the target hardware. A model optimized for specialized accelerators (e.g., GPUs, TPUs, FPGAs) may not perform efficiently on general-purpose CPUs. Additionally, scaling models across distributed systems introduces new challenges in synchronization and workload balancing.

³ Random Access Memory (RAM): Hardware feature that provides fast, volatile working memory for temporary data storage during program execution. Unlike persistent storage devices like the hard disk, RAM enables rapid data access but loses its contents when powered off, making it critical for efficient computation and memory-intensive operations.

⁴ Model Sparsity: Refers to techniques that involve using fewer non-zero parameters within a model to reduce complexity and increase speed.

⁵ Adaptive Computation: Dynamic adjustment of computational resources based on the task complexity to optimize efficiency.

These constraints are interdependent, meaning that optimizing for one factor may impact another. For example, reducing numerical precision can lower memory usage and improve inference speed but may introduce quantization errors that degrade accuracy. Similarly, aggressive pruning can reduce computation but may lead to diminished generalization if not carefully managed.
⁷ Hardware-Aware Scheduling: Optimizing computational tasks based on the specific hardware characteristics.

Table 10.1: Mapping of system constraints to optimization dimensions.

System Constraint | Model Representation | Numerical Precision | Architectural Efficiency
Computational Cost | � | ✓ | ✓
Memory and Storage | ✓ | ✓ | �
Latency and Throughput | ✓ | � | ✓
Energy Efficiency | � | ✓ | ✓
Scalability | ✓ | � | ✓
10.4.1 Pruning
State-of-the-art machine learning models often contain millions, or even bil-
lions, of parameters, many of which contribute minimally to final predictions.
While large models enhance representational power and generalization, they
also introduce inefÏciencies that impact both training and deployment. From
a machine learning systems perspective, these inefÏciencies present several
challenges:
1. High Memory Requirements: Large models require substantial storage,
limiting their feasibility on resource-constrained devices such as smart-
phones, IoT devices, and embedded systems. Storing and loading these
models also creates bandwidth bottlenecks in distributed ML pipelines.
2. Increased Computational Cost: More parameters lead to higher inference
latency and energy consumption, which is particularly problematic for
real-time applications such as autonomous systems, speech recognition,
and mobile AI. Running unoptimized models on hardware accelerators
like GPUs and TPUs requires additional compute cycles, increasing oper-
ational costs.
3. Scalability Limitations: Training and deploying large models at scale is
resource-intensive in terms of compute, memory, and power. Large-scale
distributed training demands high-bandwidth communication and stor-
age, while inference in production environments becomes costly without
optimizations.
Definition of Pruning
Pruning allows models to become smaller, faster, and more efÏcient with-
out requiring fundamental changes to their architecture. By reducing redun-
dancy, pruning directly addresses the memory, computation, and scalability
constraints of machine learning systems, making it a key optimization technique
for deploying ML models across cloud, edge, and mobile platforms.
Formally, pruning can be posed as a constrained optimization problem:

$$\min_{\hat{W}} \; \mathcal{L}(\hat{W}) \quad \text{subject to} \quad \|\hat{W}\|_0 \leq k$$

where:
• ℒ(Ŵ) represents the model's loss function after pruning.
• Ŵ denotes the pruned model's parameters.
• ‖Ŵ‖₀ is the number of nonzero parameters in Ŵ, constrained to a budget k.
As illustrated in Figure 10.3, pruning reduces the number of nonzero weights
by eliminating small-magnitude values, transforming a dense weight matrix
into a sparse representation. This explicit enforcement of sparsity aligns with
the ℓ0 -norm constraint in our optimization formulation.
Figure 10.3: Weight matrix before and after pruning.
Some approaches instead relax the ℓ₀ constraint into an ℓ₁ penalty on the weights, minimizing

$$\mathcal{L}(W) + \lambda \|W\|_1$$

where λ controls the degree of sparsity. The ℓ₁-norm encourages smaller weight values and promotes sparsity but does not strictly enforce zero values. Other methods use iterative heuristics, where parameters with the smallest magnitudes are pruned in successive steps, followed by fine-tuning to recover lost accuracy (Gale, Elsen, and Hooker 2019a).
the final computation. Pruning these weak connections can reduce memory
requirements while preserving most of the model’s accuracy.
Mathematically, unstructured pruning introduces sparsity into the weight
matrices of a neural network. Let 𝑊 ∈ ℝ𝑚×𝑛 represent a weight matrix in a
given layer of a network. Pruning removes a subset of weights by applying a
binary mask 𝑀 ∈ {0, 1}𝑚×𝑛 , yielding a pruned weight matrix:
𝑊̂ = 𝑀 ⊙ 𝑊
beneficial when the goal is to compress a model for storage rather than to
accelerate inference speed. While unstructured pruning improves model efÏ-
ciency at the parameter level, it does not alter the structural organization of the
network.
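As a concrete illustration of the mask formulation Ŵ = M ⊙ W, the sketch below performs unstructured magnitude pruning in NumPy. The target sparsity and the quantile-based threshold are illustrative heuristic choices.

```python
# Unstructured magnitude pruning: zero out the smallest-magnitude weights
# via a binary mask, matching the formulation W_hat = M * W.
import numpy as np

def magnitude_prune(W: np.ndarray, sparsity: float) -> np.ndarray:
    """Return a pruned copy of W with roughly the given fraction of weights set to zero."""
    threshold = np.quantile(np.abs(W), sparsity)   # magnitude cut-off
    mask = (np.abs(W) >= threshold).astype(W.dtype)
    return mask * W

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 4))
W_pruned = magnitude_prune(W, sparsity=0.5)
print(f"Nonzero weights: {np.count_nonzero(W_pruned)} / {W.size}")
```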
Figure 10.5: Unstructured vs. structured pruning of a convolutional neural network: unstructured pruning removes individual weights within kernels, while structured pruning removes entire convolutional channels. Source: C. Qi et al. (2021).
Dynamic pruning, by contrast, decides at runtime which parameters to skip for specific inputs (J. Hu et al. 2023). This method introduces input-dependent sparsity patterns, effectively reducing the computational workload during inference without permanently modifying the model architecture.
For instance, consider a convolutional neural network processing images
with varying complexity. During inference of a simple image containing mostly
uniform regions, many convolutional filters may produce negligible activations.
Dynamic pruning identifies these low-impact filters and temporarily excludes
them from computation, improving efÏciency while maintaining accuracy
for the current input. This adaptive behavior is particularly advantageous in
latency-sensitive applications, where computational resources must be allocated
judiciously based on input complexity.
Another class of dynamic pruning operates during training, where sparsity
is gradually introduced and adjusted throughout the optimization process.
Methods such as gradual magnitude pruning start with a dense network and
progressively increase the fraction of pruned parameters as training progresses.
Instead of permanently removing parameters, these approaches allow the net-
work to recover from pruning-induced capacity loss by regrowing connections
that prove to be important in later stages of training.
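A minimal sketch of such a sparsity schedule is shown below. The cubic ramp is one commonly used form; the step counts and target sparsity here are placeholders.

```python
# Sketch of a gradual sparsity schedule: the pruned fraction ramps up over training
# so the network can adapt between pruning steps (a cubic ramp is one common choice).
def sparsity_at_step(step, begin_step, end_step, final_sparsity, initial_sparsity=0.0):
    if step < begin_step:
        return initial_sparsity
    if step >= end_step:
        return final_sparsity
    progress = (step - begin_step) / (end_step - begin_step)
    return final_sparsity + (initial_sparsity - final_sparsity) * (1.0 - progress) ** 3

# Example: ramp from 0% to 80% sparsity between steps 1,000 and 10,000.
for s in [0, 1_000, 4_000, 7_000, 10_000]:
    print(s, round(sparsity_at_step(s, 1_000, 10_000, 0.8), 3))
```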
Dynamic pruning presents several advantages over static pruning. It allows
models to adapt to different workloads, potentially improving efÏciency while
maintaining accuracy. Unlike static pruning, which risks over-pruning and
degrading performance, dynamic pruning provides a mechanism for selectively
reactivating parameters when necessary. However, implementing dynamic
pruning requires additional computational overhead, as pruning decisions
must be made in real-time, either during training or inference. This makes it
more complex to integrate into standard machine learning pipelines compared
to static pruning.
Despite its challenges, dynamic pruning is particularly useful in edge comput-
ing and adaptive AI systems, where resource constraints and real-time efÏciency
requirements vary across different inputs. The next section explores the practi-
cal considerations and trade-offs involved in choosing the right pruning method
for a given machine learning system.
Structured pruning reduces the number of floating-point operations (FLOPs) required during inference. The downside is that modifying the network structure can lead to a greater accuracy drop, requiring careful fine-tuning to recover lost performance.
Dynamic pruning introduces adaptability into the pruning process by adjust-
ing which parameters are pruned at runtime based on input data or training
dynamics. This allows for a better balance between accuracy and efÏciency, as
the model retains the flexibility to reintroduce previously pruned parameters if
needed. However, dynamic pruning increases implementation complexity, as
it requires additional computations to determine which parameters to prune
on-the-fly.
Table 10.2 summarizes the key structural differences between these pruning
approaches, outlining how each method modifies the model and impacts its
execution.
Figure 10.6 illustrates this process over three cycles. Following each pruning step, the model undergoes fine-tuning to
recover performance. The first iteration, which removes two channels, results
in an accuracy decrease from 0.995 to 0.971, but subsequent fine-tuning restores
accuracy to 0.992. After completing two additional pruning-tuning cycles, the
final model achieves 0.991 accuracy, which represents only a 0.4% reduction
from the original, while operating with 27% fewer channels. By distributing
structural modifications across multiple iterations, the network maintains its
performance capabilities while achieving improved computational efÏciency.
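The loop below sketches this prune-then-fine-tune cycle. The model, pruning step, fine-tuning step, and accuracy deltas are toy stand-ins, not a real training pipeline.

```python
# Schematic of iterative pruning (mirrors Figure 10.6): alternate small pruning
# steps with fine-tuning so accuracy can recover before the next round.
# The "model" and its updates are stand-ins for real implementations.

def prune_channels(model, n):      # stand-in: drop n channels
    model["channels"] -= n
    model["accuracy"] -= 0.02      # pruning temporarily hurts accuracy
    return model

def fine_tune(model):              # stand-in: recover most of the lost accuracy
    model["accuracy"] += 0.018
    return model

model = {"channels": 22, "accuracy": 0.995}
for round_id in range(3):
    model = prune_channels(model, 2)
    model = fine_tune(model)
    print(f"round {round_id + 1}: channels={model['channels']}, "
          f"accuracy={model['accuracy']:.3f}")
```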
Figure 10.6: Iterative pruning over three cycles. In each iteration, selected channels are pruned (test accuracy drops) and the new structure is fine-tuned (accuracy recovers): 0.995 → 0.971 → 0.992 in the first iteration, 0.992 → 0.956 → 0.993 in the second, and 0.993 → 0.967 → 0.991 in the third.

[Figure: iterative procedure in which a percentage of the lowest-magnitude weights is pruned, the remaining weights are reset to their initial values, and the cycle is repeated.]
(Sanh et al. 2019). In computer vision, EfficientNet has been pruned to remove unnecessary filters, optimizing it for deployment in resource-constrained environments (Tan and Le 2019a).
Figure 10.9: Knowledge distillation. The student (distilled) model is trained with two loss terms: a distillation loss computed on temperature-softened predictions (softmax with T = t) against the teacher's soft predictions, and a standard loss computed against the hard ground-truth label y.
The training process for the student model incorporates two loss terms:
• Distillation loss: A loss function (often based on Kullback-Leibler (KL)
divergence) that minimizes the difference between the student’s and
teacher’s soft label distributions.
$$p_i = \frac{\exp(z_i)}{\sum_j \exp(z_j)}$$

With a temperature parameter T, the softened distribution becomes:

$$p_i(T) = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)}$$
The student model is then trained using a loss function that minimizes the
difference between its output distribution and the teacher’s softened output
distribution. The most common formulation combines two loss terms:

$$\mathcal{L} = \alpha \, \mathcal{L}_{\text{CE}}(y_s, y) + (1 - \alpha) \, T^2 \, \mathrm{KL}\!\left(p^{T} \,\|\, p_s^{T}\right)$$

where:
• ℒCE (𝑦𝑠 , 𝑦) is the standard cross-entropy loss between the student’s pre-
dictions 𝑦𝑠 and the ground truth labels 𝑦.
• The second term minimizes the Kullback-Leibler (KL) divergence between the teacher's softened predictions $p_i^T$ and the student's softened predictions $p_{i,s}^T$.
• The factor 𝑇 2 ensures that gradients remain appropriately scaled when
using high-temperature values.
• The hyperparameter 𝛼 balances the importance of the standard training
loss versus the distillation loss.
By learning from both hard labels and soft teacher outputs, the student model
benefits from the generalization power of the teacher, improving its ability to
distinguish between similar classes even with fewer parameters.
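A sketch of this combined objective using standard PyTorch primitives is shown below. The temperature, weighting factor, and random logits are illustrative placeholders.

```python
# Sketch of the combined knowledge-distillation loss:
# alpha * cross-entropy(student, labels) + (1 - alpha) * T^2 * KL(teacher_soft || student_soft).
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    hard_loss = F.cross_entropy(student_logits, labels)
    soft_student = F.log_softmax(student_logits / T, dim=-1)
    soft_teacher = F.softmax(teacher_logits / T, dim=-1)
    # "batchmean" matches the mathematical definition of KL divergence per example.
    soft_loss = F.kl_div(soft_student, soft_teacher, reduction="batchmean")
    return alpha * hard_loss + (1.0 - alpha) * (T ** 2) * soft_loss

student_logits = torch.randn(8, 10)
teacher_logits = torch.randn(8, 10)
labels = torch.randint(0, 10, (8,))
print(distillation_loss(student_logits, teacher_logits, labels).item())
```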
[Figure: example probability distribution over the classes Cat, Dog, and Fox.]
10.4.2.5 Trade-offs
Knowledge distillation is a powerful technique for compressing large models
into smaller, more efÏcient versions while maintaining accuracy. By training a
student model under the supervision of a teacher model, distillation enables bet-
ter generalization and inference efÏciency compared to training a small model
from scratch. It is particularly effective in low-resource environments, such
as mobile devices, edge AI, and large-scale cloud inference, where balancing
accuracy, speed, and memory footprint is essential.
Compared to pruning, distillation preserves accuracy better but comes at
the cost of higher training complexity, as it requires training a new model
instead of modifying an existing one. However, pruning provides a more direct
computational efÏciency gain, especially when structured pruning is used. In
practice, combining pruning and distillation often yields the best trade-off, as
seen in models like DistilBERT and MobileBERT, where pruning first reduces
unnecessary parameters before distillation optimizes a final student model.
Table 10.4 summarizes the key trade-offs between knowledge distillation and
pruning.
Low-rank matrix factorization (LRMF) takes a different route to reducing storage and computational costs. Unlike pruning, which creates sparse representations, or distillation, which requires an additional training process, LRMF is a purely mathematical transformation that decomposes a weight matrix into two or more smaller matrices.
This structured compression is particularly useful in machine learning sys-
tems where efÏciency is a primary concern, such as edge computing, cloud
inference, and hardware-accelerated ML execution. By leveraging low-rank
approximations, models can achieve substantial reductions in parameter stor-
age while maintaining predictive accuracy, making LRMF a valuable tool for
optimizing machine learning architectures.
Training Mathematics. LRMF is a mathematical technique used in linear alge-
bra and machine learning systems to approximate a high-dimensional matrix
by decomposing it into the product of lower-dimensional matrices. This factor-
ization enables a more compact representation of model parameters, reducing
both memory footprint and computational complexity while preserving essen-
tial structural information. In the context of machine learning systems, LRMF
plays a crucial role in optimizing model efÏciency, particularly for resource-
constrained environments such as edge AI and embedded deployments.
Formally, given a matrix 𝐴 ∈ ℝ𝑚×𝑛 , LRMF seeks two matrices 𝑈 ∈ ℝ𝑚×𝑘 and
𝑉 ∈ ℝ𝑘×𝑛 such that:
𝐴 ≈ 𝑈𝑉
where 𝑘 is the rank of the approximation, typically much smaller than both
𝑚 and 𝑛. This approximation is commonly obtained through singular value
decomposition (SVD), where 𝐴 is factorized as:
A = UΣVᵀ
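The sketch below compresses a weight matrix with a rank-k truncated SVD, storing two small factors in place of the full matrix. The matrix shape and rank are illustrative.

```python
# Low-rank compression of a weight matrix via truncated SVD: A ≈ U_k Σ_k V_k^T,
# stored as two factors so that m*k + k*n parameters replace m*n.
import numpy as np

def low_rank_factorize(A: np.ndarray, k: int):
    U, S, Vt = np.linalg.svd(A, full_matrices=False)
    U_k = U[:, :k] * S[:k]          # fold singular values into the left factor
    V_k = Vt[:k, :]
    return U_k, V_k                 # A ≈ U_k @ V_k

rng = np.random.default_rng(0)
A = rng.normal(size=(512, 256))
U_k, V_k = low_rank_factorize(A, k=32)
error = np.linalg.norm(A - U_k @ V_k) / np.linalg.norm(A)
print(f"params: {A.size} -> {U_k.size + V_k.size}, relative error: {error:.3f}")
```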
[Figure: a matrix entry x_i approximated by the product of factors u_i and v_i (low-rank factorization), and a three-way tensor y ∈ ℝ^{M×N×T} whose (i, j, t)-th entry y_{ijt} is approximated from factor matrices U ∈ ℝ^{M×R} and V ∈ ℝ^{N×R} (tensor decomposition).]

For a three-way tensor, a rank-k decomposition expresses the tensor as a sum of rank-one outer products:

$$\mathcal{A} \approx \sum_{r=1}^{k} u_r \otimes v_r \otimes w_r$$
LRMF vs. TD. Both low-rank matrix factorization and tensor decomposition
serve as fundamental techniques for reducing the complexity of machine learning models, differing mainly in whether they operate on two-dimensional matrices or higher-order tensors. Despite these differences, LRMF and tensor decomposition are not mutually
exclusive. In many machine learning models, both methods can be applied
together to optimize different components of the architecture. For example,
fully connected layers may be compressed using LRMF, while convolutional
kernels and attention tensors undergo tensor decomposition. The choice of
technique ultimately depends on the specific characteristics of the model and
the trade-offs between storage efÏciency and computational complexity.
[Figure: one-shot NAS approach, learning model architecture parameters and weights together.]
Rather than simply selecting the most accurate model, NAS identified architectures that provided the best balance between accuracy and inference speed (B. Wu et al.
2019). Similarly, EfÏcientNet was discovered through NAS by jointly optimiz-
ing for accuracy and computational efÏciency, resulting in a model that delivers
state-of-the-art performance while reducing FLOPs compared to conventional
architectures (Tan and Le 2019a).
By integrating these constraints into the search process, NAS systematically
discovers architectures that balance accuracy, efÏciency, and hardware adapt-
ability. Instead of manually fine-tuning these trade-offs, NAS automates the
selection of optimal architectures, ensuring that models are well-suited for
real-world deployment scenarios.
[Figure: energy cost per operation in picojoules. An integer ADD costs roughly 0.03 pJ at 8-bit, 0.05 pJ at 16-bit, and 0.10 pJ at 32-bit, while a 1 MB SRAM read (32-bit) costs about 50 pJ; the chart highlights a roughly 100× spread between narrow integer arithmetic and memory accesses.]
An 8-bit integer addition, for instance, consumes just 0.03 pJ. These savings compound when considering large-scale models operating across billions of operations.
Beyond direct compute savings, reducing numerical precision has a signifi-
cant impact on memory energy consumption, which often dominates total sys-
tem power. Lower-precision representations reduce data storage requirements
and memory bandwidth usage, leading to fewer and more efÏcient memory
accesses. This is critical because accessing memory, particularly off-chip DRAM,
is far more energy-intensive than performing arithmetic operations. For in-
stance, DRAM accesses require orders of magnitude more energy (1.3–2.6 nJ)
compared to cache accesses (e.g., 10 pJ for an 8 KB L1 cache access). The break-
down of instruction energy further underscores the cost of moving data within
the memory hierarchy, where an instruction’s total energy can be significantly
impacted by memory access patterns.
By reducing numerical precision, models can not only execute computations
more efÏciently but also reduce data movement, leading to lower overall energy
consumption. This is particularly important for hardware accelerators and edge
devices, where memory bandwidth and power efÏciency are key constraints.
[Figure: inference latency and model size for FP32 versus INT8 model variants; lower precision reduces both latency (from hundreds of milliseconds to tens of milliseconds) and model size (from over a hundred MB to a few tens of MB or less).]
The figure above illustrates the quantization error weighted by the probability
distribution of values, comparing different numerical formats (FP8 variants
and INT8). The error distribution highlights how different formats introduce
varying levels of quantization noise across the range of values, which in turn
influences model accuracy and stability.
FP16 and bfloat16 formats provide moderate efÏciency gains while preserving
model accuracy. Many AI accelerators, such as NVIDIA Tensor Cores and TPUs,
include dedicated support for FP16 computations, enabling 2× faster matrix
operations compared to FP32. BFloat16, in particular, retains the same 8-bit exponent as FP32, trading mantissa precision for dynamic range and making it more robust to overflow and underflow than Float16. The two half-precision layouts differ as follows:

Float16: 1-bit sign, 5-bit exponent, 10-bit mantissa.
BFloat16: 1-bit sign, 8-bit exponent, 7-bit mantissa.
the chosen range minimizes loss of information and helps preserve the model’s
performance after precision reduction.
The overall workflow of post-training quantization is illustrated in Figure 10.18.
The process begins with a pre-trained model, which serves as the starting point
for optimization. To determine an effective quantization range, a calibration dataset (a representative subset of training or validation data) is passed through the model. This step allows the calibration process to estimate
the numerical distribution of activations and weights, which is then used to
define the clipping range for quantization. Following calibration, the quantiza-
tion step converts the model parameters to a lower-precision format, producing
the final quantized model, which is more efÏcient in terms of memory and
computation.
[Figure 10.18: post-training quantization workflow. A pre-trained model is calibrated with representative data and then quantized to produce the final quantized model.]
Such a calibration method helps avoid the impact of outliers, which are not representative of the general data distribution.
[Figure 10.20: symmetric versus asymmetric calibration. Symmetric calibration maps the range [α = −1, β = 1] to the integer range [−127, 127] with the zero-point at 0; asymmetric calibration maps a shifted range (for example, [α = −0.5, β = 1.5]) to [−128, 127] using a scale S and a nonzero zero-point Z.]
On the left side of Figure 10.20, symmetric calibration is depicted, where the clipping range is centered around zero. The range extends from α = −1 to β = 1, mapping these values to the integer range [−127, 127]. This method ensures that positive and negative values are treated equally, preserving zero-centered symmetry in the quantized representation.
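The sketch below implements this symmetric mapping to int8 in NumPy, using the tensor's maximum absolute value as the calibration range; the data and the range heuristic are illustrative.

```python
# Symmetric int8 quantization sketch: map values in [-max|x|, max|x|] to [-127, 127]
# with a single scale factor and no zero-point, as in symmetric calibration.
import numpy as np

def quantize_symmetric(x: np.ndarray):
    scale = np.max(np.abs(x)) / 127.0          # calibration: clip range = max |x|
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

x = np.random.default_rng(0).normal(scale=0.5, size=1000).astype(np.float32)
q, scale = quantize_symmetric(x)
x_hat = dequantize(q, scale)
print("max quantization error:", np.max(np.abs(x - x_hat)))
```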
Figure 10.21: Quantization granularity with variable clipping ranges: layerwise quantization assigns one range per layer, while channelwise quantization assigns a separate range to each filter or channel. Source: Gholami et al. (2021).
Static vs. Dynamic Quantization. After determining the type and granularity
of the clipping range, practitioners must decide when the clipping ranges are
calculated in their quantization algorithms. Two primary approaches exist for
quantizing activations: static quantization and dynamic quantization.
Static Quantization is the more commonly used approach. In static quantiza-
tion, the clipping range is pre-calculated and remains fixed during inference.
This method does not introduce any additional computational overhead during
runtime, which makes it efÏcient in terms of computational resources. However,
the fixed range can lead to lower accuracy compared to dynamic quantization.
A typical implementation of static quantization involves running a series of
calibration inputs to compute the typical range of activations, as discussed in
works like (Jacob et al. 2018b) and (Yao et al. 2021).
In contrast, Dynamic Quantization dynamically calculates the range for each
activation map during runtime. This approach allows the quantization process
to adjust in real time based on the input, potentially yielding higher accuracy
since the range is specifically calculated for each input activation. However,
dynamic quantization incurs higher computational overhead because the range
must be recalculated at each step. Although this often results in higher accuracy,
the real-time computations can be expensive, particularly when deployed at
scale.
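As one concrete example, PyTorch exposes dynamic quantization of linear layers through torch.quantization.quantize_dynamic; the toy model below is illustrative rather than a deployment recipe.

```python
# Minimal dynamic quantization sketch: weights are stored in int8 and activation
# ranges are computed on the fly at inference time (toy model for illustration).
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(128, 256),
    nn.ReLU(),
    nn.Linear(256, 10),
)

quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 128)
with torch.no_grad():
    print(quantized(x).shape)   # same interface, lower-precision weights
```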
The following table, Table 10.7, summarizes the characteristics of post-training
quantization, quantization-aware training, and dynamic quantization, provid-
ing an overview of their respective strengths, limitations, and trade-offs. These
methods are widely deployed across machine learning systems of varying scales,
and understanding their pros and cons is crucial for selecting the appropriate
approach for a given application.
PTQ Advantages. One of the key advantages of PTQ is its low computational
cost, as it does not require retraining the model. This makes it an attractive
option for the rapid deployment of trained models, particularly when retrain-
ing is computationally expensive or infeasible. Since PTQ only modifies the
numerical representation of weights and activations, the underlying model architecture remains unchanged.
[Figure: quantization-aware training workflow: pre-trained model → retraining/fine-tuning → quantized model.]
In many cases, QAT can also build off PTQ, as shown in Figure 10.24. Instead
of starting from a full-precision model, PTQ is first applied to produce an
initial quantized model, leveraging calibration data to determine appropriate
quantization parameters. This PTQ model then serves as the starting point for
QAT, where additional fine-tuning with training data helps the model better
adapt to low-precision constraints. This hybrid approach benefits from the
efÏciency of PTQ while reducing the accuracy degradation typically associated
with post-training quantization alone.
[Figure 10.24: hybrid workflow: a pretrained model is first converted with PTQ; the resulting PTQ model is then fine-tuned on training data with QAT to produce the final QAT model.]
10.5.6.1 Binarization
Binarization involves reducing weights and activations to just two values, typ-
ically -1 and +1, or 0 and 1, depending on the specific method. The primary
advantage of binarization lies in its ability to drastically reduce the size of a
model, allowing it to fit into a very small memory footprint. This reduction
also accelerates inference, especially when deployed on specialized hardware
such as binary neural networks (Rastegari et al. 2016). However, binarization
introduces significant challenges, primarily in terms of model accuracy. When
weights and activations are constrained to only two values, the expressiveness
of the model is greatly reduced, which can lead to a loss in accuracy, particularly
in tasks requiring high precision, such as image recognition or natural language
processing (Hubara et al. 2018).
Moreover, the process of binarization introduces non-differentiable oper-
ations, which complicates the optimization process. To address this issue,
techniques such as the STE are employed to approximate gradients, allowing
for effective backpropagation despite the non-differentiability of the quanti-
zation operation (Y. Bengio, Léonard, and Courville 2013b). The use of STE
ensures that the network can still learn and adjust during training, even with
the extreme precision reduction. While these challenges are non-trivial, the
potential benefits of binarized models in ultra-low-power environments, such
as edge devices and IoT sensors, make binarization an exciting area of research.
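A minimal sketch of binarization with a straight-through estimator in PyTorch is shown below; the clipping rule in the backward pass is one common stabilization choice, not the only one.

```python
# Binarization with a straight-through estimator (STE): the forward pass uses sign(w),
# while the backward pass lets the gradient through unchanged within a clipping range.
import torch

class BinarizeSTE(torch.autograd.Function):
    @staticmethod
    def forward(ctx, w):
        ctx.save_for_backward(w)
        return torch.sign(w)                 # weights become -1 / +1 (exact zeros map to 0)

    @staticmethod
    def backward(ctx, grad_output):
        (w,) = ctx.saved_tensors
        # STE: pass the gradient through, but zero it where |w| > 1 to stabilize training.
        return grad_output * (w.abs() <= 1).float()

w = torch.randn(5, requires_grad=True)
loss = BinarizeSTE.apply(w).sum()
loss.backward()
print(w.grad)
```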
10.5.6.2 Ternarization
Ternarization extends binarization by allowing three possible values for weights
and activations—typically -1, 0, and +1. While ternarization still represents
a significant reduction in precision, it offers a slight improvement in model
accuracy over binarization, as the additional value (0) provides more flexibility
in capturing the underlying patterns (Zhu et al. 2017). This additional precision
comes at the cost of increased complexity, both in terms of computation and
the required training methods. Similar to binarization, ternarization is often
implemented using techniques that approximate gradients, such as the hard
thresholding method or QAT, which integrate quantization effects into the
training process to mitigate the accuracy loss (J. Choi et al. 2018).
The advantages of ternarization over binarization are most noticeable when dealing with highly sparse data.²⁸ In some cases, ternarization can introduce more sparsity into the model by mapping a large portion of weights to zero. However, managing this sparsity effectively requires careful implementation to avoid the overhead that comes with storing sparse matrices (F. Li et al. 2016).

²⁸ Data with a high proportion of zero values, often found in large datasets.
Additionally, while ternarization improves accuracy compared to binarization,
it still represents a severe trade-off in terms of the model’s ability to capture
intricate relationships between inputs and outputs. The challenge, therefore,
lies in finding the right balance between the memory and computational sav-
ings offered by ternarization and the accuracy loss incurred by reducing the
precision.
EfficientNet's compound scaling rule, for example, scales network depth, width, and input resolution jointly through a single coefficient φ:

$$d = \alpha^{\phi} d_0, \qquad w = \beta^{\phi} w_0, \qquad r = \gamma^{\phi} r_0$$
typically computes a new set of feature maps, increasing the model’s memory
footprint. However, DenseNet reduces the need for redundant activations by
reusing feature maps from previous layers and selectively applying transforma-
tions. This method reduces the total number of feature maps that need to be
stored, which in turn lowers the memory requirements without sacrificing accu-
racy. In a standard convolutional network with 𝐿 layers, if each layer generates
𝑘 new feature maps, the total number of feature maps grows linearly:
𝒪(𝐿𝑘)
In contrast, DenseNet reuses feature maps from earlier layers, reducing the
number of feature maps stored. This leads to improved parameter efÏciency
and a reduced memory footprint, which is essential for hardware with limited
memory resources.
Another useful technique is activation checkpointing, which is especially
beneficial during training. In a typical neural network, backpropagation re-
quires storing all forward activations for the backward pass. This can lead to
a significant memory overhead, especially for large models. Activation check-
pointing reduces memory consumption by only storing a subset of activations
and recomputing the remaining ones when needed. If an architecture requires
storing 𝐴total activations, the standard backpropagation method requires the
full storage:
𝒪(𝐴total )
With activation checkpointing, however, only a fraction of activations is
stored, and the remaining ones are recomputed on-the-fly, reducing storage
requirements to:
𝒪(√𝐴total )
This technique can significantly reduce peak memory consumption, making it
particularly useful for training large models on hardware with limited memory.
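In PyTorch, this behavior is available through torch.utils.checkpoint; the sketch below wraps an illustrative block so its activations are recomputed during the backward pass rather than stored.

```python
# Activation checkpointing sketch: the checkpointed block does not store its
# intermediate activations; they are recomputed during the backward pass.
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

block = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU(), nn.Linear(1024, 1024))
head = nn.Linear(1024, 10)

x = torch.randn(32, 1024, requires_grad=True)
hidden = checkpoint(block, x, use_reentrant=False)  # activations recomputed on backward
loss = head(hidden).sum()
loss.backward()
print(x.grad.shape)
```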
Parameter reduction is another essential technique, particularly for models
that use large filters. For instance, SqueezeNet uses a novel architecture where
it applies 1 × 1 convolutions to reduce the number of input channels before ap-
plying standard convolutions.30 By first reducing the number of channels with
1 × 1 convolutions, SqueezeNet reduces the model size significantly without compromising the model's expressive power. The number of parameters in a standard convolutional layer is:

$$\mathcal{O}(C_{\text{in}} \cdot C_{\text{out}} \cdot k^2)$$

³⁰ SqueezeNet achieves similar accuracy to AlexNet while being 50 times smaller.
By reducing 𝐶in using 1 × 1 convolutions, SqueezeNet reduces the number
of parameters, achieving a 50x reduction in model size compared to AlexNet
while maintaining similar performance. This method is particularly valuable
for edge devices that have strict memory and storage constraints.
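A back-of-the-envelope comparison using the parameter count above illustrates the effect of a 1 × 1 "squeeze" layer; the channel counts below are hypothetical.

```python
# Parameter counts for a 3x3 convolution with and without a 1x1 squeeze layer
# that first reduces the channel count (channel numbers are illustrative).
def conv_params(c_in, c_out, k):
    return c_in * c_out * k * k     # O(C_in * C_out * k^2), ignoring biases

c_in, c_out, squeeze = 256, 256, 32

direct = conv_params(c_in, c_out, 3)
squeezed = conv_params(c_in, squeeze, 1) + conv_params(squeeze, c_out, 3)
print(f"direct 3x3: {direct:,} params")            # 589,824
print(f"1x1 squeeze + 3x3: {squeezed:,} params")   # 8,192 + 73,728 = 81,920
print(f"reduction: {direct / squeezed:.1f}x")
```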
These memory-efÏcient techniques, including feature reuse, activation check-
pointing, and parameter reduction, are key components of hardware-aware
model design. By minimizing memory usage and efÏciently managing storage,
these techniques allow machine learning models to fit within the memory limits
of modern accelerators, such as GPUs, TPUs, and edge devices. These strategies
Figure 10.27: A Switch Transformer block, an example of a Mixture of Experts (MoE) architecture. After self-attention and add-and-normalize layers, a router directs each token (x1, x2) to one of several expert feed-forward networks (FFN1–FFN4) in the switching FFN layer, increasing the parameter count available to the model. Source: Fedus, Zoph, and Shazeer (2021).
$$S = \frac{\|\mathbf{1}_{\{T_{ij}=0\}}\|_0}{m \times n}$$

where 1_{T_ij=0} is an indicator function that yields 1 if T_ij = 0 and 0 otherwise, and ‖·‖₀ represents the L0 norm, which counts the number of non-zero elements.

Due to the nature of floating-point representations, we often extend this definition to include elements that are close to zero. This leads to:

$$S_{\epsilon} = \frac{\|\mathbf{1}_{\{|T_{ij}| < \epsilon\}}\|_0}{m \times n}$$
where 𝜖 is a small threshold value.
Sparsity can emerge naturally during training, often as a result of regulariza-
tion techniques, or be deliberately introduced through methods like pruning,
where elements below a specific threshold are forced to zero. Effectively exploit-
ing sparsity leads to significant computational efÏciency, memory savings, and
reduced power consumption, which are particularly valuable when deploying
models on devices with limited resources, such as mobile phones, embedded
systems, and edge devices.
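The following sketch computes both quantities for a small tensor; the example values are arbitrary.

```python
# Compute the sparsity ratio S (fraction of zero entries) and the epsilon-tolerant
# variant S_eps (fraction of near-zero entries), following the definitions above.
import numpy as np

def sparsity(T: np.ndarray, eps: float = 0.0) -> float:
    return float(np.count_nonzero(np.abs(T) <= eps)) / T.size

T = np.array([[2.0, 0.0, 0.0, 1e-6],
              [0.0, 3.0, 0.0, 0.0]])
print("S     =", sparsity(T))             # exact zeros: 5/8
print("S_eps =", sparsity(T, eps=1e-4))   # near-zeros included: 6/8
```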
Unstructured sparsity typically arises from techniques like pruning, where weights that are considered less important (often based on magnitude or other criteria) are removed. While unstructured sparsity is highly flexible and can be applied to any part of the network, it can be less efficient on hardware since it lacks a predictable structure.
efÏcient on hardware since it lacks a predictable structure. In practice, exploiting
unstructured sparsity requires specialized hardware or software optimizations
to make the most of it.
In contrast, structured sparsity involves removing entire components of the
network, such as filters, neurons, or channels, in a more systematic manner. By
eliminating entire parts of the network, structured sparsity is more efÏcient on
hardware accelerators like GPUs or TPUs, which can leverage this structure
for faster computations. Structured sparsity is often used when there is a need
for predictability and efÏciency in computational resources, as it enables the
hardware to fully exploit regular patterns in the network.
As a simple example, a sparse matrix-vector product only needs to touch the nonzero entries:

$$\begin{bmatrix} 2 & 0 & 0 & 1 \\ 0 & 3 & 0 & 0 \\ 4 & 0 & 5 & 0 \\ 0 & 0 & 0 & 6 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \end{bmatrix} = \begin{bmatrix} 2x_1 + x_4 \\ 3x_2 \\ 4x_1 + 5x_3 \\ 6x_4 \end{bmatrix}$$

[Figure: dense versus sparse matrix multiplication on NVIDIA GPUs; zero entries in the sparse operand are skipped when accumulating the result. Source: PyTorch Blog.]
GPUs and Sparse Operations. Graphics Processing Units (GPUs) are widely
recognized for their ability to perform highly parallel computations, making
them ideal for handling the large-scale matrix operations that are common
in machine learning. Modern GPUs, such as NVIDIA’s Ampere architecture,
include specialized Sparse Tensor Cores that accelerate sparse matrix multiplica-
tions. These tensor cores are designed to recognize and skip over zero elements
in sparse matrices, thereby reducing the number of operations required (Ab-
delkhalik et al. 2022). This is particularly advantageous for structured pruning
techniques, where entire filters, channels, or layers are pruned, resulting in a
significant reduction in the amount of computation. By skipping over the zero
values, GPUs can speed up matrix multiplications by a factor of two or more,
resulting in lower processing times and reduced power consumption for sparse
networks.
Furthermore, GPUs leverage their parallel architecture to handle multiple
operations simultaneously. This parallelism is especially beneficial for sparse
operations, as it allows the hardware to exploit the inherent sparsity in the
data more efÏciently. However, the full benefit of sparse operations on GPUs
requires that the sparsity is structured in a way that aligns with the underlying
hardware architecture, making structured pruning more advantageous for
optimization (Hoefler, Alistarh, Ben-Nun, Dryden, and Peste 2021).
TPUs and Sparse Optimization. TPUs, developed by Google, are custom-built
hardware accelerators specifically designed to handle tensor computations at a
much higher efÏciency than traditional processors. TPUs, such as TPU v4, have
built-in support for sparse weight matrices, which is particularly beneficial for
models like transformers, including BERT and GPT, that rely on large-scale
matrix multiplications (Jouppi et al. 2021a). TPUs optimize sparse weight
matrices by reducing the computational load associated with zero elements,
enabling faster processing and improved energy efÏciency.
The efficiency of TPUs comes from their ability to perform operations at
high throughput and low latency, thanks to their custom-designed matrix
multiply units. These units are able to accelerate sparse matrix operations by
directly processing the non-zero elements, making them well-suited for models
that incorporate significant sparsity, whether through pruning or low-rank
approximations. As the demand for larger models increases, TPUs continue to
play a critical role in maintaining performance while minimizing the energy
and computational cost associated with dense computations.
FPGAs and Sparse Computations. Field-Programmable Gate Arrays (FPGAs)
are another important class of hardware accelerators for sparse networks. Un-
like GPUs and TPUs, FPGAs are highly customizable, offering flexibility in their
design to optimize specific computational tasks. This makes them particularly
suitable for sparse operations that require fine-grained control over hardware
execution. FPGAs can be programmed to perform sparse matrix-vector multipli-
cations and other sparse matrix operations with minimal overhead, delivering
high performance for models that use unstructured pruning or require custom
sparse patterns.
One of the main advantages of FPGAs in sparse networks is their ability to be
tailored for specific applications, which allows for optimizations that general-
purpose hardware cannot achieve. For instance, an FPGA can be designed to
skip over zero elements in a matrix by customizing the data path and memory
management, providing significant savings in both computation and memory
usage. FPGAs also allow for low-latency execution, making them well-suited for
real-time applications that require efficient processing of sparse data streams.
Memory and Energy Optimization. One of the key challenges in sparse net-
works is managing memory bandwidth, as matrix operations often require
significant memory access. Sparse networks offer a solution by reducing the
number of elements that need to be accessed, thus minimizing memory traf-
fic. Hardware accelerators are optimized for these sparse matrices, utilizing
specialized memory access patterns that skip zero values, reducing the total
amount of memory bandwidth used (Baraglia and Konno 2019).
For example, GPUs and TPUs are designed to minimize memory access
latency by taking advantage of their high memory bandwidth. By accessing
only non-zero elements, these accelerators ensure that memory is used more efficiently. The memory hierarchies in these devices are also optimized for
sparse computations, allowing for faster data retrieval and reduced power
consumption.
The reduction in the number of computations and memory accesses directly translates into energy savings. Sparse operations require fewer arithmetic operations and fewer memory fetches, leading to a decrease in the energy consumption required for both training and inference. This energy efficiency is especially valuable in power-constrained settings such as mobile and edge deployments.
Sparsity and Pruning. Pruning and sparsity are closely related techniques.
Pruning is the process of removing unimportant weights or entire components
from a network, typically resulting in a sparse model. The goal of pruning is to
reduce the number of parameters and operations required during inference, and
it inherently leads to sparsity in the model. However, the interaction between
pruning and sparsity is not always straightforward.
When pruning is applied, the resulting model may become sparse, but the sparsity pattern, such as whether it is structured or unstructured, affects how effectively the model can be optimized for hardware. For example, structured pruning (e.g., pruning entire filters or layers) typically results in more efficient sparsity, as hardware accelerators like GPUs and TPUs are better equipped to handle regular patterns in sparse matrices (Elsen et al. 2020). Unstructured pruning, on the other hand, can introduce irregular sparsity patterns, which may not be as efficiently processed by hardware, especially when combined with other techniques like quantization.
Pruning methods often rely on the principle of removing weights that have
little impact on the model’s performance, but when combined with sparsity,
they require careful coordination with hardware-specific optimizations. For
instance, sparse patterns created by pruning need to align with the underlying
hardware architecture to achieve the desired computational savings (Gale, Elsen,
and Hooker 2019b).
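As a concrete sketch, PyTorch's torch.nn.utils.prune module supports both styles on a single layer; the layer size and pruning amounts below are arbitrary choices for illustration.

import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

layer = nn.Linear(256, 512)

# Unstructured: zero the 50% smallest-magnitude weights anywhere in the matrix
prune.l1_unstructured(layer, name="weight", amount=0.5)

# Structured: remove 25% of entire output rows (neurons), ranked by L2 norm
prune.ln_structured(layer, name="weight", amount=0.25, n=2, dim=0)

sparsity = (layer.weight == 0).float().mean().item()
print(f"Overall weight sparsity: {sparsity:.2%}")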
Sparsity and Model Design. EfÏcient model design focuses on creating ar-
chitectures that are inherently efficient, without the need for extensive post-
training optimizations like pruning or quantization. Techniques like depthwise
separable convolutions, low-rank approximation, and dynamic computation
contribute to sparsity indirectly by reducing the number of parameters or the
computational complexity required by a network.
Figure: A simplified AutoML pipeline, spanning data preprocessing and model training.
We will explore the core aspects of AutoML, starting with the key dimensions
of optimization, followed by the methodologies used in AutoML systems, and
concluding with challenges and limitations. By the end, we will understand
how AutoML serves as an integrative framework that unifies many of the
optimization strategies discussed earlier in this chapter.
In doing so, AutoML helps bridge the gap between academic research and industrial applications, enabling the widespread deployment of efficient machine learning models.
import torch
# Eager-mode quantization-aware training utilities
from torch.quantization import QuantStub, DeQuantStub, prepare_qat
Beyond static snapshots, trend plots track sparsity progression across multi-
ple pruning iterations. These visualizations illustrate how global model sparsity
evolves, often showing an initial rapid increase followed by more gradual re-
finements. Tools like TensorFlow’s Model Optimization Toolkit and SparseML’s
monitoring utilities provide such tracking capabilities, displaying per-layer
pruning levels over time. These insights allow practitioners to fine-tune prun-
ing strategies by adjusting sparsity constraints for individual layers.
Libraries such as DeepSparse’s visualization suite and PyTorch’s pruning
utilities enable the generation of these visualization tools, helping analyze how
pruning decisions affect different model components. By making sparsity data
visually accessible, these tools help practitioners optimize their models more
effectively.
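A minimal sketch of how such a per-layer sparsity report can be produced by hand in PyTorch is shown below; the model is a stand-in, and the dedicated tools mentioned above add trend tracking on top of this kind of measurement.

import torch
import torch.nn as nn

def report_sparsity(model: nn.Module) -> None:
    # Print the fraction of zero-valued weights in each layer and overall
    total, zeros = 0, 0
    for name, param in model.named_parameters():
        if "weight" not in name:
            continue
        layer_zeros = (param == 0).sum().item()
        print(f"{name}: {layer_zeros / param.numel():.2%} sparse")
        total += param.numel()
        zeros += layer_zeros
    print(f"Global sparsity: {zeros / total:.2%}")

model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10))
report_sparsity(model)   # near 0% before any pruning is applied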
10.9 Conclusion
This chapter has explored the multifaceted landscape of model optimization, a
critical process for translating machine learning advancements into practical,
real-world systems. We began by recognizing the inherent tension between
model accuracy and efficiency, driven by constraints such as computational cost,
memory limitations, and energy consumption. This necessitates a systematic
approach to refining models, ensuring they remain effective while operating
within the boundaries of real-world deployment environments.
We examined three core dimensions of model optimization: optimizing model representation, numerical precision, and architectural efficiency. Within
each dimension, we delved into specific techniques, such as pruning, knowl-
edge distillation, quantization, and dynamic computation, highlighting their
trade-offs and practical considerations. We also emphasized the importance of
hardware-aware model design, recognizing that aligning model architectures
with the underlying hardware capabilities is crucial for maximizing performance and efficiency.
Finally, we explored AutoML as a holistic approach to model optimization, automating many of the tasks that traditionally require manual effort and expertise.
10.10 Resources
Slides
• Coming soon.
Videos
• Coming soon.
Exercises
• Coming soon.
Chapter 11
AI Acceleration
Purpose
How does hardware acceleration impact machine learning system performance, and what
principles should ML engineers understand to effectively design and deploy systems?
Machine learning systems have driven a fundamental shift in computer architecture. Traditional processors, designed for general-purpose computing, prove inefficient for the repeated mathematical operations and data movement patterns in neural networks. Modern accelerators address this challenge by matching hardware structures to ML computation patterns. These accelerators introduce fundamental trade-offs in performance, power consumption, and flexibility. Effective utilization of hardware acceleration requires an understanding of these trade-offs, as well as the architectural principles that govern accelerator design. By learning to optimize and map models effectively onto specific hardware platforms, engineers can balance computational efficiency against power, cost, and flexibility constraints.
Learning Objectives
11.1 Overview
Machine learning has driven a fundamental shift in computer architecture,
pushing beyond traditional general-purpose processors toward specialized
acceleration. The computational demands of modern machine learning models
exceed the capabilities of conventional CPUs, which were designed for sequen-
tial execution. Instead, machine learning workloads exhibit massive parallelism,
high memory bandwidth requirements, and structured computation patterns
that demand purpose-built hardware for efficiency and scalability. Machine
Learning Accelerators (ML Accelerators) have emerged as a response to these
challenges.
Definition of ML Accelerator
Many of the principles that shaped the development of early floating-point and graphics accelerators now inform the design of AI-specific hardware. Examining these past trends offers a systematic framework for analyzing contemporary approaches to AI acceleration and anticipating future developments in specialized computing.
Figure: Milestones in hardware acceleration, from floating-point and signal processing (Texas Instruments TMS32010 DSP, 1983; integration of the FPU into the Intel 486DX, 1989), through 3D graphics and multimedia (NVIDIA GeForce 256, 1999, the first programmable GPU; the rise of SIMD processing units), real-time media coding and network processing (Intel IXP2800 network processor; dedicated hardware for streaming and encoding), to deep learning tensor operations (NVIDIA Tensor Cores for DL acceleration; ML frameworks optimizing for specialized hardware) and application-specific acceleration (multi-chip and wafer-scale ML acceleration; AI-specific memory optimizations).
# High-level framework call: a 512-unit dense (fully connected) layer
dense = Dense(512)(input_tensor)

# Equivalent scalar pseudocode: one multiply-accumulate per weight
for n in range(batch_size):
    for m in range(output_size):
        acc = bias[m]
        for k in range(input_size):
            acc += input[n, k] * weights[k, m]
        output[n, m] = activation(acc)
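For contrast, the same layer written as a single vectorized matrix operation (a sketch with hypothetical NumPy arrays for the inputs, weights, and bias) lets wide vector units process many multiply-accumulates per instruction.

import numpy as np

inputs = np.random.randn(32, 256)     # batch of 32 examples, 256 features each
weights = np.random.randn(256, 512)
bias = np.random.randn(512)

# One matrix multiply replaces the three nested scalar loops above
output = np.maximum(inputs @ weights + bias, 0.0)   # ReLU activation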
The efÏciency gains from vector processing extend beyond instruction count
reduction. Memory bandwidth utilization improves as vector loads transfer
multiple values per operation. Energy efficiency increases because control logic
is shared across multiple operations. These improvements compound across
the deep layers of modern neural networks, where billions of operations execute
for each forward pass.
A fully connected layer mapping 256 inputs to 512 outputs contains 256 × 512 = 131,072 parameters that define these transformations, illustrating why efficient matrix multiplication becomes crucial for performance.
This matrix processing unit can handle 16 × 16 blocks of the linear layer
computation described earlier, processing 256 multiply-accumulate operations
simultaneously compared to the 8 operations possible with vector processing.
These matrix operations complement vectorized computation by enabling struc-
tured many-to-many transformations. The interplay between matrix and vector
operations shapes the efficiency of neural network execution.
import torch
import torch.nn as nn

layer = nn.Sequential(
    nn.Linear(256, 512),
    nn.ReLU(),
    nn.BatchNorm1d(512)
)
input_tensor = torch.randn(32, 256)   # batch of 32 examples, 256 features each
output = layer(input_tensor)          # shape: (32, 512)
Even a simple ReLU activation introduces branching logic that can disrupt instruction pipelining; see Listing 11.13 for an example.
A single tensor core instruction processes an entire matrix block while maintaining intermediate results in local registers, significantly improving computational efficiency.

Figure: Simplified tensor-core datapath, with control logic feeding an array of multiply units whose partial sums are accumulated until the block is done.
Table 11.5: Tensor core and CUDA core precisions across GPU architectures.

Architecture | Year | Supported Tensor Core Precisions       | Supported CUDA Core Precisions
Volta        | 2017 | FP16                                   | FP64, FP32, FP16
Turing       | 2018 | FP16, INT8                             | FP64, FP32, FP16, INT8
Ampere       | 2020 | FP64, TF32, bfloat16, FP16, INT8, INT4 | FP64, FP32, FP16, bfloat16, INT8
Table 11.6 highlights how execution unit configurations vary across architectures to optimize for different deep learning workloads. Training accelerators prioritize high-throughput floating-point tensor operations, whereas inference processors focus on low-precision integer execution for efficiency. Meanwhile, mobile accelerators balance precision and power efficiency to meet real-time constraints.
While execution units define the compute potential of an accelerator, their effectiveness is fundamentally constrained by data movement and memory hierarchy. Achieving high utilization of compute resources requires efficient memory systems that minimize data transfer overhead and optimize locality. The next section explores these architectural challenges, focusing on how memory hierarchy impacts AI accelerator performance.
Preventing compute units from being stalled by memory latency and bandwidth constraints is one of the central challenges in AI acceleration.

Figure: The memory wall, with the widening gap between compute and memory plotted on a logarithmic scale.
We can express the memory transfer time as
$$T_{\text{mem}} = \frac{M_{\text{total}}}{B_{\text{mem}}},$$
where $M_{\text{total}}$ is the total data volume and $B_{\text{mem}}$ is the available memory bandwidth. In contrast, the compute time is given by
$$T_{\text{compute}} = \frac{\text{FLOPs}}{P_{\text{peak}}},$$
with the number of floating-point operations (FLOPs) divided by the peak hardware throughput, $P_{\text{peak}}$. When $T_{\text{mem}} > T_{\text{compute}}$, the system becomes memory-
bound, meaning that the processing elements spend more time waiting for
data than performing computations. This imbalance demonstrates the need
for memory-optimized architectures and efficient data movement strategies to
sustain high performance.
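As a toy illustration of this balance (with made-up layer sizes and hardware numbers, not measurements of any particular accelerator), the sketch below estimates both times for a single matrix multiplication and reports which regime it falls into.

M = N = K = 4096                            # hypothetical GEMM dimensions
flops = 2 * M * N * K                       # each multiply-accumulate counted as 2 ops
bytes_moved = 2 * (M * K + K * N + M * N)   # FP16 = 2 bytes per element, one pass

peak_flops = 100e12                         # assumed 100 TFLOP/s peak throughput
mem_bandwidth = 2e12                        # assumed 2 TB/s memory bandwidth

t_compute = flops / peak_flops
t_mem = bytes_moved / mem_bandwidth
print(f"T_compute = {t_compute*1e6:.0f} us, T_mem = {t_mem*1e6:.0f} us")
print("memory-bound" if t_mem > t_compute else "compute-bound")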
Figure 11.6 demonstrates the emerging challenge between model growth
and hardware memory capabilities, illustrating the “AI Memory Wall.” The
figure tracks AI model sizes (red dots) and hardware memory bandwidth (blue
dots) over time on a log scale. Model parameters have grown exponentially,
from AlexNet’s modest 60M parameters in 2012 to Gemini 1’s trillion-scale
parameters in 2023, as shown by the steeper red trend line. In contrast, hard-
ware memory bandwidth, represented by successive generations of NVIDIA
GPUs (~100-200 GB/s) and Google TPUs (~2-3 TB/s), has increased more grad-
ually (blue trend line). The expanding shaded region between these trends
corresponds to the “AI Memory Wall,” which will be an architectural challenge
where model scaling outpaces available memory bandwidth. This growing
disparity necessitates increasingly sophisticated memory management and
model optimization techniques to maintain computational efficiency.
Figure 11.6: Model growth (in parameters) versus memory bandwidth (in GB/s), with models such as GPT-3, PaLM, GPT-4, and Gemini 1 marking the growth trend.
One key source of irregularity in ML workloads stems from batch size and
execution order. The way input data is processed in batches directly affects
memory reuse, creating a complex optimization challenge. Small batch sizes
decrease the likelihood of reusing cached activations and weights, resulting
in frequent memory fetches from slower, off-chip memory. Larger batch sizes
can improve reuse and amortize memory access costs, but simultaneously
place higher demands on available memory bandwidth, potentially creating
congestion at different memory hierarchy levels. This delicate balance requires
careful consideration of model architecture and available hardware resources.
In addition to batch size, different neural network layers interact with memory in distinct ways. Convolutional layers benefit from spatial locality, as neighboring pixels in an image are processed together, allowing for efficient caching of small weight kernels. Conversely, fully connected layers require frequent access to large weight matrices, often leading to more randomized memory access patterns that poorly align with standard caching policies. Transformers combine both behaviors: attention layers reuse activations across a sequence but also produce large intermediate matrices whose access patterns vary with sequence length.
Table 11.8: Memory hierarchy characteristics and their impact on machine learning.

Memory Level                     | Approx. Latency | Bandwidth | Capacity   | Example Use in Deep Learning
Registers                        | ~1 cycle        | Highest   | Few values | Storing operands for immediate computation
L1/L2 Cache (SRAM)               | ~1-10 ns        | High      | KBs-MBs    | Caching frequently accessed activations and small weight blocks
Scratchpad Memory                | ~5-20 ns        | High      | MBs        | Software-managed storage for intermediate computations
High-Bandwidth Memory (HBM)      | ~100 ns         | Very High | GBs        | Storing large model parameters and activations for high-speed access
Off-Chip DRAM (DDR, GDDR, LPDDR) | ~50-150 ns      | Moderate  | GBs-TBs    | Storing entire model weights that do not fit on-chip
Flash Storage (SSD/NVMe)         | ~100 µs - 1 ms  | Low       | TBs        | Storing pre-trained models and checkpoints for later loading
HBM achieves its high bandwidth by stacking multiple memory dies and using wide memory interfaces, allowing it to transfer large amounts of data with minimal latency compared to traditional DRAM. Because of its
high bandwidth and lower latency, HBM is often used to store entire layers of
machine learning models that must be accessed quickly during execution. How-
ever, its cost and power consumption limit its use primarily to high-performance
AI accelerators, making it less common in power-constrained environments
such as edge devices.
When a machine learning model exceeds the capacity of on-chip memory
and HBM, it must rely on off-chip DRAM, such as DDR, GDDR, or LPDDR.
While DRAM offers significantly greater storage capacity, its access latency is
higher, meaning that frequent retrievals from DRAM can introduce execution
bottlenecks. To make effective use of DRAM, models must be structured so
that only the necessary portions of weights and activations are retrieved at any
given time, minimizing the impact of long memory fetch times.
At the highest level of the hierarchy, flash storage and solid-state drives (SSDs)
store large pre-trained models, datasets, and checkpointed weights. These
storage devices offer large capacities but are too slow for real-time execution,
requiring models to be loaded into faster memory tiers before computation
begins. For instance, in training scenarios, checkpointed models stored in
SSDs must be loaded into DRAM or HBM before resuming computation, as
direct execution from SSDs would be too slow to maintain efficient accelerator
utilization (D. Narayanan et al. 2021a).
The memory hierarchy balances competing objectives of speed, capacity,
and energy efficiency. However, moving data through multiple memory lev-
els introduces bottlenecks that limit accelerator performance. Data transfers
between memory levels incur latency costs, particularly for off-chip accesses.
Limited bandwidth restricts data flow between memory tiers. Memory capacity
constraints force constant data movement as models exceed local storage.
Figure: Host-accelerator execution flow, in which input data is copied to the device (1), the kernel executes and stores its results, and the result is copied back to the host (4).

Unified Memory eliminates the need for manual data copying. It provides an abstraction that allows both the host and accelerator to access a single, shared memory space, automatically handling data movement when needed.
With Unified Memory, data does not need to be explicitly copied between
CPU and GPU memory before execution. Instead, when a computation requires
a memory region that is currently located in host memory, the system automat-
ically migrates it to the accelerator, handling step (1) transparently. Similarly,
when computed results are accessed by the CPU, step (4) occurs automatically,
eliminating the need for manual memory management.
Although Unified Memory simplifies programming, it introduces perfor-
mance trade-offs. Since memory migrations occur on demand, they can lead
to unpredictable latencies, particularly if large datasets need to be transferred
frequently. Additionally, since Unified Memory is implemented through page
migration techniques, small memory accesses can trigger excessive data move-
ment, further reducing efficiency.
For AI workloads that require fine-grained memory control, explicit data
transfers using PCIe, NVLink, and DMA often provide better performance.
However, for applications where ease of development is more important than
absolute speed, Unified Memory offers a convenient alternative.
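In PyTorch, for instance, the explicit-control path can be approximated with pinned host memory and asynchronous copies; the sketch below illustrates the idea rather than a complete pipeline, and it assumes a CUDA-capable GPU is available.

import torch

# Pinned (page-locked) host memory allows asynchronous DMA transfers
batch = torch.randn(64, 3, 224, 224).pin_memory()

stream = torch.cuda.Stream()
with torch.cuda.stream(stream):
    # Explicit, asynchronous host-to-device copy
    batch_gpu = batch.to("cuda", non_blocking=True)
    result = batch_gpu.mean(dim=(1, 2, 3))     # stand-in for real kernel work

stream.synchronize()                           # wait for the asynchronous work
result_cpu = result.cpu()                      # explicit device-to-host copy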
Each model type presents unique challenges that directly impact accelerator
design. MLPs benefit from fast streaming access to dense weight matrices,
making memory bandwidth a critical factor in performance, especially when
transferring large weights from host memory to accelerator memory. CNNs,
with their high activation reuse and structured memory access patterns, can
leverage on-chip caching and tiling strategies to minimize off-chip memory traffic.
The challenge is even greater in graph neural networks (GNNs), where com-
putation depends on sparse and dynamically changing graph structures. Unlike
CNNs, which operate on dense and regularly structured data, GNNs must pro-
cess nodes and edges with highly variable degrees of connectivity. Some regions
of a graph may require significantly more computation than others, making
workload balancing across processing elements difficult (Zheng et al. 2020). If computations are not placed strategically, some compute units will sit idle while others remain overloaded, leading to underutilization and inefficiencies
in execution.
Poor computation placement adversely affects AI execution by creating workload imbalance, inducing excessive data movement, and causing execution stalls and bottlenecks. Specifically, an uneven distribution of computations can lead to idle processing elements, thereby preventing full hardware utilization and diminishing throughput. In addition, inefficient execution assignment increases memory traffic by necessitating frequent data transfers between memory hierarchies, which in turn introduces latency and raises power consumption. Finally, such misallocation can cause operations to wait on data dependencies, resulting in pipeline inefficiencies that ultimately lower overall system performance.
Ultimately, computation placement is not just about assigning operations to processing elements—it is about ensuring that models execute efficiently given their unique computational structure. A well-placed workload reduces execution time, memory overhead, and power consumption, while a poorly placed one can lead to stalled execution pipelines and inefficient resource utilization. The next section explores the key considerations that must be addressed to ensure that computation placement is both efficient and adaptable to different model architectures.
Table 11.11: Key challenges in memory allocation and considerations for efficient execution.

Challenge                        | Impact on Execution                                                                     | Key Considerations for Allocation
High Memory Latency              | Slow data access delays execution and reduces throughput.                               | Prioritize placing frequently accessed data in faster memory locations.
Limited On-Chip Storage          | Small local memory constrains the amount of data available near compute units.          | Allocate storage efficiently to maximize data availability without exceeding hardware limits.
High Off-Chip Bandwidth Demand   | Frequent access to external memory increases delays and power consumption.              | Reduce unnecessary memory transfers by carefully managing when and how data is moved.
Irregular Memory Access Patterns | Some models require accessing data unpredictably, leading to inefficient memory usage.  | Organize memory layout to align with access patterns and minimize unnecessary data movement.
Model-Specific Memory Needs      | Different models require different allocation strategies to optimize performance.       | Tailor allocation decisions based on the structure and execution characteristics of the workload.
For example, choosing and ordering which loop dimensions to parallelize grows combinatorially; ordering 3 of $d = 6$ candidate dimensions already yields
$$\mathcal{O} = \frac{d!}{(d-3)!} = \frac{6!}{(6-3)!} = 120$$
distinct options.
Even for a single layer, there can be hundreds of valid parallelization strate-
gies, each affecting data synchronization, memory contention, and overall com-
pute efÏciency. Expanding this across multiple layers and model architectures
further magnifies the complexity.
This highlights how even a single layer may have over a billion possible
memory configurations, making manual optimization impractical.
Taken together, for a single layer, there are thousands of ways to order computation loops,
hundreds of parallelization strategies, and an exponentially growing number
of memory placement choices. This combinatorial explosion makes exhaustive
search impractical.
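To get a feel for the numbers (using the illustrative counts above rather than figures for any specific accelerator), a few lines of arithmetic show how quickly the choices compound.

import math

d = 6                                                         # loop dimensions in one layer
orderings = math.factorial(d)                                 # 720 possible loop orders
parallel_picks = math.factorial(d) // math.factorial(d - 3)   # 120 ordered picks of 3 dims
memory_placements = 4 ** 10                                   # e.g., 4 memory levels for 10 tensors

print(orderings * parallel_picks * memory_placements)         # roughly 9e10 combinations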
To overcome this challenge, AI accelerators rely on structured mapping strategies that systematically balance computational efficiency, data locality, and parallel execution. Rather than evaluating every possible configuration, these approaches use a combination of heuristic, analytical, and machine learning-based techniques to find high-performance mappings efficiently.
The key to effective mapping lies in understanding and applying a set of core techniques that optimize data movement, memory access, and computation. These building blocks of mapping strategies provide a structured foundation for efficient execution, which we explore in the next section.
Weight Stationary. The Weight Stationary strategy keeps weights fixed in local
memory, while input activations and partial sums are streamed through the
system. This approach is particularly beneficial in CNNs and matrix multipli-
cations, where the same set of weights is applied across multiple inputs. By
ensuring weights remain stationary, this method reduces redundant memory
fetches, which helps alleviate bandwidth bottlenecks and improves energy efficiency.
A key advantage of the weight stationary approach is that it maximizes weight
reuse, reducing the frequency of memory accesses to external storage. Since
weight parameters are often shared across multiple computations, keeping
them in local memory eliminates unnecessary data movement, lowering the
overall energy cost of computation. This makes it particularly effective for
architectures where weights represent the dominant memory overhead, such
as systolic arrays and custom accelerators designed for machine learning.
A simplified Weight Stationary implementation for matrix multiplication is
illustrated in Listing 11.19.
In weight stationary execution, weights are loaded once into local memory and remain fixed throughout the computation, while inputs are streamed dynamically, thereby reducing redundant memory accesses. At the same time, partial sums are accumulated in an efficient manner that minimizes unnecessary data movement, ensuring that the system maintains high throughput and energy efficiency.
By keeping weights fixed in local storage, memory bandwidth requirements are significantly reduced, as weights do not need to be reloaded for each new computation. Instead, the system efficiently reuses the stored weights across multiple input activations, allowing for high throughput execution. This makes weight stationary dataflow highly effective for workloads with heavy weight reuse patterns, such as CNNs and matrix multiplications.
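Since Listing 11.19 is not reproduced here, the following minimal Python sketch conveys the same idea for a small matrix multiplication: each weight row is loaded into "local" storage once and reused across every streamed input (the array shapes and names are illustrative).

import numpy as np

inputs = np.random.randn(64, 256)      # streamed activations (batch x in_features)
weights = np.random.randn(256, 128)    # stationary weights (in_features x out_features)
outputs = np.zeros((64, 128))

for k in range(weights.shape[0]):
    w_row = weights[k, :]                      # load one weight row into local storage once
    for n in range(inputs.shape[0]):           # stream every input past the stationary weights
        outputs[n, :] += inputs[n, k] * w_row  # accumulate partial sums

assert np.allclose(outputs, inputs @ weights)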
Input Stationary. The Input Stationary strategy keeps input activations fixed
in local memory, while weights and partial sums stream through the system.
This approach is particularly effective for batch processing, transformer models,
and sequence-based architectures, where input activations are reused across
multiple computations. By ensuring that activations remain in local memory, this method reduces redundant input fetches, improving data locality and minimizing memory traffic.
A key advantage of the Input Stationary approach is that it maximizes input
reuse, reducing the frequency of memory accesses for activations. Since many
models, especially those in natural language processing (NLP) and recommen-
dation systems, process the same input data across multiple computations,
keeping inputs stationary eliminates unnecessary memory transfers, thereby
lowering energy consumption. This strategy is particularly useful when dealing
with large batch sizes, where a single batch of input activations contributes to
multiple weight transformations.
A simplified Input Stationary implementation for matrix multiplication is
illustrated in Listing 11.21.
This implementation follows the core principles of input stationary execution:
• Input activations are loaded into local memory and remain fixed during
computation.
• Weights are streamed dynamically, ensuring efÏcient application across
multiple inputs.
• Partial sums are accumulated and written out, optimizing memory band-
width usage.
By keeping input activations stationary, this strategy minimizes redundant
memory accesses to input data, significantly reducing external memory band-
width requirements. This is particularly beneficial in transformer architectures,
where each token in an input sequence is used across multiple attention heads
and layers. Additionally, in batch processing scenarios, keeping input activa-
tions in local memory improves data locality, making it well-suited for fully
connected layers and matrix multiplications.
However, while Input Stationary reduces memory traffic for activations, it introduces trade-offs in weight and output movement. Since weights must be streamed dynamically while inputs remain fixed, the efficiency of this approach depends on how well weights can be delivered to the computational units without causing stalls. Additionally, partial sums must be accumulated efficiently before being written back to memory, which may require additional buffering mechanisms.
The Input Stationary strategy is most effective for workloads where input
activations exhibit high reuse, and memory bandwidth for inputs is a critical
constraint. It is commonly employed in transformers, recurrent networks, and
batch processing workloads, where structured input reuse leads to significant
performance improvements. However, for models where output accumulation
is more critical, alternative dataflow strategies, such as Output Stationary, may
provide better trade-offs.
𝐼(0, 0, 0), 𝐼(0, 0, 1), 𝐼(0, 0, 2), 𝐼(0, 1, 0), 𝐼(0, 1, 1),
𝐼(0, 1, 2), 𝐼(0, 2, 0), 𝐼(0, 2, 1), 𝐼(0, 2, 2), …
Each row is stored contiguously, meaning all pixel values in the first row
are placed sequentially in memory before moving on to the second row. This
ordering is advantageous because CPUs and cache hierarchies are optimized for
sequential memory access. When data is accessed in a row-wise fashion, such as when applying element-wise operations like activation functions or basic arithmetic transformations, memory fetches are efficient, and cache utilization is maximized (Sodani 2015).
𝐼(0, 0, 0), 𝐼(1, 0, 0), 𝐼(2, 0, 0), 𝐼(0, 1, 0), 𝐼(1, 1, 0), 𝐼(2, 1, 0), … ,
𝐼(0, 0, 1), 𝐼(1, 0, 1), 𝐼(2, 0, 1), … , 𝐼(0, 0, 2), 𝐼(1, 0, 2), 𝐼(2, 0, 2), …
In this format, all red channel values for the entire image are stored first, followed by all green values, and then all blue values. This ordering allows hardware accelerators to efficiently load and process data across channels in parallel, which is crucial for convolution operations and SIMD (Single Instruction, Multiple Data) execution models (Chetlur et al. 2014).
The advantage of channel-major layout becomes clear when performing
convolutions in machine learning models. Convolutional layers process images
by applying a shared set of filters across all channels. When the data is stored
in a channel-major format, a convolution kernel can load an entire channel efficiently, reducing the number of scattered memory fetches. This reduces
memory latency, improves throughput, and enhances data locality for matrix
multiplications, which are fundamental to machine learning workloads.
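In PyTorch, for example, the same tensor can be held in either ordering; the sketch below compares the default layout, which keeps each channel plane contiguous, with the channels-last format, which interleaves channels per pixel, by inspecting the resulting strides.

import torch

x = torch.randn(8, 3, 224, 224)                   # NCHW tensor, channel planes contiguous
x_cl = x.to(memory_format=torch.channels_last)    # same values, channels interleaved per pixel

print(x.stride())      # (150528, 50176, 224, 1)
print(x_cl.stride())   # (150528, 1, 672, 3)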
import torch
## Input tensor
X = torch.randn(1024, 1024).cuda()
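The operations that followed in the original listing are not reproduced here; the lines below are a hypothetical continuation showing how a chain of unfused element-wise operations materializes one intermediate tensor per step.

A = torch.relu(X)        # intermediate 1: full 1024x1024 tensor
B = A * 2.0              # intermediate 2
C = B + 1.0              # intermediate 3
Y = torch.tanh(C)        # final result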
Even though only the final result 𝑌 is needed, three additional intermediate
tensors consume extra memory without contributing to final output storage.
This excessive memory usage limits scalability and wastes memory bandwidth,
particularly in AI accelerators where minimizing data movement is critical.
Kernel Fusion for Memory Efficiency. Kernel fusion is a key optimization technique that aims to minimize intermediate memory writes, reducing the memory footprint and bandwidth consumption of machine learning workloads (Zhihao Jia, Zaharia, and Aiken 2018).
Kernel fusion involves merging multiple computation steps into a single, op-
timized operation, eliminating the need for storing and reloading intermediate
tensors. Instead of executing each layer or element-wise operation separately, in
which each step writes its output to memory before the next step begins, fusion
enables direct data propagation between operations, keeping computations
within high-speed registers or local memory.
A common machine learning sequence might involve applying a nonlinear
activation function (e.g., ReLU), followed by batch normalization, and then
scaling the values for input to the next layer. In a naïve implementation, each
of these steps generates an intermediate tensor, which is written to memory,
read back, and then modified again:
$$X' = \text{ReLU}(X), \qquad X'' = \text{BatchNorm}(X'), \qquad Y = \alpha \cdot X'' + \beta$$
With kernel fusion, these operations are combined into a single computation
step, allowing the entire transformation to occur without generating unneces-
sary intermediate tensors:
$$Y = \alpha \cdot \text{BatchNorm}(\text{ReLU}(X)) + \beta$$
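In practice, compilers and frameworks perform this fusion automatically. As one hedged example, PyTorch's torch.compile (available in PyTorch 2.x) can fuse an element-wise chain like the one above into fewer kernels; the module and constants below are illustrative.

import torch
import torch.nn as nn

bn = nn.BatchNorm1d(512).eval()
alpha, beta = 0.5, 0.1

def scaled_norm_relu(x):
    # ReLU -> BatchNorm -> scale-and-shift, written as separate operations
    return alpha * bn(torch.relu(x)) + beta

fused = torch.compile(scaled_norm_relu)   # the compiler may fuse these into one kernel
y = fused(torch.randn(32, 512))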
Figure: Tiled matrix multiplication, in which an Mtile × Ktile block of the A matrix and a Ktile × Ntile block of the B matrix produce block (m, n) of the C matrix.
for i in range(N):
    for j in range(N):
        for k in range(N):
            # Repeatedly fetches A[i, k] and B[k, j] from memory
            C[i, j] += A[i, k] * B[k, j]
At first glance, this approach seems correct—it computes the desired result
and follows the mathematical definition. However, the issue lies in how memory
is accessed. Every time the innermost loop runs, it fetches an element from
matrix 𝐴 and matrix 𝐵 from memory, performs a multiplication, and updates
an element in matrix 𝐶. Because matrices are large, the processor frequently
reloads the same values from memory, even though they were just used in
previous computations.
This unnecessary data movement is expensive. Fetching values from main
memory (DRAM) is hundreds of times slower than accessing values stored in
on-chip cache or registers. If the same values must be reloaded multiple times
instead of being stored in fast memory, execution slows down significantly.
Tiling Methods. While the general principle of tiling remains the same, which
involves partitioning large computations into smaller subproblems to improve
memory reuse, there are different ways to apply tiling based on the structure of
the computation and hardware constraints. The two primary tiling strategies
are spatial tiling and temporal tiling. These strategies optimize different aspects
of computation and memory access, and in practice, they are often combined
to achieve the best performance.
Spatial Tiling. Spatial tiling focuses on partitioning data structures into smaller
blocks that fit within the fast memory of the processor. This approach ensures
that each tile is fully processed before moving to the next, reducing redundant
memory accesses. Spatial tiling is widely used in operations such as matrix
multiplication, convolutions, and attention mechanisms in transformer models.
Spatial tiling is illustrated in Listing 11.26, where the computation proceeds
over blocks of the input matrices.
Temporal Tiling. While spatial tiling optimizes how data is partitioned, temporal
tiling focuses on reorganizing the computation itself to improve data reuse
over time. Many machine learning workloads involve operations where the
same data is accessed repeatedly across multiple iterations. Without temporal
tiling, this often results in redundant memory fetches, leading to inefficiencies.
Temporal tiling, also known as loop blocking, restructures the computation to
ensure that frequently used data stays in fast memory for as long as possible
before moving on to the next computation.
A classic example where temporal tiling is beneficial is convolutional op-
erations, where the same set of weights is applied to multiple input regions.
Without loop blocking, these weights might be loaded from memory multiple
times for each computation. With temporal tiling, the computation is reordered
so that the weights remain in fast memory across multiple inputs, reducing
unnecessary memory fetches and improving overall efficiency.
Listing 11.27 illustrates a simplified example of loop blocking in matrix mul-
tiplication.
for i in range(0, N, TILE_SIZE):
    for j in range(0, N, TILE_SIZE):
        for k in range(0, N, TILE_SIZE):
            # Load small tiles of A and B into fast local storage, then reuse them
            A_tile = A[i:i+TILE_SIZE, k:k+TILE_SIZE]
            B_tile = B[k:k+TILE_SIZE, j:j+TILE_SIZE]
            for ii in range(TILE_SIZE):
                for jj in range(TILE_SIZE):
                    for kk in range(TILE_SIZE):
                        C[i+ii, j+jj] += A_tile[ii, kk] * B_tile[kk, jj]
Temporal tiling improves performance by ensuring that the data loaded into
fast memory is used multiple times before being evicted. In this implemen-
tation, small tiles of matrices 𝐴 and 𝐵 are explicitly loaded into temporary
storage before performing computations, reducing memory fetch overhead.
This restructuring allows the computation to process an entire tile before mov-
ing to the next, thereby reducing the number of times data must be loaded from
slower memory.
This technique is particularly useful in workloads where certain values are
used repeatedly, such as convolutions, recurrent neural networks (RNNs), and
self-attention mechanisms in transformers. By applying loop blocking, AI
accelerators can significantly reduce memory stalls and improve execution
throughput.
Tiling Challenges and Trade-offs. While tiling significantly improves perfor-
mance by optimizing memory reuse and reducing redundant memory accesses,
it introduces several challenges and trade-offs. Selecting the right tile size is a critical decision, as it directly affects computational efficiency and memory bandwidth usage. If the tile size is too small, the benefits of tiling diminish, as
memory fetches still dominate execution time. On the other hand, if the tile size
is too large, it may exceed the available fast memory, causing cache thrashing
and performance degradation.
Load balancing is another key concern. In architectures such as GPUs and
TPUs, computations are executed in parallel across thousands of processing
units. If tiles are not evenly distributed, some units may remain idle while others
are overloaded, leading to suboptimal utilization of computational resources.
Effective tile scheduling ensures that parallel execution remains balanced and efficient.
Data movement overhead is also an important consideration. Although
tiling reduces the number of slow memory accesses, transferring tiles between
different levels of memory still incurs a cost. This is especially relevant in
hierarchical memory systems, where accessing data from cache is much faster
than accessing it from DRAM. Efficient memory prefetching and scheduling
strategies are required to minimize latency and ensure that data is available
when needed.
Beyond spatial and temporal tiling, hybrid approaches combine elements
of both strategies to achieve optimal performance. Hybrid tiling adapts to
workload-specific constraints by dynamically adjusting tile sizes or reordering
computations based on real-time execution conditions. For example, some
AI accelerators use spatial tiling for matrix multiplications while employing
temporal tiling for weight reuse in convolutional layers.
In addition to tiling, there are other methods for optimizing memory usage
and computational efÏciency. Techniques such as register blocking, double
buffering, and hierarchical tiling extend the basic tiling principles to further
optimize execution. AI compilers and runtime systems, such as TensorFlow
XLA, TVM, and MLIR, automatically select tiling strategies based on hardware
constraints, allowing for fine-tuned performance optimization without manual
intervention.
Table 11.16 provides a comparative overview of spatial, temporal, and hybrid tiling approaches, highlighting their respective benefits and trade-offs.

Table 11.16: Comparative analysis of spatial, temporal, and hybrid tiling strategies.

Aspect             | Spatial Tiling (Data Tiling)                                       | Temporal Tiling (Loop Blocking)                                         | Hybrid Tiling
Primary Goal       | Reduce memory accesses by keeping data in fast memory longer      | Increase data reuse across loop iterations                              | Adapt dynamically to workload constraints
Optimization Focus | Partitioning data structures into smaller, memory-friendly blocks | Reordering computations to maximize reuse before eviction               | Balancing spatial and temporal reuse strategies
Memory Usage       | Improves cache locality and reduces DRAM access                   | Keeps frequently used data in fast memory for multiple iterations       | Minimizes data movement while ensuring high reuse
Common Use Cases   | Matrix multiplications, CNNs, self-attention in transformers      | Convolutions, recurrent neural networks (RNNs), iterative computations  | AI accelerators with hierarchical memory, mixed workloads
Optimization Technique       | CNNs                     | Transformers          | MLPs              | Rationale
Dataflow Strategy            | Weight Stationary        | Activation Stationary | Weight Stationary | CNNs reuse filters across spatial locations; Transformers reuse activations (KV-cache); MLPs reuse weights across batches.
Memory-Aware Tensor Layouts  | NCHW (Channel-Major)     | NHWC (Row-Major)      | NHWC              | CNNs favor channel-major for convolution efficiency; Transformers and MLPs prioritize row-major for fast memory access.
Kernel Fusion                | Convolution + Activation | Fused Attention       | GEMM Fusion       | CNNs optimize convolution+activation fusion; Transformers fuse attention mechanisms; MLPs benefit from fused matrix multiplications.
Tiling for Memory Efficiency | Spatial Tiling           | Temporal Tiling       | Blocked Tiling    | CNNs tile along spatial dimensions; Transformers use loop blocking to improve sequence memory efficiency; MLPs use blocked tiling for large matrix multiplications.
This table highlights that each machine learning model benefits from a dif-
ferent combination of optimization techniques, reinforcing the importance of
tailoring execution strategies to the computational and memory characteristics
of the workload.
In the following sections, we explore how these optimizations apply to each
network type, explaining how CNNs, Transformers, and MLPs leverage specific
mapping strategies to improve execution efficiency and hardware utilization.
Given the size of input images and feature maps, tiling is necessary to ensure that computations fit within fast memory hierarchies. Spatial tiling, where input feature maps are processed in smaller subregions, allows for efficient utilization of on-chip memory while avoiding excessive off-chip memory transfers. This technique ensures that input activations, weights, and intermediate outputs remain within high-speed caches or shared memory as long as possible, reducing memory stalls and improving overall performance.
Together, these optimizations ensure that CNNs make efficient use of available compute resources by maximizing weight reuse, optimizing memory access patterns, reducing redundant memory writes, and structuring computation to fit within fast memory constraints.
Table 11.18: Traditional vs. machine learning compilers and their optimization priorities.

Aspect                  | Traditional Compiler                                         | Machine Learning Compiler
Input Representation    | Linear program code (C, Python)                              | Computational graph (ML models)
Execution Model         | Sequential or multi-threaded execution                       | Massively parallel tensor-based execution
Optimization Priorities | Instruction scheduling, loop unrolling, register allocation  | Graph transformations, kernel fusion, memory-aware execution
Memory Management       | Stack and heap memory allocation                             | Tensor layout transformations, tiling, memory-aware scheduling
Target Hardware         | CPUs (general-purpose execution)                             | GPUs, TPUs, and custom accelerators
Compilation Output      | CPU-specific machine code                                    | Hardware-specific execution plan (kernels, memory scheduling)
Heuristic-based selection relies on predefined rules that allow the compiler to make fast, reliable decisions about which kernel to use without requiring extensive analysis.
Profile-guided selection takes a more dynamic approach, benchmarking
different kernel options and choosing the one that performs best for a given
workload. TVM, an open-source AI compiler, uses AutoTVM to empirically
evaluate kernel performance, tuning execution strategies based on real-world
execution times. By testing different kernels before deployment, profile-guided
selection helps ensure that operations are assigned to the most efficient implementation under actual execution conditions.
Another approach, cost model-based selection, relies on performance predic-
tions to estimate execution time and memory consumption for various kernels
before choosing the most efficient one. MLIR, a compiler infrastructure de-
signed for machine learning workloads, applies this technique to determine the
most effective tiling and memory access strategies (Lattner et al. 2020). By mod-
eling how different kernels interact with the accelerator’s compute units and
memory hierarchy, the compiler can select the kernel that minimizes execution
cost while maximizing performance.
Many AI compilers also incorporate precision-aware kernel selection, where
the selected kernel is optimized for specific numerical formats such as FP32,
FP16, BF16, or INT8. Training workloads often prioritize higher precision (FP32,
BF16) to maintain model accuracy, whereas inference workloads favor lower
precision (FP16, INT8) to increase speed and reduce power consumption. For
example, an NVIDIA GPU running inference with TensorRT can dynamically
select FP16 or INT8 kernels based on a model’s accuracy constraints. This
trade-off between precision and performance is a key aspect of kernel selection,
especially when deploying models in resource-constrained environments.
Some compilers go beyond static kernel selection and implement adaptive
kernel tuning, where execution strategies are adjusted at runtime based on
the system’s workload and available resources. AutoTVM in TVM measures
kernel performance across different workloads and dynamically refines execu-
tion strategies. TensorRT applies real-time optimizations based on batch size,
memory constraints, and GPU load, adjusting kernel selection dynamically.
Google’s TPU compiler takes a similar approach, optimizing kernel selection
based on cloud resource availability and execution environment constraints.
GPUs often use row-major storage (NHWC format), while TPUs favor channel-
major layouts (NCHW format) to optimize memory coalescing (Martín Abadi, Agarwal, et al. 2016). The compiler automatically transforms tensor layouts based on the expected access patterns of the target hardware, ensuring that memory accesses are aligned for maximum efficiency.
Beyond layout optimization, memory planning also includes buffer alloca-
tion and reuse, where the compiler minimizes memory footprint by reusing
intermediate storage whenever possible. Deep learning workloads generate
many temporary tensors, such as activations and gradients, which can quickly
overwhelm on-chip memory if not carefully managed. Instead of allocating
new memory for each tensor, the compiler analyzes the computation graph to
identify opportunities for buffer reuse, ensuring that intermediate values are stored and overwritten efficiently (G. A. Jones 2018).
Another critical aspect of memory planning is minimizing data movement
between different levels of the memory hierarchy. AI accelerators typically
have a mix of high-speed on-chip memory (such as caches or shared SRAM)
and larger, but slower, external DRAM. If tensor data is repeatedly moved be-
tween these memory levels, the model may become memory-bound, reducing computational efficiency. To prevent this, compilers use tiling strategies that
break large computations into smaller, memory-friendly chunks, allowing exe-
cution to fit within fast, local memory and reducing the need for costly off-chip
memory accesses.
AI runtimes must also manage how to allocate, reuse, and optimize large tensors, ensuring that memory access patterns align with accelerator-friendly execution. Poor memory management in AI workloads can lead to performance bottlenecks, particularly due to excessive off-chip memory transfers and inefficient cache usage.
Moreover, AI runtimes are inherently designed for adaptability. While tra-
ditional runtimes often follow a mostly static execution plan, AI workloads
typically operate in highly variable execution environments, such as cloud-
based accelerators or multi-tenant hardware. As a result, AI runtimes must
continuously adjust batch sizes, reallocate compute resources, and manage
real-time scheduling decisions to maintain high throughput and minimize
execution delays.
These distinctions demonstrate why AI runtimes require fundamentally
different execution strategies compared to traditional software runtimes. Rather
than simply managing CPU processes, AI runtimes must oversee large-scale
tensor execution, multi-device coordination, and real-time workload adaptation
to ensure that machine learning models can run efficiently under diverse and
ever-changing deployment conditions.
Additionally, batch size can influence kernel selection. For workloads that
handle a mix of small and large batches, the AI runtime may choose a latency-
optimized kernel for small batches and a throughput-optimized kernel for
large-scale batch processing. This adjustment ensures that the model continues
to operate efficiently across different execution scenarios, without the need for
manual tuning.
dedicated memory and compute resources, but they must efficiently share data and synchronize execution.
A common example is NVIDIA DGX systems, which integrate multiple GPUs connected via NVLink [8] or PCIe [9]. This architecture enables workloads to be split across GPUs, typically using data parallelism (where each GPU processes a different batch of data) or model parallelism (where different GPUs handle different parts of a neural network) (Ben-Nun and Hoefler 2019).
As illustrated in Figure 11.10, NVSwitch interconnects enable high-speed communication between GPUs, reducing bottlenecks in distributed training. However, scaling up the number of GPUs introduces new challenges. Cross-GPU communication bandwidth, memory consistency, and workload scheduling become critical constraints, particularly for large-scale models requiring frequent data exchanges. Unlike chiplets, which leverage high-speed die-to-die interconnects, discrete GPUs rely on external links, incurring higher latency and synchronization overhead.

[8] NVLink: A high-speed interconnect that enables faster data transfers between GPUs, reducing communication bottlenecks.
[9] PCIe (Peripheral Component Interconnect Express): A common interface for connecting high-speed components; however, it typically offers lower bandwidth compared to NVLink for GPU-to-GPU communication.
Figure 11.10: Multi-GPU system topology, in which CPUs and system RAM connect to the GPUs over PCIe for host-to-device copies and result retrieval, while NVSwitch provides the high-speed GPU-to-GPU interconnect.
11.9.0.4 Wafer-Scale AI
At the frontier of AI scaling, wafer-scale integration represents a paradigm
shift—abandoning traditional multi-chip architectures in favor of a single, mas-
sive AI processor. Rather than partitioning computation across discrete chips,
this approach treats an entire silicon wafer as a unified compute fabric, eliminating the inefficiencies of inter-chip communication.
As shown in Figure 11.12, Cerebras’ Wafer-Scale Engine (WSE) processors
break away from the historical transistor scaling trends of CPUs, GPUs, and
TPUs. While these architectures have steadily increased transistor counts along
an exponential trajectory, WSE introduces an entirely new scaling paradigm,
integrating trillions of transistors onto a single wafer—far surpassing even the
most advanced GPUs and TPUs. With WSE-3, this trajectory continues, pushing
wafer-scale AI to unprecedented levels (Systems 2021a).
The fundamental advantage of wafer-scale AI is its ultra-fast, on-die com-
munication. Unlike chiplets, GPUs, or TPU Pods, where data must traverse
physical boundaries between separate devices, wafer-scale AI enables near-
instantaneous data transfer across its vast compute array. This architecture
drastically reduces communication latency, unlocking performance levels that
are unachievable with conventional multi-chip systems.
Figure 11.12: Transistor counts over time, from the Intel 4004 and Intel Pentium through the NVIDIA Tesla V100 and A100 to the Cerebras WSE-3, whose wafer-scale integration departs sharply from the historical scaling trend.
Within a single accelerator, frequently used data is kept close to the compute units in fast local storage such as on-chip SRAM and HBM. Techniques such as tiling, data reuse, and kernel fusion ensure that computations make efficient use of fast local memory.
In multi-chip AI systems, each accelerator manages its own local memory,
which necessitates the explicit allocation of model parameters, activations, and
intermediate data across the devices. Unlike single-chip execution where data
is fetched once and reused, multi-chip setups require deliberate strategies to
minimize redundant data transfers, as data must be communicated between
accelerators. Additionally, when overlapping data is processed by multiple
accelerators, the synchronization of shared data can introduce significant overhead that must be carefully managed to ensure efficient execution.
For instance, in multi-GPU deep learning, gradient synchronization across
GPUs is a memory-intensive operation that must be optimized to avoid network
congestion (Shallue, Lee, et al. 2019). In wafer-scale AI, memory allocation
must account for fault tolerance and redundancy mechanisms, ensuring that
defective regions of the wafer do not disrupt execution.
Thus, while memory allocation in single-chip accelerators focuses on local
cache efficiency, in multi-chip architectures, it must be explicitly coordinated
across accelerators to balance memory bandwidth, minimize redundant trans-
fers, and reduce synchronization overhead.
Execution scheduling in such multi-accelerator systems must coordinate data movement alongside computation, ensuring that each TPU core receives its required data precisely when needed. Therefore, while single-chip execution scheduling is focused largely on maximizing internal parallelism, multi-chip systems require a more holistic approach that explicitly manages communication overhead and synchronizes workload distribution across accelerators.
11.10 Conclusion
The rapid advancement of machine learning has fundamentally reshaped com-
puter architecture and system design, driving the need for specialized hardware.
11.11 Resources
Slides
• Coming soon.
Videos
• Coming soon.
Exercises
• Coming soon.
Chapter 12
Benchmarking AI
Purpose
How can quantitative evaluation reshape the development of machine learning systems,
and what metrics reveal true system capabilities?
The measurement and analysis of AI system performance represent a critical
element in bridging theoretical capabilities with practical outcomes. System-
atic evaluation approaches reveal fundamental relationships between model
behavior, resource utilization, and operational reliability. These measurements
draw out the essential trade-offs across accuracy, efficiency, and scalability, pro-
viding insights that guide architectural decisions throughout the development
lifecycle. These evaluation frameworks establish core principles for assessing
and validating system design choices and enable the creation of robust solu-
tions that meet increasingly complex performance requirements across diverse
deployment scenarios.
Learning Objectives
12.1 Overview
Computing systems continue to evolve and grow in complexity. Understanding
their performance becomes essential to engineer them better. System evaluation
measures how computing systems perform relative to specified requirements
and goals. Engineers and researchers examine metrics like processing speed, re-
source usage, and reliability to understand system behavior under different con-
ditions and workloads. These measurements help teams identify bottlenecks,
optimize performance, and verify that systems meet design specifications.
Standardized measurement forms the backbone of scientific and engineer-
ing progress. The metric system enables precise communication of physical
quantities. Organizations like the National Institute of Standards and Technol-
ogy maintain fundamental measures from the kilogram to the second. This
standardization extends to computing, where benchmarks provide uniform
methods to quantify system performance. Standard performance tests measure
processor operations, memory bandwidth, network throughput, and other com-
puting capabilities. These benchmarks allow meaningful comparison between
different hardware and software configurations.
Machine learning systems present distinct measurement challenges. Unlike
traditional computing tasks, ML systems integrate hardware performance,
algorithmic behavior, and data characteristics. Performance evaluation must
account for computational efficiency and statistical effectiveness. Training time,
model accuracy, and generalization capabilities all factor into system assessment.
The interdependence between computing resources, algorithmic choices, and
dataset properties creates new dimensions for measurement and comparison.
These considerations lead us to define machine learning benchmarking as
follows:
Definition of ML Benchmarking
12.3 AI Benchmarks
The evolution of benchmarks reaches its apex in machine learning, reflecting a
journey that parallels the field’s development towards domain-specific applica-
tions. Early machine learning benchmarks focused primarily on algorithmic
performance, measuring how well models could perform specific tasks (Lecun
et al. 1998). As machine learning applications scaled and computational de-
mands grew, the focus expanded to include system performance and hardware
efficiency (Jouppi, Young, et al. 2017a). Most recently, the critical role of data
quality has emerged as the third essential dimension of evaluation (Gebru et al.
2021b).
What sets AI benchmarks apart from traditional performance metrics is their
inherent variability, introducing accuracy as a fundamental dimension of eval-
uation. Unlike conventional benchmarks, which measure fixed, deterministic
characteristics like computational speed or energy consumption, AI benchmarks
must account for the probabilistic nature of machine learning models. The
same system can produce different results depending on the data it encounters,
making accuracy a defining factor in performance assessment. This distinction
adds complexity, as benchmarking AI systems requires not only measuring raw
computational efficiency but also understanding trade-offs between accuracy,
generalization, and resource constraints.
The growing complexity and ubiquity of machine learning systems demand
comprehensive benchmarking across all three dimensions: algorithmic models,
hardware systems, and training data. This multifaceted evaluation approach
represents a significant departure from earlier benchmarks that could focus
on isolated aspects like computational speed or energy efficiency (Hernandez
and Brown 2020). Modern machine learning benchmarks must address the
sophisticated interplay between these dimensions, as limitations in any one
area can fundamentally constrain overall system performance.
This evolution in benchmark complexity mirrors the field’s deepening under-
standing of what drives machine learning system success. While algorithmic
innovations initially dominated progress metrics, the challenges of deploying
models at scale revealed the critical importance of hardware efficiency (Jouppi
et al. 2021b). Subsequently, high-profile failures of machine learning systems
in real-world deployments highlighted how data quality and representation
fundamentally determine system reliability and fairness (Bender et al. 2021).
Understanding how these dimensions interact has become essential for accu-
rately assessing machine learning system performance, informing development
decisions, and measuring technological progress in the field.
[Figure: ImageNet classification error rates for Baseline, ZFNet, VGGNet, GoogleNet, and ResNet (2010-2015), shown alongside GPU adoption over the same period.]
The figure illustrates the correlation between ImageNet classification error rates and GPU adoption from 2010 to 2014. These results clearly highlight how improved hardware capabilities, combined with algorithmic advances, drove significant progress in computer vision performance.
factors such as background processes, thermal conditions, and power states that
might affect performance measurements. The harness must also provide mech-
anisms for collecting and logging performance metrics without significantly
impacting the system under test.
pend on whether the system must operate in real-time or can process data in
batches.
The benchmark reveals inherent trade-offs between performance metrics
in machine learning systems. For instance, reducing the model size from
270 K parameters might improve processing speed and energy efficiency but
could decrease the 0.86 AUC detection accuracy. Figure 12.4 illustrates how
these interconnected metrics contribute to overall system performance in the
deployment phase.
Whether these measurements constitute a “passing” benchmark depends on
the specific requirements of the intended application. The benchmark frame-
work provides the structure and methodology for consistent evaluation, while
the acceptance criteria must align with deployment constraints and performance
requirements.
12.5.4 Trade-offs
As shown in Table 12.1, different challenges emerge at different stages of an
AI system’s lifecycle. Each benchmarking approach provides unique insights:
micro-benchmarks help engineers optimize specific components like GPU ker-
nel implementations or data loading operations, macro-benchmarks guide
model architecture decisions and algorithm selection, while end-to-end bench-
marks reveal system-level bottlenecks in production environments.
12.6 Training Benchmarks
Training benchmarks allow engineers to assess how different design choices, including model architectures, data
loading mechanisms, hardware configurations, and distributed training strate-
gies, impact performance. These benchmarks are particularly vital as machine
learning systems grow in scale, requiring billions of parameters, terabytes of
data, and distributed computing environments.
For instance, large-scale models like OpenAI’s GPT-3 (T. B. Brown, Mann,
Ryder, Subbiah, Kaplan, et al. 2020), which consists of 175 billion parameters
trained on 45 terabytes of data, highlight the immense computational demands
of training. Benchmarks enable systematic evaluation of the underlying systems
to ensure that hardware and software configurations can meet these demands
efficiently.
Efficient data storage and delivery during training also play a major role in the
training process. For instance, in a machine learning model that predicts bound-
ing boxes around objects in an image, thousands of images may be required.
However, loading an entire image dataset into memory is typically infeasible,
so practitioners rely on data loaders from ML frameworks. Successful model
training depends on timely and efficient data delivery, making it essential to
benchmark tools like data pipelines, preprocessing speed, and storage retrieval
times to understand their impact on training performance.
Hardware selection is another key factor in training machine learning systems,
as it can significantly impact training time. Training benchmarks evaluate CPU,
GPU, memory, and network utilization during the training phase to guide
system optimizations. Understanding how resources are used is essential:
Are GPUs being fully leveraged? Is there unnecessary memory overhead?
Benchmarks can uncover bottlenecks or inefficiencies in resource utilization,
leading to cost savings and performance improvements.
In many cases, using a single hardware accelerator, such as a single GPU, is
insufficient to meet the computational demands of large-scale model training.
Machine learning models are often trained in data centers with multiple GPUs
or TPUs, where distributed computing enables parallel processing across nodes.
Training benchmarks assess how efficiently the system scales across multiple
nodes, manages data sharding, and handles challenges like node failures or
drop-offs during training.
12.6.1 Motivation
From a systems perspective, training machine learning models is a compu-
tationally intensive process that requires careful optimization of resources.
Training benchmarks serve as essential tools for evaluating system efficiency,
identifying bottlenecks, and ensuring that machine learning systems can scale
effectively. They provide a standardized approach to measuring how various
system components, including hardware accelerators, memory, storage, and
network infrastructure, affect training performance.
Training benchmarks enable researchers and engineers to push the state-
of-the-art, optimize configurations, improve scalability, and reduce overall
resource consumption by systematically evaluating these factors. As shown in
Figure 12.6, the performance improvements in progressive versions of MLPerf
Training benchmarks have consistently outpaced Moore’s Law, which demon-
strates that what gets measured gets improved. Using standardized bench-
marking trends allows us to rigorously showcase the rapid evolution of ML
computing.
deep learning models are trained across multiple GPUs or TPUs, requiring
efficient parallelization strategies to ensure that additional computing resources
lead to meaningful performance improvements. Training benchmarks measure
how well a system scales by evaluating system throughput, memory efficiency,
and overall training time as additional computational resources are introduced.
Effective scaling is not always guaranteed. While adding more GPUs or TPUs
should, in theory, reduce training time, issues such as communication over-
head, data synchronization latency, and memory bottlenecks can limit scaling
efficiency. Training benchmarks help identify these challenges by quantifying
how performance scales with increasing hardware resources. A well-designed
system should exhibit near-linear scaling, where doubling the number of GPUs
results in a near-halving of training time. However, real-world inefficiencies
often prevent perfect scaling, and benchmarks provide the necessary insights
to optimize system design accordingly.
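To make the scaling calculation concrete, the sketch below computes scaling efficiency as the ratio of observed speedup to ideal speedup; the GPU counts and training times are hypothetical placeholders rather than measured benchmark results.

```python
# Minimal sketch, assuming hypothetical timing numbers: scaling efficiency is
# the ratio of the speedup actually observed to the ideal (linear) speedup.

def scaling_efficiency(t_single: float, t_multi: float, n_devices: int) -> float:
    """Observed speedup divided by the ideal speedup of n_devices."""
    observed_speedup = t_single / t_multi
    return observed_speedup / n_devices

baseline_hours = 100.0                                # single-GPU training time
measurements = {2: 52.0, 4: 27.5, 8: 15.0, 16: 9.0}   # GPUs -> hours (hypothetical)

for n, hours in sorted(measurements.items()):
    eff = scaling_efficiency(baseline_hours, hours, n)
    print(f"{n:>2} GPUs: speedup {baseline_hours / hours:4.1f}x, "
          f"scaling efficiency {eff:.0%}")
```

A well-scaled system keeps this ratio close to 100%; a sharp drop as devices are added points to communication or synchronization bottlenecks.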
Another crucial factor in training efÏciency is time-to-accuracy, which mea-
sures how quickly a model reaches a target accuracy level. Achieving faster
convergence with fewer computational resources is a key goal in training opti-
mization, and benchmarks help compare different training methodologies to
determine which approaches strike the best balance between speed and accu-
racy. By leveraging training benchmarks, system designers can assess whether
their infrastructure is capable of handling large-scale workloads efficiently
while maintaining training stability and accuracy.
12.6.2 Metrics
Evaluating the performance of machine learning training requires a set of well-
defined metrics that go beyond conventional algorithmic measures. From a
systems perspective, training benchmarks assess how efficiently and effectively
a machine learning model can be trained to a predefined accuracy threshold.
Metrics such as throughput, scalability, and energy efficiency are only mean-
ingful in relation to whether the model successfully reaches its target accuracy.
Without this constraint, optimizing for raw speed or resource utilization may
lead to misleading conclusions.
Training benchmarks, such as MLPerf Training, define specific accuracy
targets for different machine learning tasks, ensuring that performance mea-
surements are made in a fair and reproducible manner. A system that trains a
model quickly but fails to reach the required accuracy is not considered a valid
benchmark result. Conversely, a system that achieves the best possible accuracy
but takes an excessive amount of time or resources may not be practically use-
ful. Effective benchmarking requires balancing speed, efficiency, and accuracy
convergence.
This metric ensures that benchmarking focuses on how quickly and effectively
a system can achieve meaningful results.
Throughput, the number of training samples processed per unit time, is computed as:
$$
T = \frac{N_{\text{samples}}}{T_{\text{train}}}
$$
where $N_{\text{samples}}$ is the total number of training samples processed and $T_{\text{train}}$ is the total training time. However,
throughput alone does not guarantee meaningful results, as a model may
process a large number of samples quickly without necessarily reaching the
desired accuracy.
For example, in MLPerf Training, the benchmark for ResNet-50 may require
reaching an accuracy target like 75.9% top-1 on the ImageNet dataset. A system
that processes 10,000 images per second but fails to achieve this accuracy is
not considered a valid benchmark result, while a system that processes fewer
images per second but converges efficiently is preferable. This highlights why
throughput must always be evaluated in relation to time-to-accuracy rather
than as an independent performance measure.
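A minimal sketch of how a benchmark harness might apply this rule is shown below, reporting throughput only for runs that also meet the accuracy target; the target mirrors the ResNet-50 example above, and all run statistics are hypothetical.

```python
# Minimal sketch of an MLPerf-style validity check: throughput only counts
# when the run also reaches its accuracy target. All numbers are hypothetical.

TARGET_ACC = 0.759  # e.g., ResNet-50 top-1 target on ImageNet

def report(run_name: str, samples_processed: float,
           train_seconds: float, final_accuracy: float) -> None:
    throughput = samples_processed / train_seconds   # T = N_samples / T_train
    status = "valid" if final_accuracy >= TARGET_ACC else "invalid (missed target)"
    print(f"{run_name}: {throughput:,.0f} samples/s, "
          f"time-to-accuracy {train_seconds / 3600:.1f} h, {status}")

report("system_A", samples_processed=1.2e9, train_seconds=30 * 3600, final_accuracy=0.761)
report("system_B", samples_processed=2.0e9, train_seconds=20 * 3600, final_accuracy=0.742)
```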
Resource utilization measures the extent to which hardware components, such as GPUs or TPUs, are actively engaged during training. Low utilization may indicate bottlenecks in data movement, memory access, or inefficient workload scheduling.
For instance, when training BERT on a TPU cluster, researchers observed
that input pipeline inefficiencies were limiting overall throughput. Although
the TPUs had high raw compute power, the system was not keeping them
fully utilized due to slow data retrieval from storage. By profiling the resource
utilization, engineers identified the bottleneck and optimized the input pipeline
using TFRecord and data prefetching, leading to improved performance.
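As a rough illustration of the kind of fix described above, the sketch below builds a TFRecord input pipeline with parallel parsing and prefetching using TensorFlow's tf.data API; the file pattern and feature schema are hypothetical placeholders.

```python
import tensorflow as tf

# Minimal sketch of an optimized TFRecord input pipeline with parallel parsing
# and prefetching, in the spirit of the BERT example above. The file pattern
# and feature schema are hypothetical placeholders.

def parse_example(serialized):
    features = {
        "input_ids": tf.io.FixedLenFeature([128], tf.int64),
        "label": tf.io.FixedLenFeature([], tf.int64),
    }
    return tf.io.parse_single_example(serialized, features)

files = tf.data.Dataset.list_files("gs://my-bucket/train-*.tfrecord")  # hypothetical path
dataset = (
    files.interleave(tf.data.TFRecordDataset,
                     num_parallel_calls=tf.data.AUTOTUNE)
         .map(parse_example, num_parallel_calls=tf.data.AUTOTUNE)
         .shuffle(10_000)
         .batch(256)
         .prefetch(tf.data.AUTOTUNE)   # overlap input preparation with accelerator compute
)
```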
Memory bandwidth is another critical factor, as deep learning models require
frequent access to large volumes of data during training. If memory bandwidth
becomes a limiting factor, increasing compute power alone will not improve
training speed. Benchmarks assess how well models leverage available memory,
ensuring that data transfer rates between storage, main memory, and processing
units do not become performance bottlenecks.
I/O performance also plays a significant role in training efficiency, partic-
ularly when working with large datasets that cannot fit entirely in memory.
Benchmarks evaluate the efficiency of data loading pipelines, including prepro-
cessing operations, caching mechanisms, and storage retrieval speeds. Systems
that fail to optimize data loading can experience significant slowdowns, regard-
less of computational power.
Training time and throughput are often the first metrics considered when
evaluating system performance. Time-to-accuracy, the duration required for
a model to achieve a specified accuracy level, is a practical and widely used
benchmark. Throughput, typically measured in samples per second, provides
insight into how efficiently data is processed during training. For example,
when comparing a ResNet-50 model trained on NVIDIA A100 versus V100
GPUs, the A100 generally offers higher throughput and faster convergence.
However, it is important to ensure that increased throughput does not come
at the expense of convergence quality, especially when reduced numerical
precision (e.g., TF32) is used to speed up computation.
As model sizes continue to grow, scalability becomes a critical performance
dimension. Efficient use of multiple GPUs or TPUs is essential for training
large models such as GPT-3 or T5. In this context, scaling efficiency and com-
munication overhead are key metrics. A system might scale linearly up to
64 GPUs, but beyond that, performance gains may taper off due to increased
synchronization and communication costs. Benchmarking tools that monitor
interconnect bandwidth and gradient aggregation latency can reveal how well
a system handles distributed training.
Resource utilization complements these measures by examining how effec-
tively a system leverages its compute and memory resources. Metrics such as
GPU utilization, memory bandwidth, and data loading efficiency help identify
performance bottlenecks. For instance, a BERT pretraining task that exhibits
only moderate GPU utilization may be constrained by an underperforming
data pipeline. Optimizations like sharding input files or prefetching data into
device memory can often resolve these inefficiencies.
In addition to raw performance, energy efficiency and cost have become
increasingly important considerations. Training large models at scale can con-
sume significant power, raising environmental and financial concerns. Metrics
such as energy consumed per training run and performance per watt (e.g.,
TOPS/W) help evaluate the sustainability of different hardware and system
configurations. For example, while two systems may reach the same accuracy
in the same amount of time, the one that uses significantly less energy may be
preferred for long-term deployment.
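The sketch below shows how energy per training run and performance per watt might be computed from measured power and throughput; the systems and numbers are hypothetical and serve only to illustrate the comparison.

```python
# Minimal sketch: energy per training run and performance per watt for two
# hypothetical systems that reach the same accuracy target.

runs = {
    # name: (average power in watts, training time in hours, throughput in samples/s)
    "system_A": (6500.0, 24.0, 12000.0),
    "system_B": (4200.0, 26.0, 10500.0),
}

for name, (watts, hours, samples_per_s) in runs.items():
    energy_kwh = watts * hours / 1000.0      # energy consumed per training run
    perf_per_watt = samples_per_s / watts    # throughput delivered per watt
    print(f"{name}: {energy_kwh:7.1f} kWh per run, {perf_per_watt:.2f} samples/s/W")
```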
Fault tolerance and robustness address how well a system performs under
non-ideal conditions, which are common in real-world deployments. Training
jobs frequently encounter hardware failures, preemptions, or network instability.
Metrics like checkpoint overhead and recovery success rate provide insight into
the resilience of a training system. In practice, checkpointing can introduce
non-trivial overhead—for example, pausing training every 30 minutes to write
a full checkpoint may reduce overall throughput by 5-10%. Systems must strike
a balance between failure recovery and performance impact.
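A back-of-the-envelope sketch of this trade-off: if checkpoints are written at a fixed interval, the fraction of wall-clock time lost is roughly the write time divided by the interval plus the write time. The values below are hypothetical.

```python
# Minimal sketch: fraction of wall-clock time lost to periodic checkpointing.
# The interval and write-time values are hypothetical.

def checkpoint_overhead(interval_min: float, write_min: float) -> float:
    """Fraction of total time spent writing checkpoints."""
    return write_min / (interval_min + write_min)

for write_min in (1.0, 2.0, 5.0):
    overhead = checkpoint_overhead(interval_min=30.0, write_min=write_min)
    print(f"{write_min:.0f} min checkpoint every 30 min -> "
          f"{overhead:.1%} throughput loss")
```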
Finally, reproducibility and standardization ensure that benchmark results
are consistent, interpretable, and transferable. Even minor differences in software versions, hardware configurations, or system dependencies can lead to divergent results.
12.7 Inference Benchmarks
12.7.1 Motivation
Deploying machine learning models for inference introduces a unique set of
challenges distinct from training. While training optimizes large-scale com-
putation over extensive datasets, inference must deliver predictions efficiently
and at scale in real-world environments. Inference benchmarks provide a sys-
tematic approach to evaluating system performance, identifying bottlenecks,
and ensuring that models can operate effectively across diverse deployment
scenarios.
Unlike training, which typically runs on dedicated high-performance hard-
ware, inference must adapt to varying constraints. A model deployed in a
cloud server might prioritize high-throughput batch processing, while the
same model running on a mobile device must operate under strict latency
and power constraints. On edge devices with limited compute and memory,
optimizations such as quantization and pruning become critical. Benchmarks
help assess these trade-offs, ensuring that inference systems maintain the right
balance between accuracy, speed, and efficiency across different platforms.
Inference benchmarks help answer fundamental questions about model de-
ployment. How quickly can a model generate predictions in real-world condi-
tions? What are the trade-offs between inference speed and accuracy? Can an
inference system handle increasing demand while maintaining low latency?
By evaluating these factors, benchmarks guide optimizations in both hardware
and software to improve overall efficiency (Reddi et al. 2019).
12.7.2 Metrics
Evaluating the performance of inference systems requires a distinct set of
metrics from those used for training. While training benchmarks emphasize
throughput, scalability, and time-to-accuracy, inference benchmarks must focus
on latency, efficiency, and resource utilization in practical deployment settings.
These metrics ensure that machine learning models perform well across dif-
ferent environments, from cloud data centers handling millions of requests to
mobile and edge devices operating under strict power and memory constraints.
Unlike training, where the primary goal is to optimize learning speed, infer-
ence benchmarks evaluate how efficiently a trained model can process inputs
and generate predictions at scale. The following sections describe the most
important inference benchmarking metrics, explaining their relevance and how
they are used to compare different systems.
but instead loaded on demand when needed. This can introduce significant delays, particularly in serverless AI7 environments, where resources are allocated dynamically based on incoming requests. Cold-start performance measures how quickly a system can transition from idle to active execution, ensuring that inference is available without excessive wait times.
7 Serverless AI: A deployment model where inference workloads are executed on demand, eliminating the need for dedicated compute resources but introducing cold-start latency challenges.
Model load time refers to the duration required to load a trained model into memory before it can process inputs. In some cases, particularly on resource-limited devices, models must be reloaded frequently to free up memory for other applications. The time taken for the first inference request is also an important consideration, as it reflects the total delay users experience when interacting with an AI-powered service. Benchmarks help quantify these delays, ensuring that inference systems can meet real-world responsiveness requirements.
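The sketch below illustrates how such delays might be measured in practice, timing model load, the first (cold) request, and steady-state latency percentiles; the dummy model and load routine are stand-ins for a real framework's loading and inference code.

```python
import time
import statistics

# Minimal sketch: measure model load time, first-request (cold) latency, and
# steady-state latency percentiles. A dummy model stands in for a real
# framework's loading and inference code, which would be swapped in practice.

class DummyModel:
    def predict(self, x):
        return sum(x)            # placeholder for real inference work

def load_model(path):
    time.sleep(0.2)              # stand-in for reading weights from disk
    return DummyModel()

sample_input = [0.0] * 1024

t0 = time.perf_counter()
model = load_model("model.bin")  # hypothetical path
load_time = time.perf_counter() - t0

t0 = time.perf_counter()
model.predict(sample_input)      # first request: includes warm-up costs
cold_latency = time.perf_counter() - t0

latencies = []
for _ in range(1000):
    t0 = time.perf_counter()
    model.predict(sample_input)
    latencies.append(time.perf_counter() - t0)

latencies.sort()
p50 = statistics.median(latencies)
p99 = latencies[int(0.99 * len(latencies))]
print(f"load {load_time*1e3:.0f} ms, cold {cold_latency*1e3:.0f} ms, "
      f"p50 {p50*1e6:.0f} us, p99 {p99*1e6:.0f} us")
```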
efficiency and speed but may lead to minor accuracy degradation. A thoughtful
evaluation must balance these trade-offs to align with the intended application.
The deployment environment also plays a significant role in determining
evaluation priorities. Cloud-based systems often prioritize scalability and
adaptability to dynamic workloads, while mobile and edge systems require
careful attention to memory usage and energy efficiency. These differing priori-
ties mean that benchmarks must be tailored to the context of the system’s use,
rather than relying on one-size-fits-all evaluations.
Ultimately, evaluating inference performance requires a holistic approach.
Focusing on a single metric, such as latency or energy efficiency, provides an in-
complete picture. Instead, all relevant dimensions must be considered together
to ensure that the system meets its functional, resource, and performance goals
in a balanced way.
Linear Scaling Assumption. Inference performance does not always scale pro-
portionally with additional resources. Bottlenecks such as memory bandwidth,
thermal limits, or communication overhead can limit the benefits of adding
more GPUs or TPUs. Benchmarks that assume linear scaling behavior may
overestimate system performance, particularly in distributed deployments.
This dramatic range in power requirements, which spans over four orders of
magnitude, presents significant challenges for measurement and benchmark-
ing. Creating a unified methodology requires careful consideration of each
scale’s unique characteristics. For example, accurately measuring microwatt-
level consumption in TinyML devices demands different instrumentation and
techniques than monitoring kilowatt-scale server racks. Any comprehensive
benchmarking framework must accommodate these vastly different scales while
ensuring measurements remain consistent, fair, and reproducible across diverse
hardware configurations.
The diagram is organized into three categories, Tiny, Inference, and Training
examples, each reflecting different measurement scopes based on system archi-
tecture and deployment environment. In TinyML systems, the entire low-power
SoC, including compute, memory, and basic interconnects, typically falls within
the measurement boundary. Inference nodes introduce more complexity, incor-
porating multiple SoCs, local storage, accelerators, and memory, while often
excluding remote storage and off-chip components. Training deployments span
multiple racks, where only selected elements, including compute nodes and
network switches, are measured, while storage systems, cooling infrastructure,
and parts of the interconnect fabric are often excluded.
System-level power measurement offers a more holistic view than measuring
individual components in isolation. While component-level metrics (e.g., ac-
celerator or processor power) are valuable for performance tuning, real-world
ML workloads involve intricate interactions between compute units, memory
systems, and supporting infrastructure. For instance, memory-bound inference
tasks can consume up to 60% of total system power on data movement alone.
Shared infrastructure presents additional challenges. In data centers, re-
sources such as cooling systems and power delivery are shared across work-
loads, complicating attribution of energy use to specific ML tasks. Cooling
alone can account for 20-30% of total facility power consumption, making it a
major factor in energy efficiency assessments (Barroso, Clidaras, and Hölzle
2013). Even at the edge, components like memory and I/O interfaces may serve
both ML and non-ML functions, further blurring measurement boundaries.
Modern hardware also introduces variability through dynamic power man-
agement. Features like dynamic voltage and frequency scaling (DVFS) can cause
power consumption to vary by 30-50% for the same ML model, depending on
system load and concurrent activity.
Finally, support infrastructure, with a particular emphasis on cooling, has a
significant impact on total energy use in large-scale deployments. Data centers
must maintain operational temperatures, typically between 20-25°C, to ensure
system reliability. Cooling overhead is captured in the Power Usage Effective-
ness (PUE) metric, which ranges from 1.1 in highly efficient facilities to over 2.0
in less optimized ones (Barroso, Hölzle, and Ranganathan 2019). Even edge
devices require basic thermal management, with cooling accounting for 5-10%
of overall power consumption.
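As a quick illustration, PUE can be computed directly from facility-level and IT-equipment energy readings; the figures below are hypothetical.

```python
# Minimal sketch: Power Usage Effectiveness (PUE) is total facility energy
# divided by the energy delivered to IT equipment. The readings are hypothetical.

def pue(total_facility_kwh: float, it_equipment_kwh: float) -> float:
    return total_facility_kwh / it_equipment_kwh

print(pue(1320.0, 1200.0))  # 1.10 -> highly efficient facility
print(pue(2400.0, 1200.0))  # 2.00 -> poorly optimized facility
```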
While power measurement techniques, such as SPEC Power, have long ex-
isted for general computing systems (Lange 2009), machine learning workloads
present unique challenges that require specialized measurement approaches.
Machine learning systems exhibit distinct power consumption patterns char-
acterized by phases of intense computation interspersed with data movement
and preprocessing operations. These patterns vary significantly across different
types of models and tasks. A large language model’s power profile looks very
different from that of a computer vision inference task.
Direct power measurement requires careful consideration of sampling rates
and measurement windows. For example, transformer model inference creates
short, intense power spikes during attention computations, requiring high-
frequency sampling (>1 kHz) to capture accurately. In contrast, CNN inference
tends to show more consistent power draw patterns that can be captured with
lower sampling rates. The measurement duration must also account for ML-
specific behaviors like warm-up periods, where initial inferences may consume
more power due to cache population and pipeline initialization.
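The sketch below illustrates the sampling considerations discussed above, integrating a synthetic high-frequency power trace into total energy and capturing short spikes that a slower sampler would average away; the trace is synthetic and purely illustrative.

```python
import random

# Minimal sketch: integrate high-frequency power samples into energy and
# report the peaks that a slow sampler would miss. The trace is synthetic.

SAMPLE_RATE_HZ = 2000                      # >1 kHz, per the discussion above
DT = 1.0 / SAMPLE_RATE_HZ

# Synthetic 2-second trace: 80 W baseline with brief 250 W spike bursts.
samples = []
for i in range(2 * SAMPLE_RATE_HZ):
    spike = 250.0 if (i % 500) < 10 else 0.0
    samples.append(80.0 + spike + random.uniform(-2.0, 2.0))

energy_joules = sum(p * DT for p in samples)          # rectangle-rule integration
avg_watts = energy_joules / (len(samples) * DT)
peak_watts = max(samples)
print(f"avg {avg_watts:.1f} W, peak {peak_watts:.1f} W, energy {energy_joules:.1f} J")
```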
not properly addressed, they can undermine the credibility and usefulness
of benchmarking results. One of the most fundamental issues is incomplete
problem coverage. Many benchmarks, while useful for controlled comparisons,
fail to capture the full diversity of real-world applications. For instance, common
image classification datasets, such as CIFAR-10, contain a limited variety of
images. As a result, models that perform well on these datasets may struggle
when applied to more complex, real-world scenarios with greater variability in
lighting, perspective, and object composition.
Another challenge is statistical insignificance, which arises when benchmark
evaluations are conducted on too few data samples or trials. For example,
testing an optical character recognition (OCR) system on a small dataset may
not accurately reflect its performance on large-scale, noisy text documents.
Without sufficient trials and diverse input distributions, benchmarking results
may be misleading or fail to capture true system reliability.
Reproducibility is also a major concern. Benchmark results can vary signifi-
cantly depending on factors such as hardware configurations, software versions,
and system dependencies. Small differences in compilers, numerical precision,
or library updates can lead to inconsistent performance measurements across
different environments. To mitigate this issue, MLPerf addresses reproducibility
by providing reference implementations, standardized test environments, and
strict submission guidelines. Even with these efforts, achieving true consistency
across diverse hardware platforms remains an ongoing challenge.
A more fundamental limitation of benchmarking is the risk of misalignment
with real-world goals. Many benchmarks emphasize metrics such as speed,
accuracy, and throughput, but practical AI deployments often require balancing
multiple objectives, including power efÏciency, cost, and robustness. A model
that achieves state-of-the-art accuracy on a benchmark may be impractical for
deployment if it consumes excessive energy or requires expensive hardware.
Furthermore, benchmarks can quickly become outdated due to the rapid evo-
lution of AI models and hardware. New techniques may emerge that render
existing benchmarks less relevant, necessitating continuous updates to keep
benchmarking methodologies aligned with state-of-the-art developments.
While these challenges affect all benchmarking efforts, the most pressing
concern is the role of benchmark engineering, which introduces the risk of
over-optimization for specific benchmark tasks rather than meaningful im-
provements in real-world performance.
Similarly, variations in altitude can impact cooling system efÏciency and hard
drive performance due to changes in air pressure.
Operational environmental factors encompass the broader system context in
which benchmarks are executed. This includes background processes running
on the system, network conditions, and power supply stability. The presence
of other active programs or services can compete for computational resources,
potentially altering the performance characteristics of the model under eval-
uation. To ensure the validity and reproducibility of benchmark results, it is
essential to document and control these environmental conditions to the extent
possible. This may involve conducting experiments in temperature-controlled
environments, monitoring and reporting ambient conditions, standardizing
the operational state of benchmark systems, and documenting any background
processes or system loads.
In scenarios where controlling all environmental variables is impractical,
such as in distributed or cloud-based benchmarking, it becomes essential to
report these conditions in detail. This information allows other researchers to
account for potential variations when interpreting or attempting to reproduce
results. As machine learning models are increasingly deployed in diverse real-
world environments, understanding the impact of environmental conditions
on model performance becomes even more critical. This knowledge not only
ensures more accurate benchmarking but also informs the development of
robust models capable of consistent performance across varying operational
conditions.
The need for evolving benchmarks also presents a challenge: stability ver-
sus adaptability. On the one hand, benchmarks must remain stable for long
enough to allow meaningful comparisons over time. If benchmarks change
too frequently, it becomes difficult to track long-term progress and compare
new results with historical performance. On the other hand, failing to update
benchmarks leads to stagnation, where models are optimized for outdated tasks
rather than advancing the field. Striking the right balance between benchmark
longevity and adaptation is an ongoing challenge for the AI community.
Despite these difficulties, evolving benchmarks is essential for ensuring that
AI progress remains meaningful. Without updates, benchmarks risk becoming
detached from real-world needs, leading researchers and engineers to focus
on optimizing models for artificial test cases rather than solving practical chal-
lenges. As AI continues to expand into new domains, benchmarking must keep
pace, ensuring that performance evaluations remain relevant, fair, and aligned
with real-world deployment scenarios.
GPT-3’s training corpus (T. B. Brown, Mann, Ryder, Subbiah, Kaplan, et al.
2020) have pushed the boundaries of model capabilities even further.
However, model benchmarks face significant limitations, particularly in the
era of Large Language Models (LLMs). Beyond the traditional challenge of
models failing in real-world conditions, commonly referred to as the Sim2Real
gap, a new form of benchmark optimization has emerged, analogous to but
distinct from classical benchmark engineering in computer systems. In tra-
ditional systems evaluation, developers would explicitly optimize their code
implementations to perform well on benchmark suites like SPEC or TPC, which
we discussed earlier under “Benchmark Engineering”. In the case of LLMs, this
phenomenon manifests through data rather than code: benchmark datasets
may become embedded in training data, either inadvertently through web-
scale training or deliberately through dataset curation (R. Xu et al. 2024). This
creates fundamental challenges for model evaluation, as high performance
on benchmark tasks may reflect memorization rather than genuine capability.
The key distinction lies in the mechanism: while systems benchmark engineer-
ing occurred through explicit code optimization, LLM benchmark adaptation
can occur implicitly through data exposure during pre-training, raising new
questions about the validity of current evaluation methodologies.
These challenges extend beyond just LLMs. Traditional machine learning
systems continue to struggle with problems of overfitting and bias. The Gen-
der Shades project (Buolamwini and Gebru 2018), for instance, revealed that
commercial facial recognition models performed significantly worse on darker-
skinned individuals, highlighting the critical importance of fairness in model
evaluation. Such findings underscore the limitations of focusing solely on
aggregate accuracy metrics.
Moving forward, the field must fundamentally rethink its approach to benchmark-
ing. This evolution requires developing evaluation frameworks that go beyond
traditional metrics to assess multiple dimensions of model behavior—from
generalization and robustness to fairness and efficiency. Key challenges in-
clude creating benchmarks that remain relevant as models advance, developing
methodologies that can differentiate between genuine capabilities and artificial
performance gains, and establishing standards for benchmark documentation
and transparency. Success in these areas will help ensure that benchmark results
provide meaningful insights about model capabilities rather than reflecting
artifacts of training procedures or evaluation design.
test images, though nearly illegible to humans, were assigned specific labels
during the dataset’s creation in 1994. When models correctly predict these labels,
their apparent superhuman performance may actually reflect memorization of
dataset artifacts rather than true digit recognition capabilities.
These challenges extend beyond individual domains. The provocative ques-
tion “Are we done with ImageNet?” (Beyer et al. 2020) highlights broader
concerns about the limitations of static benchmarks. Models optimized for
fixed datasets often struggle with distribution shifts—real-world changes that
occur after training data collection. This limitation has driven the development
of dynamic benchmarking approaches, such as Dynabench (Kiela et al. 2021),
which continuously evolves test data based on model performance to maintain
benchmark relevance.
Current data benchmarking efforts encompass several critical dimensions.
Label quality assessment remains a central focus, as explored in DataPerf’s
debugging challenge. Initiatives like MSWC (Mazumder et al. 2021) for speech
recognition address bias and representation in datasets. Out-of-distribution
generalization receives particular attention through benchmarks like RxRx and
WILDS (Koh et al. 2021). These diverse efforts reflect a growing recognition
that advancing AI capabilities requires not just better models and systems, but
fundamentally better approaches to data quality assessment and benchmark
design.
12.11 Conclusion
“What gets measured gets improved.” Benchmarking plays a foundational role
in the advancement of AI, providing the essential measurements needed to
track progress, identify limitations, and drive innovation. This chapter has
explored the multifaceted nature of benchmarking, spanning systems, models,
and data, and has highlighted its critical role in optimizing AI performance
across different dimensions.
ML system benchmarks enable optimizations in speed, efficiency, and scal-
ability, ensuring that hardware and infrastructure can support increasingly
complex AI workloads. Model benchmarks provide standardized tasks and
evaluation metrics beyond accuracy, driving progress in algorithmic innovation.
Data benchmarks, meanwhile, reveal key issues related to data quality, bias, and
representation, ensuring that AI models are built on fair and diverse datasets.
While these components, systems, models, and data, are often evaluated
in isolation, future benchmarking efforts will likely adopt a more integrated
approach. By measuring the interplay between system, model, and data bench-
marks, AI researchers and engineers can uncover new insights into the co-
design of data, algorithms, and infrastructure. This holistic perspective will be
essential as AI applications grow more sophisticated and are deployed across
increasingly diverse environments.
Benchmarking is not static—it must continuously evolve to capture new AI
capabilities, address emerging challenges, and refine evaluation methodologies.
As AI systems become more complex and influential, the need for rigorous,
transparent, and socially beneficial benchmarking standards becomes even
more pressing. Achieving this requires close collaboration between indus-
try, academia, and standardization bodies to ensure that benchmarks remain
relevant, unbiased, and aligned with real-world needs.
Ultimately, benchmarking serves as the compass that guides AI progress. By
persistently measuring and openly sharing results, we can navigate toward AI
systems that are performant, robust, and trustworthy. However, benchmarking
must also be aligned with human-centered principles, ensuring that AI serves
society in a fair and ethical manner. The future of benchmarking is already
expanding into new frontiers, including the evaluation of AI safety, fairness, and
generative AI models, which will shape the next generation of AI benchmarks.
These topics, while beyond the scope of this chapter, will be explored further
in the discussion on Generative AI.
12.12 Resources
Slides
Videos
• Coming soon.
Exercises
• Coming soon.
Chapter 13
ML Operations
Learning Objectives
13.1 Overview
Machine Learning Operations (MLOps) is a systematic discipline that integrates
machine learning, data science, and software engineering practices to automate
and streamline the end-to-end ML lifecycle. This lifecycle encompasses data
preparation, model training, evaluation, deployment, monitoring, and ongoing
maintenance. The goal of MLOps is to ensure that ML models are developed,
deployed, and operated reliably, efficiently, and at scale.
To ground the discussion, consider a conventional ML application involving
centralized infrastructure. A ridesharing company may aim to predict real-time
rider demand using a machine learning model. The data science team might
invest significant time designing and training the model. However, when it
comes time to deploy it, the model often needs to be reengineered to align with
the engineering team’s production requirements. This disconnect can introduce
weeks of delay and engineering overhead. MLOps addresses this gap.
By establishing standard protocols, tools, and workflows, MLOps enables
models developed during experimentation to transition seamlessly into pro-
duction. It promotes collaboration across traditionally siloed roles, including
data scientists, ML engineers, and DevOps professionals, by defining interfaces
and responsibilities. MLOps also supports continuous integration and delivery
for ML, allowing teams to retrain, validate, and redeploy models frequently in
response to new data or system conditions.
Returning to the ridesharing example, a mature MLOps practice would allow
the company to continuously retrain its demand forecasting model as new
ridership data becomes available. It would also make it easier to evaluate
alternative model architectures, deploy experimental updates, and monitor model performance in production.
13.2 Historical Context
13.2.1 DevOps
The term DevOps was coined in 2009 by Patrick Debois, a consultant and Agile
practitioner who organized the first DevOpsDays conference in Ghent, Belgium.
DevOps extended the principles of the Agile movement, which emphasized
close collaboration among development teams and rapid, iterative releases, by
bringing IT operations into the fold.
In traditional software pipelines, development and operations teams often
worked in silos, leading to inefficiencies, delays, and misaligned priorities.
DevOps emerged as a response, advocating for shared ownership, infrastructure
as code, and the use of automation to streamline deployment pipelines. Tools
such as Jenkins, Docker, and Kubernetes became foundational to implementing continuous integration and continuous delivery (CI/CD)0 practices.
0 Continuous Integration/Continuous Delivery (CI/CD): Practices that automate the software delivery process to ensure a seamless and frequent release cycle.
DevOps promotes collaboration through automation and feedback loops, aiming to reduce time-to-release and improve software reliability. It established the cultural and technical groundwork for extending similar principles to the ML domain.
13.2.2 MLOps
MLOps builds on the DevOps foundation but adapts it to the specific demands
of ML system development and deployment. While DevOps focuses on in-
tegrating and delivering deterministic software, MLOps must manage non-
deterministic, data-dependent workflows. These workflows span data acquisi-
tion, preprocessing, model training, evaluation, deployment, and continuous
monitoring (see Figure 13.2).
Definition of MLOps
highlighting the need for platforms that offer optimized, modular, and reusable
infrastructure. Together, these challenges form the foundation for MLOps
practices that focus on automation, collaboration, and lifecycle management.
These challenges introduced the need for a new set of tools and workflows
tailored to the ML lifecycle. While DevOps primarily unifies software develop-
ment and IT operations, MLOps requires coordination across a broader set of
stakeholders—data scientists, ML engineers, data engineers, and operations
teams.
MLOps introduces specialized practices such as data versioning, model ver-
sioning, and model monitoring that go beyond the scope of DevOps. It empha-
sizes scalable experimentation, reproducibility, governance, and responsiveness
to evolving data conditions. Table 13.1 summarizes key similarities and differ-
ences between DevOps and MLOps:
MLOps
•••
Deployment
Model Serving
fine-grained access control, making them well-suited for managing both raw
and processed data artifacts. These systems frequently serve as the foundation
for downstream analytics, model development, and deployment workflows.
To transition from raw data to analysis- or inference-ready formats, MLOps
teams construct automated data pipelines. These pipelines perform structured
tasks such as data ingestion, schema validation, deduplication, transformation,
and loading. Orchestration tools including Apache Airflow, Prefect, and dbt
are commonly used to define and manage these workflows. When managed as
code, pipelines support versioning, modularity, and integration with CI/CD
systems.
An increasingly important element of the MLOps data infrastructure is the feature store1. Feature stores, such as Feast and Tecton, provide a centralized repository for storing and retrieving engineered features. These systems serve both batch and online use cases, ensuring that models access the same feature definitions during training and inference, thereby improving consistency and reducing data leakage.
1 Feature Store: A centralized repository for storing, managing, and retrieving feature data used in machine learning models.
Consider a predictive maintenance application in an industrial setting. A
continuous stream of sensor data is ingested and joined with historical mainte-
nance logs through a scheduled pipeline managed in Airflow. The resulting
features, including rolling averages and statistical aggregates, are stored in a
feature store for both retraining and low-latency inference. This pipeline is
versioned, monitored, and integrated with the model registry, enabling full
traceability from data to deployed model predictions.
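A minimal sketch of such a pipeline expressed as an Airflow DAG is shown below; the task bodies, DAG name, and schedule are hypothetical placeholders for project-specific logic.

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

# Minimal sketch of the predictive-maintenance pipeline described above.
# Task bodies are placeholders; the DAG id and schedule are hypothetical.

def ingest_sensor_data(**context):
    ...  # pull the latest sensor readings and maintenance logs

def compute_features(**context):
    ...  # join streams, compute rolling averages and statistical aggregates

def materialize_to_feature_store(**context):
    ...  # write versioned features for both training and online inference

with DAG(
    dag_id="predictive_maintenance_features",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@hourly",
    catchup=False,
) as dag:
    ingest = PythonOperator(task_id="ingest", python_callable=ingest_sensor_data)
    features = PythonOperator(task_id="features", python_callable=compute_features)
    publish = PythonOperator(task_id="publish", python_callable=materialize_to_feature_store)

    ingest >> features >> publish   # linear dependency: ingest, transform, publish
```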
Effective data management in MLOps is not limited to ensuring data quality. It also establishes the operational backbone that enables model reproducibility, auditability, and sustained deployment at scale. Without robust data management, the integrity of downstream training, evaluation, and serving processes cannot be maintained.
Watch on YouTube: Data Pipelines
13.3.1.2 Feature Stores
Feature stores provide an abstraction layer between data engineering and machine learning. Their primary purpose is to enable consistent, reliable access to engineered features across training and inference workflows. In conventional pipelines, feature engineering logic may be duplicated, manually reimplemented, or diverge across environments. This introduces risks of training-serving skew2, data leakage3, and model drift4.
2 Training-serving skew: A discrepancy between model performance during training and inference, often due to differences in data handling.
3 Data leakage: Occurs when information from outside the training dataset is used to create the model, leading to misleadingly high performance.
4 Model drift: The change in model performance over time, caused by evolving underlying data patterns.
Feature stores address these challenges by managing both offline (batch) and online (real-time) feature access in a centralized repository. During training, features are computed and stored in a batch environment, typically in conjunction with historical labels. At inference time, the same transformation logic is applied to fresh data in an online serving system. This architecture ensures that models consume identical features in both contexts, promoting consistency and improving reliability.
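The sketch below shows how this pattern might look with Feast's Python SDK, retrieving the same feature definitions for offline training and online serving; the repository path, feature view name, and entity key are hypothetical.

```python
import pandas as pd
from feast import FeatureStore

# Minimal sketch of training/serving consistency with Feast. The repository
# path, feature view name ("machine_stats"), and entity key ("machine_id")
# are hypothetical and depend on how the feature repository is defined.

store = FeatureStore(repo_path="feature_repo/")
features = ["machine_stats:rolling_avg_temp", "machine_stats:vibration_std"]

# Offline (training): point-in-time correct features joined to historical labels.
entity_df = pd.DataFrame({
    "machine_id": [1, 2, 3],
    "event_timestamp": pd.to_datetime(["2024-05-01", "2024-05-01", "2024-05-02"]),
})
training_df = store.get_historical_features(entity_df=entity_df,
                                             features=features).to_df()

# Online (inference): the same feature definitions served with low latency.
online_features = store.get_online_features(features=features,
                                             entity_rows=[{"machine_id": 1}]).to_dict()
```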
Figure 13.4: MLOps CI/CD diagram, showing a retraining trigger feeding a trained pipeline whose metadata and artifacts are recorded in an ML metadata and artifact repository. Source: HarvardX.
trigger a retraining job when new labeled data becomes available. The resulting
model is evaluated against baseline metrics, and if performance thresholds are
met, it is deployed automatically.
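A minimal sketch of such a promotion gate is shown below; the threshold value and the train, evaluate, and deploy functions are hypothetical placeholders for pipeline-specific implementations.

```python
# Minimal sketch of the promotion gate described above: retrain when new
# labeled data arrives, then deploy only if the candidate beats the baseline.
# The threshold and the train/evaluate/deploy functions are hypothetical.

ACCURACY_THRESHOLD = 0.92   # hypothetical minimum acceptable accuracy

def retrain_and_maybe_deploy(new_data, baseline_metrics,
                             train_fn, evaluate_fn, deploy_fn):
    candidate = train_fn(new_data)
    metrics = evaluate_fn(candidate)

    meets_floor = metrics["accuracy"] >= ACCURACY_THRESHOLD
    beats_baseline = metrics["accuracy"] >= baseline_metrics["accuracy"]

    if meets_floor and beats_baseline:
        deploy_fn(candidate)        # e.g., register and roll out the model
        return "deployed"
    return "rejected"               # keep serving the current model
```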
The increasing availability of cloud-based infrastructure has further expanded the reach of model training. Cloud providers offer managed services6 that provision high-performance computing resources, including GPU and TPU accelerators, on demand. Depending on the platform, teams may construct their own training workflows or rely on fully managed services such as Vertex AI Fine Tuning, which support automated adaptation of foundation models to new tasks. Nonetheless, hardware availability, regional access restrictions, and cost constraints remain important considerations when designing cloud-based training systems.
6 Managed services: In cloud computing, managed services involve third-party providers handling infrastructure, application functionalities, and operations.
As an illustrative example, consider a data scientist developing a convolu-
tional neural network (CNN) for image classification using a PyTorch notebook.
The fastai library is used to simplify model construction and training. The
notebook trains the model on a labeled dataset, computes performance metrics,
and tunes hyperparameters such as learning rate and architecture depth. Once
validated, the training script is version-controlled and incorporated into a re-
training pipeline that is periodically triggered based on data updates or model
performance monitoring.
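A minimal sketch of this workflow using the fastai high-level API is shown below; the dataset path and hyperparameter values are hypothetical, and in practice the script would live under version control and be invoked by the retraining pipeline.

```python
from fastai.vision.all import (ImageDataLoaders, Resize, vision_learner,
                               resnet34, accuracy)

# Minimal sketch of the notebook workflow described above, using fastai.
# The dataset path and hyperparameters are hypothetical placeholders.

dls = ImageDataLoaders.from_folder(
    "data/product_images/",      # hypothetical labeled dataset (one folder per class)
    valid_pct=0.2,
    item_tfms=Resize(224),
)

learn = vision_learner(dls, resnet34, metrics=accuracy)
learn.fine_tune(5, base_lr=1e-3)             # epochs and learning rate are tunable
learn.export("models/cnn_classifier.pkl")    # artifact handed to the deployment stage
```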
Through standardized workflows, versioned environments, and automated
orchestration, MLOps enables the model training process to transition from ad
hoc experimentation to a robust, repeatable, and scalable system. This not only
accelerates development but also ensures that trained models meet production
standards for reliability, traceability, and performance.
[Figure: feature store architecture, with incoming data routed to both an online store and an offline store.]
metrics are monitored to assess system stability and user impact. For instance,
an e-commerce platform may deploy a new recommendation model to 5% of
web traffic and observe metrics such as click-through rate, latency, and pre-
diction accuracy. Only after the model demonstrates consistent and reliable
performance is it promoted to full production.
Cloud-based ML platforms further support model evaluation by enabling
experiment logging, request replay, and synthetic test case generation. These
capabilities allow teams to evaluate different models under identical condi-
tions, facilitating comparisons and root-cause analysis. Tools such as Weights
and Biases automate aspects of this process by capturing training artifacts,
recording hyperparameter configurations, and visualizing performance metrics
across experiments. These tools integrate directly into training and deployment
pipelines, improving transparency and traceability.
While automation is central to MLOps evaluation practices, human oversight
remains essential. Automated tests may fail to capture nuanced performance
issues, such as poor generalization on rare subpopulations or shifts in user
behavior. Therefore, teams often combine quantitative evaluation with qual-
itative review, particularly for models deployed in high-stakes or regulated
environments.
In summary, model evaluation within MLOps is a multi-stage process that
bridges offline testing and live system monitoring. It ensures that models not
only meet technical benchmarks but also behave predictably and responsibly
under real-world conditions. These evaluation practices reduce deployment
risk and help maintain the reliability of machine learning systems over time.
systems may process tens of trillions of inference queries per day (C.-J. Wu
et al. 2019). Meeting such demand requires careful design to balance latency,
scalability, and robustness.
To address these challenges, production-grade serving frameworks have
emerged. Tools such as TensorFlow Serving, NVIDIA Triton Inference Server,
and KServe provide standardized mechanisms for deploying, versioning, and
scaling machine learning models across heterogeneous infrastructure. These
frameworks abstract many of the lower-level concerns, allowing teams to focus
on system behavior, integration, and performance targets.
Model serving architectures are typically designed around three broad paradigms:
1. Online Serving, which provides low-latency, real-time predictions for
interactive systems such as recommendation engines or fraud detection.
2. Offline Serving, which processes large batches of data asynchronously,
typically in scheduled jobs used for reporting or model retraining.
3. Near-Online (Semi-Synchronous) Serving, which offers a balance between
latency and throughput, appropriate for scenarios like chatbots or semi-
interactive analytics.
Together, these strategies form the foundation of robust model serving sys-
tems. When effectively integrated, they enable machine learning applications
to meet performance targets while maintaining system-level efficiency and
scalability.
Figure 13.6: ML system components. The ML code itself is only one part of a larger system that also includes configuration, data collection, data verification, feature extraction, resource management, process management, serving infrastructure, and analysis tools. Source: Sculley et al. (2015).
The sections that follow describe key categories of technical debt unique to ML
systems. Each subsection highlights common sources, illustrative examples, and strategies for mitigation.
[Figure: impacts of correction cascades across machine learning lifecycle stages, from problem statement and data collection through model selection, training, evaluation, and deployment.]
These tools enable early detection of unused imports, broken interfaces, and
type mismatches. However, ML systems typically lack equivalent tooling for
analyzing data dependencies, which include everything from feature genera-
tion scripts and data joins to external data sources and labeling conventions.
Without such tools, changes to even a single feature or schema can ripple across
a system without warning.
Two common forms of data dependency debt are unstable inputs and un-
derutilized inputs. Unstable inputs refer to data sources that change over time,
whether in content, structure, or availability, leading to inconsistent model
behavior. A model trained on one version of a feature may produce unexpected
results when that feature’s distribution or encoding changes. Underutilized
inputs refer to data elements included in training pipelines that have little or no
impact on model performance. These features increase complexity, slow down
processing, and increase the surface area for bugs, yet provide little return on
investment.
One approach to managing unstable dependencies is to implement robust
data versioning. By tracking which data snapshot was used for training a given
model, teams can reproduce results and isolate regressions. However, version-
ing also introduces overhead: multiple versions must be stored, managed, and
tested for staleness. For underutilized inputs, a common strategy is to run
leave-one-feature-out evaluations, where features are systematically removed
to assess their contribution to model performance. This analysis can guide
decisions about whether to simplify the feature set or deprecate unused data
streams.
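A minimal sketch of a leave-one-feature-out analysis is shown below, assuming a pandas DataFrame of features and scikit-learn-style estimators; the model and scoring choices are placeholders for whatever the production pipeline actually uses.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Minimal sketch of a leave-one-feature-out analysis. X is assumed to be a
# pandas DataFrame of features and y the label column; the model and scoring
# choices are placeholders.

def leave_one_feature_out(X, y, make_model=lambda: RandomForestClassifier(n_estimators=100)):
    baseline = cross_val_score(make_model(), X, y, cv=5).mean()
    report = {}
    for feature in X.columns:
        score = cross_val_score(make_model(), X.drop(columns=[feature]), y, cv=5).mean()
        report[feature] = baseline - score   # large drop => feature carries real signal
    return baseline, report

# Features whose removal barely changes the score are candidates for deprecation.
```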
Addressing data dependency debt requires both architectural discipline and
appropriate tooling. ML systems must be designed with traceability in mind—
recording not just what data was used, but where it came from, how it was
transformed, and how it affected model behavior. For example, consider an
e-commerce platform that includes a “days since last login” feature in its churn
prediction model. If the meaning of this feature changes, for instance, if a
platform redesign results in users being automatically logged in through a
companion app, the input distribution will shift, potentially degrading model
performance. Without explicit tracking and validation of this data dependency,
the issue might go unnoticed until accuracy metrics decline in production.
As systems scale, unexamined data dependencies like these become a major
source of brittleness and drift. Investing in structured data practices early in
the lifecycle, including schema validation, lineage tracking, and dependency
testing, can help prevent these issues from compounding over time.
Despite these risks, not all early-stage debt is harmful. The key distinction
lies in whether the system is designed to support evolution. Techniques such
as using modular code, isolating configuration from logic, and containerizing
experimental environments allow teams to move quickly without sacrificing
future maintainability. Abstractions, including shared data access layers and
feature transformation modules, can be introduced incrementally as patterns
stabilize.
To manage early-stage debt effectively, teams should adopt the principle of
flexible foundations: designing for change without over-engineering. This
means identifying which components are likely to evolve and introducing
appropriate boundaries and interfaces early on. As the system matures, natural
inflection points emerge—opportunities to refactor or re-architect without
disrupting existing workflows.
Accepting some technical debt in the short term is often a rational tradeoff.
The challenge is ensuring that such debt is intentional, tracked, and revisited
before it becomes entrenched. By investing in adaptability from the beginning,
ML teams can balance early innovation with long-term sustainability.
ways. This entanglement illustrates undeclared consumer debt and the risks of
skipping strict interface governance in ML-enabled safety-critical systems.
13.4.11 Summary
Technical debt in machine learning systems is both pervasive and distinct
from debt encountered in traditional software engineering. While the original
metaphor of financial debt highlights the tradeoff between speed and long-term
cost, the analogy falls short in capturing the full complexity of ML systems.
In machine learning, debt often arises not only from code shortcuts but also
from entangled data dependencies, poorly understood feedback loops, fragile
pipelines, and configuration sprawl. Unlike financial debt, which can be explic-
itly quantified, ML technical debt is largely hidden, emerging only as systems
scale, evolve, or fail.
This chapter has outlined several forms of ML-specific technical debt, each
rooted in different aspects of the system lifecycle. Boundary erosion undermines
modularity and makes systems difficult to reason about. Correction cascades
illustrate how local fixes can ripple through a tightly coupled workflow. Unde-
clared consumers and feedback loops introduce invisible dependencies that
challenge traceability and reproducibility. Data and configuration debt reflect
the fragility of inputs and parameters that are poorly managed, while pipeline
and change adaptation debt expose the risks of inflexible architectures. Early-
stage debt reminds us that even in the exploratory phase, decisions should be
made with an eye toward future extensibility.
The common thread across all these debt types is the need for system-level
thinking. ML systems are not just code—they are evolving ecosystems of
data, models, infrastructure, and teams. Managing technical debt requires treating these ecosystems holistically, with practices that span data management, modeling, infrastructure, and organizational process.
Figure 13.8: Comparison of model-centric and data-centric AI approaches. Model-centric AI focuses on improving architectures, while data-centric AI emphasizes enhancing dataset quality. Both approaches are complementary in optimizing AI performance.
MLOps provides the structure and practices necessary to align these special-
ized roles around a shared objective: delivering reliable, scalable, and maintain-
able machine learning systems in production environments. From designing
robust data pipelines to deploying and monitoring models in live systems,
effective MLOps depends on collaboration across disciplines including data
engineering, statistical modeling, software development, infrastructure man-
agement, and project coordination.
13.5.1 Roles
Table 13.3 introduces the key roles that participate in MLOps and outlines
their primary responsibilities. Understanding these roles not only clarifies the
scope of skills required to support production ML systems but also helps frame
the collaborative workflows and handoffs that drive the operational success of
machine learning at scale.
Table 13.3: MLOps roles and responsibilities across the machine learning lifecycle.

Role | Primary Focus | Core Responsibilities Summary | MLOps Lifecycle Alignment
Data Engineer | Data preparation and infrastructure | Build and maintain pipelines; ensure quality, structure, and lineage of data | Data ingestion, transformation
Data Scientist | Model development and experimentation | Formulate tasks; build and evaluate models; iterate using feedback and error analysis | Modeling and evaluation
ML Engineer | Production integration and scalability | Operationalize models; implement serving logic; manage performance and retraining | Deployment and inference
DevOps Engineer | Infrastructure orchestration and automation | Manage compute infrastructure; implement CI/CD; monitor systems and workflows | Training, deployment, monitoring
Project Manager | Coordination and delivery oversight | Align goals; manage schedules and milestones; enable cross-team execution | Planning and integration
Responsible AI Lead | Ethics, fairness, and governance | Monitor bias and fairness; enforce transparency and compliance standards | Evaluation and governance
Security & Privacy Engineer | System protection and data integrity | Secure data and models; implement privacy controls; ensure system resilience | Data handling and compliance
Data engineers' responsibilities extend to provisioning compute clusters and maintaining metadata catalogs that document data
schemas, lineage, and access controls. To ensure reproducibility and gover-
nance, data engineers implement dataset versioning, maintain historical snap-
shots, and enforce data retention and auditing policies.
For example, in a manufacturing application, data engineers may construct
an Airflow pipeline that ingests time-series sensor data from programmable
logic controllers (PLCs)10 on the factory floor. The raw data is cleaned, joined with product metadata, and aggregated into statistical features such as rolling averages and thresholds. The processed features are stored in a Snowflake data warehouse, where they are consumed by downstream modeling and inference workflows.
10. Programmable Logic Controller (PLC): An industrial computer used to control manufacturing processes, such as robotic devices or assembly lines.
Through their design and maintenance of robust data infrastructure, data engineers enable the consistent and efficient delivery of high-quality data.
Their contributions ensure that machine learning systems are built on reliable
inputs, supporting reproducibility, scalability, and operational stability across
the MLOps pipeline.
To illustrate this responsibility in practice, Listing 13.1 shows a simplified
example of a daily Extract-Transform-Load (ETL) pipeline implemented using
Apache Airflow. This workflow automates the ingestion and transformation of
raw sensor data, preparing it for downstream machine learning tasks.
Listing 13.1: Code in Practice for a Data Engineer, implementing a daily Extract-
Transform-Load (ETL) pipeline using Apache Airflow to process manufacturing
sensor data.
# Airflow DAG for daily ETL from a manufacturing data source
from airflow import DAG
from airflow.operators.python import PythonOperator
from datetime import datetime

def extract_data():
    import pandas as pd
    # Read simulated PLC sensor logs and stage them as Parquet
    df = pd.read_csv('/data/raw/plc_logs.csv')
    df.to_parquet('/data/staged/sensor_data.parquet')

def transform_data():
    import pandas as pd
    # Compute rolling-average features from the staged sensor data
    df = pd.read_parquet('/data/staged/sensor_data.parquet')
    df['rolling_avg'] = (
        df['temperature']
        .rolling(window=10)
        .mean()
    )
    df.to_parquet('/data/processed/features.parquet')

with DAG(
    dag_id='manufacturing_etl_pipeline',
    schedule_interval='@daily',
    start_date=datetime(2023, 1, 1),
    catchup=False
) as dag:
    extract = PythonOperator(
        task_id='extract',
        python_callable=extract_data
    )
    transform = PythonOperator(
        task_id='transform',
        python_callable=transform_data
    )
    extract >> transform  # run transformation after extraction completes
13.5.1.3 ML Engineers
Machine learning engineers are responsible for translating experimental models
into reliable, scalable systems that can be integrated into real-world applica-
tions. Positioned at the intersection of data science and software engineering,
ML engineers ensure that models developed in research environments can
be deployed, monitored, and maintained within production-grade infrastruc-
ture. Their work bridges the gap between prototyping and operationalization,
enabling machine learning to deliver sustained value in practice.
A core responsibility of ML engineers is to take trained models and encap-
sulate them within modular, maintainable components. This often involves
refactoring code for robustness, implementing model interfaces, and building
application programming interfaces (APIs) that expose model predictions to
downstream systems. Frameworks such as Flask and FastAPI are commonly
used to construct lightweight, RESTful services11 for model inference. To support portability and environment consistency, models and their dependencies are typically containerized using Docker and managed within orchestration systems like Kubernetes.
11. RESTful Services: Web services implementing REST (Representational State Transfer) principles for networked applications.
ML engineers also oversee the integration of models into continuous inte-
gration and continuous delivery (CI/CD) pipelines. These pipelines automate
the retraining, testing, and deployment of models, ensuring that updated models are validated against performance benchmarks before being promoted to production. Practices such as canary deployments, A/B testing, and staged rollouts allow for gradual transitions and reduce the risk of regressions. In the event of model degradation, rollback procedures are used to restore previously validated versions. For example, an ML engineer might receive a compact Keras forecasting model such as the following from the data science team and be responsible for packaging it for production:

# Assumes: from tensorflow.keras import layers, models
model = models.Sequential([
    layers.Input(shape=(30, 5)),  # 30 time steps, 5 features
    layers.LSTM(64),
    layers.Dense(1)
])
Operational efficiency is another key area of focus. ML engineers apply a
range of optimization techniques, including model quantization, pruning, and
batch serving, to meet latency, throughput, and cost constraints. In systems that
support multiple models, they may implement mechanisms for dynamic model
selection or concurrent serving. These optimizations are closely coupled with
infrastructure provisioning, which often includes the configuration of GPUs or
other specialized accelerators.
Post-deployment, ML engineers play a critical role in monitoring model be-
havior. They configure telemetry systems to track latency, failure rates, and
resource usage, and they instrument prediction pipelines with logging and
alerting mechanisms. In collaboration with data scientists and DevOps engi-
neers, they respond to changes in system behavior, trigger retraining workflows,
and ensure that models continue to meet service-level objectives12.
12. Service-Level Objectives (SLOs): Specific measurable characteristics of the SLAs, such as availability, throughput, frequency, response time, or quality.
For example, consider a financial services application where a data science team has developed a fraud detection model using TensorFlow. An ML engineer packages the model for deployment using TensorFlow Serving, configures a REST API for integration with the transaction pipeline, and sets up a CI/CD pipeline in Jenkins to automate updates. They implement logging and monitoring using Prometheus and Grafana, and configure rollback logic to revert to the prior model version if performance deteriorates. This production infrastructure enables the model to operate continuously and reliably under real-world workloads.
Through their focus on software robustness, deployment automation, and
operational monitoring, ML engineers play a pivotal role in transitioning ma-
chine learning models from experimental artifacts into trusted components of
production systems. To illustrate these responsibilities in a practical context,
Listing 13.3 presents a minimal example of a REST API built with FastAPI for
serving a trained TensorFlow model. This service exposes model predictions
for use in downstream applications.
from fastapi import FastAPI, Request
import numpy as np
import tensorflow as tf

app = FastAPI()
model = tf.keras.models.load_model('models/demand_forecast_v1')

@app.post("/predict")
async def predict(request: Request):
    # Parse the JSON payload and reshape it to the model's
    # expected input shape: (batch, 30 time steps, 5 features)
    data = await request.json()
    input_array = np.array(data['input']).reshape(1, 30, 5)
    prediction = model.predict(input_array)
    return {"prediction": float(prediction[0][0])}
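Infrastructure provisioning looks quite different from model code. The Terraform fragment below, the tail of a compute-instance resource definition, provisions a GPU-equipped virtual machine with a Debian boot image and an NVIDIA T4 accelerator, then uses a startup script to install Docker and launch TensorFlow Serving, the kind of infrastructure-as-code work typically handled by DevOps engineers.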
  boot_disk {
    initialize_params {
      image = "debian-cloud/debian-11"
    }
  }

  guest_accelerator {
    type  = "nvidia-tesla-t4"
    count = 1
  }

  metadata_startup_script = <<-EOF
    sudo apt-get update
    sudo apt-get install -y docker.io
    sudo docker run --gpus all -p 8501:8501 tensorflow/serving
  EOF

  tags = ["ml-serving"]
}
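Project managers coordinate rather than build. The JSON snippet below tracks milestones and risks for a churn prediction effort, a lightweight plan of record used to align delivery across the roles described above.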
{
  "project": "Churn Prediction",
  "milestones": [
    {
      "name": "Data Pipeline Ready",
      "due": "2025-05-01",
      "status": "Complete"
    },
    {
      "name": "Model Baseline",
      "due": "2025-05-10",
      "status": "In Progress"
    },
    {
      "name": "Staging Deployment",
      "due": "2025-05-15",
      "status": "Pending"
    },
    {
      "name": "Production Launch",
      "due": "2025-05-25",
      "status": "Pending"
    }
  ],
  "risks": [
    {
      "issue": "Delayed cloud quota",
      "mitigation": "Request early from infra team"
    }
  ]
}
Responsible AI Leads conduct fairness audits and impact assessments to identify potential harms. These assessments are often incorporated into model validation
pipelines to ensure that they are systematically enforced before deployment.
In post-deployment settings, Responsible AI Leads help monitor systems for
drift, bias amplification, and unanticipated behavior. They may also oversee
the creation of documentation artifacts such as model cards or datasheets for
datasets, which serve as tools for transparency and reproducibility. In regulated
sectors, this role collaborates with legal and compliance teams to meet audit
requirements and ensure that deployed models remain aligned with external
mandates.
For example, in a hiring recommendation system, a Responsible AI Lead may
oversee an audit that compares model outcomes across gender and ethnicity,
guiding the team to adjust the training pipeline to reduce disparities while
preserving predictive accuracy. They also ensure that decision rationales are
documented and reviewable by both technical and non-technical stakeholders.
The integration of ethical review and governance into the ML development
process enables the Responsible AI Lead to support systems that are not only
technically robust, but also socially responsible and institutionally accountable.
To illustrate these responsibilities in a practical context, Listing 13.6 presents
an example of using the Aequitas library to audit a model for group-based
disparities. This example evaluates statistical parity across demographic groups
to assess potential fairness concerns prior to deployment.
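# Display group-level disparity and statistical parity results from the
# audit; here, b is assumed to hold the audit's disparity table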
print(b[
    ['attribute_name',
     'attribute_value',
     'disparity',
     'statistical_parity']
])
Listing 13.7: Code in Practice for a Security and Privacy Engineer, applying dif-
ferential privacy during model training to protect sensitive data while enabling
predictive performance.
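The differentially private optimizer referenced in the fragment below can be obtained, for example, from the TensorFlow Privacy library. The following sketch, which assumes that library and uses illustrative hyperparameters, constructs a DP-SGD optimizer that clips per-microbatch gradients and adds calibrated noise:

from tensorflow_privacy.privacy.optimizers.dp_optimizer_keras import (
    DPKerasSGDOptimizer,
)

optimizer = DPKerasSGDOptimizer(
    l2_norm_clip=1.0,       # bound on each microbatch gradient norm
    noise_multiplier=1.1,   # noise scale relative to the clipping bound
    num_microbatches=32,    # must evenly divide the batch size
    learning_rate=0.01,
)
# DP-SGD is normally paired with a per-example (non-reduced) loss so that
# gradients can be clipped individually before aggregation.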
model.compile(
    optimizer=optimizer,
    loss='categorical_crossentropy',
    metrics=['accuracy']
)
Importantly, as Table 13.4 indicates, the boundaries between roles are not
rigid. Effective MLOps practices rely on shared understanding, documenta-
tion, and tools that facilitate communication and coordination across teams.
Encouraging interdisciplinary fluency, including enabling data scientists to
understand deployment workflows and DevOps engineers to interpret model
monitoring metrics, enhances organizational agility and resilience.
As machine learning becomes increasingly central to modern software sys-
tems, roles will continue to adapt in response to emerging tools, methodologies,
and system architectures. Recognizing the dynamic nature of these responsibil-
ities allows teams to allocate resources effectively, design adaptable workflows,
and foster collaboration that is essential for sustained success in production-
scale machine learning.
At intermediate maturity levels, monitoring and retraining workflows become more systematic. Systems at this level can
support limited scale and iteration but still rely heavily on human coordination.
At the highest levels of maturity, ML systems are fully integrated with
infrastructure-as-code, continuous delivery pipelines, and automated mon-
itoring. Data lineage, feature reuse, and model validation are encoded into
the development process. Governance is embedded throughout the system,
allowing for traceability, auditing, and policy enforcement. These environ-
ments support large-scale deployment, rapid experimentation, and adaptation
to changing data and system conditions.
This progression, summarized in Table 13.5, offers a system-level framework
for analyzing ML operational practices. It emphasizes architectural cohesion
and lifecycle integration over tool selection, guiding the design of scalable and
maintainable learning systems.
Figure 13.9: How ML service uptime is supported by an "iceberg" of underlying components to monitor. Above the surface sits service health (uptime); beneath it lie model health (model accuracy, concept drift, model bias, underperforming segments) and data health (data drift, broken pipelines, schema change, data outage).
As such, the case illustrates the importance of robust MLOps prac-
tices, particularly when operating under the constraints of embedded systems.
The redesigned models delivered substantially higher agreement with reference measurements, exceeding the original system's 62% correlation and approaching the clinical benchmark of 82–83%.
These performance gains did not result solely from architectural innovation.
Instead, they reflect the broader impact of a systematic MLOps approach—one
that integrated rigorous data collection, reproducible training pipelines, and
disciplined evaluation practices. This phase underscores the importance of
aligning model development with both application constraints and system-
level reliability, particularly in embedded ML environments where deployment
feasibility is as critical as accuracy.
Minor adjustments that fall within clinician-approved limits can be applied automatically, reducing clinical burden. More significant changes require review and approval
by a healthcare provider. This structure maintains human oversight while
enabling high-frequency, data-driven adaptation of therapies.
By enabling real-time, tailored interventions, including automatic insulin
dosing adjustments based on glucose trends, this loop exemplifies how machine
learning can close the feedback gap between sensing and treatment, allowing
for dynamic, context-aware care outside of traditional clinical settings.
Clinician-AI Loop. The clinician–AI loop introduces a critical layer of human
oversight into the process of AI-assisted therapeutic decision-making. In this
loop, the AI system generates treatment recommendations and presents them
to the clinician along with concise, interpretable summaries of the underlying
patient data. These summaries may include longitudinal trends, sensor-derived
metrics, and contextual factors extracted from the electronic health record20.
20. Electronic Health Record (EHR): A digital system that stores patient health information, used across treatment settings.
For example, an AI model might recommend a reduction in antihypertensive medication dosage for a patient whose blood pressure has remained consistently below target thresholds. The clinician reviews the recommendation in the context of the patient's broader clinical profile and may choose to accept, reject,
or modify the proposed change. This feedback, in turn, contributes to the
continuous refinement of the model, improving its alignment with clinical
practice.
Crucially, clinicians also define the operational boundaries within which the
AI system can autonomously issue recommendations. These constraints ensure
that only low-risk adjustments are automated, while more significant decisions
require human approval. This preserves clinical accountability, supports patient
safety, and enhances trust in AI-supported workflows.
The clinician–AI loop exemplifies a hybrid model of care in which AI aug-
ments rather than replaces human expertise. By enabling efficient review and
oversight of algorithmic outputs, it facilitates the integration of machine intelli-
gence into clinical practice while preserving the role of the clinician as the final
decision-maker.
Patient-Clinician Loop. The patient–clinician loop enhances the quality of
clinical interactions by shifting the focus from routine data collection to higher-
level interpretation and shared decision-making. With AI systems handling
data aggregation and basic trend analysis, clinicians are freed to engage more
meaningfully with patients—reviewing patterns, contextualizing insights, and
setting personalized health goals.
For example, in managing diabetes, a clinician may use AI-summarized data
to guide a discussion on dietary habits and physical activity, tailoring recom-
mendations to the patient’s specific glycemic trends. Rather than adhering to
fixed follow-up intervals, visit frequency can be adjusted dynamically based on
patient progress and stability, ensuring that care delivery remains responsive
and efÏcient.
This feedback loop positions the clinician not merely as a prescriber but as a
coach and advisor, interpreting data through the lens of patient preferences,
21
The partnership formed be- lifestyle, and clinical judgment. It reinforces the therapeutic alliance21 by foster-
tween a clinician and a patient that ing collaboration and mutual understanding—key elements in personalized
enhances treatment effectiveness. and patient-centered care.
By integrating this observational data into subsequent training iterations, the system incrementally improves its predictive accuracy and clinical utility. The
overarching objective is to enable fully personalized, adaptive blood pressure
management that evolves in response to each patient’s physiological and be-
havioral profile.
Patient-AI Loop. The patient-AI loop facilitates timely, personalized medica-
tion adjustments by delivering AI-generated recommendations directly to the
patient through a wearable device or associated mobile application. When the
model identifies a minor dosage modification that falls within a pre-approved
safety envelope, the patient may act on the suggestion independently, enabling
a form of autonomous, yet bounded, therapeutic self-management.
For recommendations involving significant changes to the prescribed regi-
men, the system defers to clinician oversight, ensuring medical accountability
and compliance with regulatory standards. This loop empowers patients to
engage actively in their care while maintaining a safeguard for clinical appro-
priateness.
By enabling personalized, data-driven feedback on a daily basis, the patient-
AI loop supports improved adherence and therapeutic outcomes. It opera-
tionalizes a key principle of ClinAIOps by closing the loop between continuous
monitoring and adaptive intervention, while preserving the patient’s role as an
active agent in the treatment process.
Clinician-AI Loop. The clinician-AI loop ensures medical oversight by plac-
ing healthcare providers at the center of the decision-making process. Clini-
cians receive structured summaries of the patient’s longitudinal blood pressure
patterns, visualizations of adherence behaviors, and relevant contextual data
aggregated from wearable sensors and electronic health records. These in-
sights support efficient and informed review of the AI system's recommended
medication adjustments.
Before reaching the patient, the clinician evaluates each proposed dosage
change, choosing to approve, modify, or reject the recommendation based on
their professional judgment and understanding of the patient’s broader clinical
profile. Furthermore, clinicians define the operational boundaries within which
the AI may act autonomously, specifying thresholds for dosage changes that
can be enacted without direct review.
When the system detects blood pressure trends indicative of clinical risk, including persistent hypotension or a hypertensive crisis22, it generates alerts for immediate clinician intervention. These capabilities preserve the clinician's authority over treatment while enhancing their ability to manage patient care proactively and at scale.
22. Hypertensive Crisis: A severe increase in blood pressure that can lead to stroke, heart attack, or other critical conditions.
This loop exemplifies the principles of accountability, safety, and human-in-the-loop23 governance, ensuring that AI functions as a supportive tool rather than an autonomous agent in therapeutic decision-making.
23. Human-in-the-loop: A model of operation in which human decision-makers are involved directly in the AI decision-making pathway.
Patient-Clinician Loop. As illustrated in Figure 13.11, the patient-clinician loop emphasizes collaboration, context, and continuity in care. Rather than
devoting in-person visits to basic data collection or medication reconciliation,
clinicians engage with patients to interpret high-level trends derived from continuous monitoring data and AI-generated summaries.
As shown in Table 13.6, the key distinction lies in how ClinAIOps integrates
technical systems with human oversight, ethical principles, and care delivery
processes. Rather than replacing clinicians, the framework augments their
capabilities while preserving their central role in therapeutic decision-making.
13.8 Conclusion
The operationalization of machine learning is a complex, systems-oriented
endeavor that extends far beyond training and deploying models. MLOps pro-
vides the methodological and infrastructural foundation for managing the full
lifecycle of ML systems—from data collection and preprocessing to deployment,
monitoring, and continuous refinement. By drawing on principles from soft-
ware engineering, DevOps, and data science, MLOps offers the practices needed
to achieve scalability, reliability, and resilience in real-world environments.
This chapter has examined the core components of MLOps, highlighting
key challenges such as data quality, reproducibility, infrastructure automa-
tion, and organizational coordination. We have emphasized the importance of
operational maturity, where model-centric development evolves into system-
level engineering supported by robust processes, tooling, and feedback loops.
Through detailed case studies in domains such as wearable computing and
healthcare, we have seen how MLOps must adapt to specific operational con-
texts, technical constraints, and stakeholder ecosystems.
As we transition to subsequent chapters, we shift our focus toward emerging
frontiers in operational practice, including on-device learning, privacy and
security, responsible AI, and sustainable systems. Each of these domains in-
troduces unique constraints that further shape how machine learning must be
engineered and maintained in practice. These topics build on the foundation
laid by MLOps, extending it into specialized operational regimes.
Ultimately, operational excellence in machine learning is not a fixed endpoint
but a continuous journey. It requires cross-disciplinary collaboration, rigorous
engineering, and a commitment to long-term impact. By approaching ML sys-
tems through the lens of MLOps, grounded in systems thinking and guided by ethical and societal considerations, we can build solutions that are
not only technically sound but also trustworthy, maintainable, and meaningful
in their real-world applications.
As the chapters ahead explore these evolving dimensions of machine learning
systems, the central lesson remains clear: building models is only the begin-
ning. The enduring challenge and opportunity lies in building systems that are
adaptive, responsible, and effective in the face of complexity, uncertainty, and
change.
13.9 Resources
Slides
Videos
• Video 5
• Video 6
• Video 7
Exercises
• Coming soon.
Chapter 14
On-Device Learning
Purpose
How does enabling learning directly on edge devices reshape machine learning system
design, and what strategies support adaptation under resource constraints?
The shift toward on-device learning marks a significant evolution in the
deployment and maintenance of machine learning systems. Rather than re-
lying exclusively on centralized infrastructure, models are now increasingly
expected to adapt in situ—updating and improving directly on the devices
where they operate. This approach introduces a new design space, where
training must occur within stringent constraints on memory, compute, energy,
and data availability. In these settings, the balance between model adaptability,
system efficiency, and deployment scalability becomes critical. This chapter
examines the architectural, algorithmic, and infrastructure-level techniques
that enable effective learning on the edge, and outlines the principles required to build systems that adapt reliably under these constraints.
Learning Objectives
14.1 Overview
Machine learning systems have traditionally treated model training and model
inference as distinct phases, often separated by both time and infrastructure.
Training occurs in the cloud, leveraging large-scale compute clusters and cu-
rated datasets, while inference is performed downstream on deployed models—
typically on user devices or edge servers. However, this separation is beginning
to erode. Increasingly, devices are being equipped not just to run inference, but
to adapt, personalize, and improve models locally.
On-device learning refers to the process of training or adapting machine
learning models directly on the device where they are deployed. This capability
opens the door to systems that can personalize models in response to user
behavior, operate without cloud connectivity, and respect stringent privacy
constraints by keeping data local. It also introduces a new set of challenges:
devices have limited memory, computational power, and energy. Furthermore,
training data is often sparse, noisy, or non-independent across users. These
limitations necessitate a fundamental rethinking of training algorithms, system
architecture, and deployment strategies.
This chapter explores the principles and systems design considerations under-
pinning on-device learning. It begins by examining the motivating applications
that necessitate learning on the device, followed by a discussion of the unique
hardware constraints introduced by embedded and mobile environments. The
chapter then develops a taxonomy of strategies for adapting models, algorithms,
and data pipelines to these constraints. Particular emphasis is placed on dis-
tributed and collaborative methods, such as federated learning, which enable
decentralized training without direct data sharing. The chapter concludes with
an analysis of outstanding challenges, including issues related to reliability,
system validation, and the heterogeneity of deployment environments.
Wearable and health monitoring devices also present strong use cases. These
systems often rely on real-time data from accelerometers, heart rate sensors,
or electrodermal activity monitors. However, physiological baselines vary
significantly between individuals. On-device learning allows models to adapt to
these baselines over time, improving the accuracy of activity recognition, stress
detection, and sleep staging. Moreover, in regulated healthcare environments,
patient data must remain localized due to privacy laws, further reinforcing the
need for edge-local adaptation.
Wake-word detection and voice interfaces illustrate another critical scenario.
Devices such as smart speakers and earbuds must recognize voice commands with low latency and high reliability.
Figure 14.5 illustrates a pipeline that combines offline pre-training with online
adaptive learning on resource-constrained IoT devices. The system first un-
dergoes meta-training with generic data. During deployment, device-specific
constraints such as data availability, compute, and memory shape the adap-
tation strategy by ranking and selecting layers and channels to update. This
enables efficient on-device learning within limited resource envelopes.
Figure 14.5 (pipeline stages): a pre-trained backbone is meta-trained with generic data (e.g., MiniImageNet); channels and layers are ranked by the multi-objective metric Si; only the selected layers and channels are trained on-device.
Noise and variability further degrade data quality. Embedded systems such as environmental sensors or automotive ECUs6 may experience fluctuations in sensor calibration, environmental interference, or mechanical wear, leading to corrupted or drifting input signals over time. Without centralized validation, these errors may silently degrade learning performance if not detected and filtered appropriately.
6. Electronic Control Unit (ECU): A device that controls one or more of the electrical systems or subsystems in a vehicle.
Finally, data privacy and security concerns are paramount in many on-device
learning applications. Sensitive information, such as health data or user in-
teractions, must be protected from unauthorized access. This requirement
often precludes the use of traditional data-sharing methods, such as upload-
ing raw data to a central server for training. Instead, on-device learning must
rely on techniques that allow for local adaptation without exposing sensitive
information.
For a linear layer
$$y = Wx + b,$$
bias-only adaptation freezes the weights and treats their gradients as zero:
$$\frac{\partial \mathcal{L}}{\partial W} = 0, \qquad \frac{\partial \mathcal{L}}{\partial b} \neq 0,$$
so that only the bias is updated via gradient descent:
$$b \leftarrow b - \eta \frac{\partial \mathcal{L}}{\partial b}$$
This drastically reduces the number of stored gradients and optimizer states,
enabling training to proceed even under memory-constrained conditions. On
embedded devices that lack floating-point units, this reduction can be critical
to enabling on-device learning at all.
The code snippet in Listing 14.1 demonstrates how to implement bias-only
adaptation in PyTorch.
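A minimal sketch of that pattern, assuming a small stand-in network and synthetic data, freezes every parameter except the bias terms and then runs an ordinary training step:

import torch
import torch.nn as nn

# A small network standing in for a deployed, pretrained model
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 2))

# Bias-only adaptation: freeze everything except bias parameters
for name, param in model.named_parameters():
    param.requires_grad = name.endswith("bias")

optimizer = torch.optim.SGD(
    [p for p in model.parameters() if p.requires_grad], lr=1e-2
)
loss_fn = nn.CrossEntropyLoss()

# Synthetic on-device samples for illustration
x, y = torch.randn(8, 16), torch.randint(0, 2, (8,))

optimizer.zero_grad()
loss = loss_fn(model(x), y)
loss.backward()   # gradients are computed only for bias terms
optimizer.step()  # frozen weights stay fixed; only biases move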
This pattern ensures that only bias terms participate in the backward pass and
optimizer update. It is particularly useful when adapting pretrained models to
user-specific or device-local data.
This technique underpins TinyTL, a framework explicitly designed to enable
efficient adaptation of deep neural networks on microcontrollers and other
memory-limited platforms. Rather than updating all network parameters dur-
ing training, TinyTL freezes both the convolutional weights and the batch nor-
malization statistics, training only the bias terms and, in some cases, lightweight
residual components. This architectural shift drastically reduces memory us-
age during backpropagation, since the largest tensors, which are intermediate
activations, no longer need to be stored for gradient computation.
Figure 14.6 illustrates the architectural differences between a standard model
and the TinyTL approach. In the conventional baseline architecture, all layers
are trainable, and backpropagation requires storing intermediate activations
for the full network. This significantly increases the memory footprint, which
quickly becomes infeasible on edge devices with only a few hundred kilobytes
of SRAM.
In contrast, the TinyTL architecture freezes all weights and updates only
the bias terms inserted after convolutional layers. These bias modules are
lightweight and require minimal memory, enabling efficient training with a greatly reduced footprint.
In residual adaptation, the pretrained backbone remains frozen while only the added components are optimized. This modularity makes the approach well-
suited for on-device adaptation in constrained settings, where small updates
must deliver meaningful changes.
A residual adapter augments a hidden representation $h$ with a small learned correction:
$$h' = h + A(h), \qquad A(h) = W_2\,\sigma(W_1 h)$$
Low-rank updates take a related approach, expressing the change to a frozen weight matrix $W \in \mathbb{R}^{m \times n}$ as a product of two thin matrices, $\Delta W = U V^{\top}$, where $U \in \mathbb{R}^{m \times r}$ and $V \in \mathbb{R}^{n \times r}$, with $r \ll \min(m, n)$. This reduces the number of trainable parameters from $mn$ to $r(m + n)$. During adaptation, the new weight is computed as:
$$W_{\text{adapted}} = W_{\text{frozen}} + U V^{\top}$$
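In code, a residual adapter of this form can be implemented as a small bottleneck module applied to a frozen layer's output: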
import torch.nn as nn

class Adapter(nn.Module):
    def __init__(self, dim, bottleneck_dim):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck_dim)
        self.up = nn.Linear(bottleneck_dim, dim)
        self.activation = nn.ReLU()

    def forward(self, h):
        # Residual form: h' = h + A(h) with A(h) = W2 * sigma(W1 * h)
        return h + self.up(self.activation(self.down(h)))
14.4.2.4 Tradeoffs
Residual and low-rank updates strike a balance between expressivity and ef-
ficiency. Compared to bias-only learning, they can model more substantial
deviations from the pretrained task. However, they require more memory and
compute—both for training and inference.
When considering residual and low-rank updates for on-device learning, sev-
eral important tradeoffs emerge. First, these methods consistently demonstrate
superior adaptation quality compared to bias-only approaches, particularly
when deployed in scenarios involving significant distribution shifts11 from the original training data (Quiñonero-Candela et al. 2008). This improved adaptability stems from their increased parameter capacity and ability to learn more complex transformations.
11. Distribution shifts refer to changes in the input data's characteristics, which can affect model performance when different from the training data.
However, this enhanced adaptability comes at a cost. The introduction of
additional layers or parameters inevitably increases both memory requirements
and computational latency during forward and backward passes. While these
increases are modest compared to full model training, they must be carefully
considered when deploying to resource-constrained devices.
Additionally, implementing these adaptation techniques requires system-level support for dynamic computation graphs12 and the ability to selectively inject trainable parameters. Not all deployment environments or inference engines may support such capabilities out of the box.
12. Dynamic Computation Graphs: Structures that allow changes during runtime, enabling models to adapt structures based on input data.
$$\theta_i \leftarrow \begin{cases} \theta_i - \eta \dfrac{\partial \mathcal{L}}{\partial \theta_i}, & \text{if } i \in \mathcal{S} \\ \theta_i, & \text{otherwise} \end{cases}$$
The challenge lies in selecting the optimal subset 𝒮 given memory and com-
pute constraints.
As part of this profiling, layers are ranked by performance gain per unit cost (e.g., per KB of trainable memory).
This layer-wise profiling yields a ranking from which 𝒮 can be constructed
subject to a memory budget.
A concrete example is TinyTrain, a method designed to enable rapid adapta-
tion on-device (C. Deng, Zhang, and Wu 2022). TinyTrain pretrains a model
along with meta-gradients that capture which layers are most sensitive to new
tasks. At runtime, the system dynamically selects layers to update based on
task characteristics and available resources.
This pattern can be extended with profiling logic to select layers based on
contribution scores or hardware profiles, as shown in Listing 14.3.
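A simple sketch of such selection logic in PyTorch, assuming per-layer contribution scores have already been estimated during profiling (for example, from gradient magnitudes), unfreezes the highest-ranked layers until a trainable-parameter budget is exhausted:

import torch.nn as nn

def select_layers_to_update(model: nn.Module, scores: dict, budget: int):
    """Unfreeze the highest-scoring modules within a parameter budget."""
    for p in model.parameters():
        p.requires_grad = False  # freeze everything by default

    modules = dict(model.named_modules())
    spent = 0
    # scores maps module name -> estimated contribution (higher is better)
    for name, _ in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
        n_params = sum(p.numel() for p in modules[name].parameters())
        if spent + n_params > budget:
            continue
        for p in modules[name].parameters():
            p.requires_grad = True
        spent += n_params
    return spent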
14.4.3.5 Tradeoffs
Task-adaptive sparse updates introduce several important system-level consid-
erations that must be carefully balanced. First, the overhead of contribution
analysis, although primarily incurred during pretraining or initial profiling,
represents a non-trivial computational cost. This overhead is typically accept-
able since it occurs offline, but it must be factored into the overall system design
and deployment pipeline.
Second, the stability of the adaptation process becomes critical when working
with sparse updates. If too few parameters are selected for updating, the model
may underfit the target distribution, failing to capture important local varia-
tions. This suggests the need for careful validation of the selected parameter
subset before deployment, potentially incorporating minimum thresholds for
adaptation capacity.
Third, the selection of updateable parameters must account for hardware-
specific characteristics of the target platform. Beyond just considering gradient
magnitudes, the system must evaluate the actual execution cost of updating
specific layers on the deployed hardware. Some parameters might show high
contribution scores but prove expensive to update on certain architectures,
requiring a more nuanced selection strategy that balances statistical utility with
runtime efficiency.
Despite these tradeoffs, task-adaptive sparse updates provide a powerful
mechanism to scale adaptation to diverse deployment contexts, from microcon-
trollers to mobile devices (Levy et al. 2023).
Each adaptation strategy for on-device learning offers a distinct balance between
expressivity, resource efficiency, and implementation complexity. Understand-
ing these tradeoffs is essential when designing systems for diverse deployment
targets—from ultra-low-power microcontrollers to feature-rich mobile proces-
sors.
Bias-only adaptation is the most lightweight approach, updating only scalar
offsets in each layer while freezing all other parameters. This significantly re-
duces memory requirements and computational burden, making it suitable for
devices with tight memory and energy budgets. However, its limited expressiv-
ity means it is best suited to applications where the pretrained model already
captures most of the relevant task features and only minor local calibration is
required.
Residual adaptation, often implemented via adapter modules, introduces
a small number of trainable parameters into the frozen backbone of a neural
network. This allows for greater flexibility than bias-only updates, while still
maintaining control over the adaptation cost. Because the backbone remains
fixed, training can be performed efficiently and safely under constrained condi-
tions. This method supports modular personalization across tasks and users,
making it a favorable choice for mobile settings where moderate adaptation
capacity is needed.
Task-adaptive sparse updates offer the greatest potential for task-specific
finetuning by selectively updating only a subset of layers or parameters based
on their contribution to downstream performance. While this method enables
expressive local adaptation, it requires a mechanism for layer selection, through
profiling, contribution analysis, or meta-training, which introduces additional
complexity. Nonetheless, when deployed carefully, it allows for dynamic trade-
offs between accuracy and efficiency, particularly in systems that experience
large domain shifts or evolving input conditions.
Few-shot adaptation techniques are employed to make use of limited data while minimizing capacity for memorization. Let $D = \{(x_i, y_i)\}_{i=1}^{K}$ denote a $K$-shot dataset of labeled examples
collected on-device. The goal is to update the model parameters 𝜃 to improve
task performance under constraints such as:
• Limited number of gradient steps: 𝑇 ≪ 100
• Constrained memory footprint: ‖𝜃updated ‖ ≪ ‖𝜃‖
• Preservation of prior task knowledge (to avoid catastrophic forgetting)
Keyword spotting (KWS) systems offer a concrete example of few-shot adap-
tation in a real-world, on-device deployment (Warden 2018). These models are
used to detect fixed phrases, including phrases like “Hey Siri” or “OK Google”,
with low latency and high reliability. A typical KWS model consists of a pre-
trained acoustic encoder (e.g., a small convolutional or recurrent network that
transforms input audio into an embedding space) followed by a lightweight clas-
sifier. In commercial systems, the encoder is trained centrally using thousands
of hours of labeled speech across multiple languages and speakers. However,
supporting custom wake words (e.g., “Hey Jarvis”) or adapting to underrepre-
sented accents and dialects is often infeasible via centralized training due to
data scarcity and privacy concerns.
Few-shot adaptation solves this problem by finetuning only the output clas-
sifier or a small subset of parameters, including bias terms, using just a few
example utterances collected directly on the device. For example, a user might
provide 5–10 recordings of their custom wake word. These samples are then
used to update the model locally, while the main encoder remains frozen to
preserve generalization and reduce memory overhead. This enables personal-
ization without requiring additional labeled data or transmitting private audio
to the cloud.
Such an approach is not only computationally efficient, but also aligned
with privacy-preserving design principles. Because only the output layer is
updated, often involving a simple gradient step or prototype computation, the
total memory footprint and runtime compute are compatible with mobile-class
devices or even microcontrollers. This makes KWS a canonical case study
for few-shot learning at the edge, where the system must operate under tight
constraints while delivering user-specific performance.
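A prototype-based variant of this idea, sketched below with an assumed frozen audio encoder, averages the embeddings of a few enrollment utterances into a class prototype and detects the custom wake word by embedding similarity:

import torch
import torch.nn.functional as F

def enroll(encoder, examples):
    """Average the embeddings of a few user recordings into a prototype."""
    with torch.no_grad():
        embeddings = torch.stack([encoder(x) for x in examples])
    return embeddings.mean(dim=0)

def is_wake_word(encoder, prototype, audio, threshold=0.8):
    """Flag a detection when the new embedding is close to the prototype."""
    with torch.no_grad():
        z = encoder(audio)
    return F.cosine_similarity(z, prototype, dim=0) > threshold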
Beyond static few-shot learning, many on-device scenarios benefit from
streaming adaptation, where models must learn incrementally as new data
arrives (Hayes et al. 2020). Streaming adaptation generalizes this idea to contin-
uous, asynchronous settings where data arrives incrementally over time. Let $\{x_t\}_{t=1}^{\infty}$ represent a stream of observations. In streaming settings, the model must update itself after observing each new input, typically without access to prior data, and under bounded memory and compute. The model update can be written generically as:
$$\theta_{t+1} = \theta_t - \eta_t \nabla \mathcal{L}(x_t; \theta_t)$$
where 𝜂𝑡 is the learning rate at time 𝑡. This form of adaptation is sensitive to
noise and drift in the input distribution, and thus often incorporates mecha-
nisms such as learning rate decay, meta-learned initialization, or update gating
to improve stability.
Experience replay addresses this by retaining a small buffer of past examples; at each update, a mini-batch of $k$ stored samples is drawn and used to compute the gradient:
$$\theta_{t+1} = \theta_t - \eta \nabla_\theta \left[ \frac{1}{k} \sum_{i=1}^{k} \mathcal{L}(x_i, y_i; \theta_t) \right]$$
where 𝜃𝑡 are the model parameters, 𝜂 is the learning rate, and ℒ is the loss
function. Over time, this replay mechanism allows the model to reinforce prior
knowledge while incorporating new information.
A practical on-device implementation might use a ring buffer13 to store a small set of compressed feature vectors rather than full input examples. The pseudocode shown in Listing 14.4 illustrates a minimal replay buffer designed for constrained environments.
13. Ring Buffer: A circular buffer that efficiently manages data by overwriting old entries with new ones as space requires.
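A minimal ring-style replay buffer of this kind might look like the following sketch, which stores compressed embeddings and labels in a fixed-size list:

import random

class ReplayBuffer:
    """Fixed-capacity cyclic buffer of (embedding, label) pairs."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.storage = []
        self.position = 0

    def add(self, embedding, label):
        if len(self.storage) < self.capacity:
            self.storage.append((embedding, label))
        else:
            # Overwrite the oldest entry once the buffer is full
            self.storage[self.position] = (embedding, label)
        self.position = (self.position + 1) % self.capacity

    def sample(self, k):
        # Draw a small batch of stored examples for a replay update
        return random.sample(self.storage, min(k, len(self.storage)))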
This implementation maintains a fixed-capacity cyclic buffer, storing com-
pressed representations (e.g., last-layer embeddings) and associated labels.
Such buffers are useful for replaying adaptation updates without violating
memory or energy budgets.
In TinyML applications, experience replay has been applied to problems such
as gesture recognition, where devices must continuously improve predictions
while observing a small number of events per day. Instead of training directly
on the streaming data, the device stores representative feature vectors from
recent gestures and uses them to finetune classification boundaries periodically.
Similarly, in on-device keyword spotting, replaying past utterances can improve
wake-word detection accuracy without the need to transmit audio data off-
device.
While experience replay improves stability in data-sparse or non-stationary
environments, it introduces several tradeoffs. Storing raw inputs may breach
privacy constraints or exceed storage budgets, especially in vision and audio ap-
plications. Replaying from feature vectors reduces memory usage but may limit
the richness of gradients for upstream layers. Write cycles to persistent flash memory, which are frequently necessary for long-term storage on embedded devices, can also raise wear-leveling concerns14. These constraints require careful co-design of memory usage policies, replay frequency, and feature selection strategies, particularly in continuous deployment scenarios.
14. Wear leveling is a technique used in flash memory management to distribute data writes evenly across the memory, prolonging lifespan.
14.5.3 Data Compression
In many on-device learning scenarios, the raw training data may be too large,
noisy, or redundant to store and process effectively. This motivates the use of
compressed data representations, where the original inputs are transformed
into lower-dimensional embeddings or compact encodings that preserve salient
information while minimizing memory and compute costs.
Compressed representations serve two complementary goals. First, they
reduce the footprint of stored data, allowing devices to maintain longer histories
or replay buffers under tight memory budgets (Sanh et al. 2019). Second,
they simplify the learning task by projecting raw inputs into more structured
feature spaces, often learned via pretraining or meta-learning, in which efficient
adaptation is possible with minimal supervision.
One common approach is to encode data points using a pretrained feature
extractor and discard the original high-dimensional input. For example, an
image 𝑥𝑖 might be passed through a convolutional neural network (CNN) to
produce an embedding vector 𝑧𝑖 = 𝑓(𝑥𝑖 ), where 𝑓(⋅) is a fixed feature encoder.
This embedding captures visual structure (e.g., shape, texture, or spatial layout)
in a compact representation, usually ranging from 64 to 512 dimensions, suitable
for lightweight downstream adaptation.
Mathematically, training can proceed over compressed samples (𝑧𝑖 , 𝑦𝑖 ) using
a lightweight decoder or projection head. Let 𝜃 represent the trainable parame-
ters of this decoder model, which is typically a small neural network that maps
from compressed representations to output predictions. As each example is
presented, the model parameters are updated using gradient descent:
$$\theta \leftarrow \theta - \eta \, \nabla_\theta \, \mathcal{L}\big(g(z_i; \theta), y_i\big)$$
Here:
• 𝑧𝑖 is the compressed representation of the 𝑖-th input,
• 𝑦𝑖 is the corresponding label or supervision signal,
• 𝑔(𝑧𝑖 ; 𝜃) is the decoder’s prediction,
• ℒ is the loss function measuring prediction error,
• 𝜂 is the learning rate, and
• ∇𝜃 denotes the gradient with respect to the parameters 𝜃.
This formulation highlights how only a compact decoder model, which has
the parameter set 𝜃, needs to be trained, making the learning process feasible
even when memory and compute are limited.
Advanced approaches go beyond fixed encoders by learning discrete or
sparse dictionaries that represent data using low-rank or sparse coefÏcient
matrices. For instance, a dataset of sensor traces can be factorized as 𝑋 ≈ 𝐷𝐶,
where 𝐷 is a dictionary of basis patterns and 𝐶 is a block-sparse coefÏcient
matrix indicating which patterns are active in each example. By updating only
a small number of dictionary atoms or coefÏcients, the model can adapt with
minimal overhead.
Compressed representations are particularly useful in privacy-sensitive set-
tings, as they allow raw data to be discarded or obfuscated after encoding.
Furthermore, compression acts as an implicit regularizer, smoothing the learn-
ing process and mitigating overfitting when only a few training examples are
available.
In practice, these strategies have been applied in domains such as keyword
spotting, where raw audio signals are first transformed into Mel-frequency
cepstral coefÏcients (MFCCs)—a compact, lossy representation of the power
spectrum of speech. These MFCC vectors serve as compressed inputs for down-
stream models, enabling local adaptation using only a few kilobytes of memory.
Instead of storing raw audio waveforms, which are large and computationally
expensive to process, devices store and learn from these compressed feature
vectors directly. Similarly, in low-power computer vision systems, embeddings
extracted from lightweight CNNs are retained and reused for few-shot learn-
ing. These examples illustrate how representation learning and compression
serve as foundational tools for scaling on-device learning to memory- and
bandwidth-constrained environments.
In practice, devices are typically eligible to participate only when they are plugged in, connected to Wi-Fi, and idle, to avoid interfering with user experience or
depleting battery resources. These criteria determine which subset of the total
population is considered “available” for any given training round.
Beyond these operational filters, devices also differ in their hardware capabil-
ities, data availability, and network conditions. For example, some smartphones
may contain many recent examples relevant to the current task, while others
may have outdated or irrelevant data. Network bandwidth and upload speed
may vary widely depending on geography and carrier infrastructure. As a
result, selecting clients at random can lead to poor coverage of the underlying
data distribution and unstable model convergence.
Moreover, availability-driven selection introduces participation bias: clients
with favorable conditions, including frequent charging, high-end hardware,
and consistent connectivity, are more likely to participate repeatedly, while
others are systematically underrepresented. This can skew the resulting model
toward behaviors and preferences of a privileged subset of the population,
raising both fairness and generalization concerns.
To address these challenges, systems must carefully balance scheduling efficiency with client diversity. A key approach involves using stratified or quota-
based sampling to ensure representative client participation across different
groups. For instance, asynchronous buffer-based techniques allow participating
clients to contribute model updates independently, without requiring synchro-
nized coordination in every round (Nguyen et al. 2021). This model has been
extended to incorporate staleness awareness (Rodio and Neglia 2024) and fair-
ness mechanisms (J. Ma et al. 2024), preventing bias from over-active clients
who might otherwise dominate the training process.
Beyond scheduling, federated learning systems also implement adaptive client selection strategies. These include prioritizing clients with under-
represented data types, targeting geographies or demographics that are less
frequently sampled, and using historical participation data to enforce fairness
constraints. Systems may also incorporate predictive modeling to anticipate
future client availability or success rates, improving training throughput.
Selected clients perform one or more local training steps on their private
data and transmit their model updates to a central server. These updates are
aggregated to form a new global model. Typically, this aggregation is weighted,
where the contributions of each client are scaled, for example, by the number of
local examples used during training, before averaging. This ensures that clients
with more representative or larger datasets exert proportional influence on the
global model.
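A sketch of this weighted aggregation step, assuming each client update arrives as a dictionary of tensors together with its local example count, is shown below (FedAvg-style averaging):

import torch

def weighted_average(client_states, client_sizes):
    """Average client model states, weighting each by its dataset size."""
    total = sum(client_sizes)
    aggregated = {}
    for key in client_states[0]:
        aggregated[key] = sum(
            state[key] * (n / total)
            for state, n in zip(client_states, client_sizes)
        )
    return aggregated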
These scheduling decisions directly impact system performance. They affect
convergence rate, model generalization, energy consumption, and overall user
experience. Poor scheduling can result in excessive stragglers, overfitting to
narrow client segments, or wasted computation. As a result, client scheduling
is not merely a logistical concern—it is a core component of system design in
federated learning, demanding both algorithmic insight and infrastructure-level
coordination.
In its standard form, federated learning optimizes a single global model by minimizing a weighted sum of client objectives:
$$\min_{\theta} \sum_{k=1}^{K} w_k \, \mathcal{L}_k(\theta)$$
where ℒ𝑘 (𝜃) is the local loss on client 𝑘, and 𝑤𝑘 is a weighting factor (e.g.,
proportional to local dataset size). However, this formulation assumes that a
single model 𝜃 can serve all users well. In practice, local loss landscapes ℒ𝑘
often differ significantly across clients, reflecting non-IID data distributions
and varying task requirements.
Personalization modifies this objective to allow each client to maintain its
own adapted parameters 𝜃𝑘 , optimized with respect to both the global model
and local data:
$$\min_{\theta_1, \ldots, \theta_K} \sum_{k=1}^{K} \Big( \mathcal{L}_k(\theta_k) + \lambda \cdot \mathcal{R}(\theta_k, \theta_{\text{global}}) \Big)$$
where $\mathcal{R}$ penalizes deviation of each client's personalized parameters from the global model and $\lambda$ controls the strength of that coupling.
14.8 Challenges
While on-device learning holds significant promise for enabling adaptive, pri-
vate, and efficient machine learning at the edge, its practical deployment in-
troduces a range of challenges that extend beyond algorithm design. Unlike
conventional centralized systems, where training occurs in controlled environ-
ments with uniform hardware and curated datasets, edge systems must contend
with heterogeneity in devices, fragmentation in data, and the absence of cen-
tralized validation infrastructure. These factors give rise to new systems-level
tradeoffs and open questions concerning reliability, safety, and maintainability.
Moreover, regulatory and operational constraints complicate the deployment
of self-updating models in real-world applications. This section explores these
limitations, emphasizing the systemic barriers that must be addressed to make
on-device learning robust, scalable, and trustworthy.
14.8.0.1 Heterogeneity
Federated and on-device learning systems must operate across a vast and di-
verse ecosystem of devices, ranging from smartphones and wearables to IoT
sensors and microcontrollers. This heterogeneity spans multiple dimensions:
hardware capabilities, software stacks, network connectivity, and power avail-
ability. Unlike cloud-based systems, where environments can be standardized
, edge deployments must accommodate broad variation in device capability and operating context. Some devices have continuous power and reliable networking, but still prioritize user-facing responsiveness over background learning. These differences complicate the orchestration of coordinated learning and the scheduling of updates.
Finally, system fragmentation affects reproducibility and testing. With such a
wide range of execution environments, it is difficult to ensure consistent model behavior or to debug failures reliably. This makes monitoring, validation, and rollback mechanisms more critical—but also more difficult to implement
uniformly across the fleet.
Consider a federated learning deployment for mobile keyboards. A high-end
smartphone might feature 8 GB of RAM, a dedicated AI accelerator, and contin-
uous Wi-Fi access. In contrast, a budget device may have just 2 GB of RAM, no
hardware acceleration, and rely on intermittent mobile data. These disparities
influence how long training runs can proceed, how frequently models can be
updated, and even whether training is feasible at all. To support such a range,
the system must dynamically adjust training schedules, model formats, and
compression strategies—ensuring equitable model improvement across users
while respecting each device’s limitations.
In voice applications, for example, local data reflects each user's vocabulary, accent, and speaking style, which results in significant differences across lo-
cal datasets. Some users may issue frequent, clearly enunciated commands,
while others speak infrequently or in noisy environments. These variations
cause device-specific gradients to diverge, especially when training wake-word
detectors or adapting language models locally.
In federated learning deployments for virtual keyboards, the problem is fur-
ther amplified. One user might primarily type in English, another in Hindi, and
a third may switch fluidly between multiple languages. The resulting training
data is highly non-IID—not only in language but also in vocabulary, phrasing,
and typing cadence. A global model trained on aggregated updates may de-
grade if it fails to capture these localized differences, highlighting the need
for adaptive, data-aware strategies that accommodate heterogeneity without
sacrificing collective performance.
Table 14.4: Challenges in on-device learning and their implications for system design and deployment.

Challenge | Root Cause | System-Level Implications
System Heterogeneity | Diverse hardware, software, and toolchains | Limits portability; requires platform-specific tuning
Non-IID and Fragmented Data | Localized, user-specific data distributions | Hinders generalization; increases risk of drift
Limited Observability and Feedback | No centralized testing or logging | Makes update validation and debugging difficult
Resource Contention and Scheduling | Competing demands for memory, compute, and battery | Requires dynamic scheduling and budget-aware learning
Deployment and Compliance Risk | Learning continues post-deployment | Complicates model versioning, auditing, and rollback
14.9 Conclusion
On-device learning is a major shift in the design and operation of machine
learning systems. Rather than relying exclusively on centralized training and
static model deployment, this paradigm enables systems to adapt dynamically
to local data and usage conditions. This shift is motivated by a confluence of
factors—ranging from the need for personalization and privacy preservation
to latency constraints and infrastructure efÏciency. However, it also introduces
a new set of challenges tied to the constrained nature of edge computing plat-
forms.
Throughout this chapter, we explored the architectural and algorithmic strate-
gies that make on-device learning feasible under tight compute, memory, en-
ergy, and data constraints. We began by establishing the motivation for moving
learning to the edge, followed by a discussion of the system-level limitations
that shape practical design choices. A core insight is that no single solution suf-
fices across all use cases. Instead, effective on-device learning systems combine
multiple techniques: minimizing the number of trainable parameters, reducing
runtime costs, leveraging memory-based adaptation, and compressing data
representations for efficient supervision.
14.10 Resources
Slides
• Coming soon.
Videos
• Coming soon.
Exercises
• Coming soon.
Chapter 15
Security & Privacy
Purpose
What principles guide the protection of machine learning systems, and how do security
and privacy requirements shape system architecture?
Protection mechanisms are a fundamental dimension of modern AI system
design. Security considerations expose critical patterns for safeguarding data,
models, and infrastructure while sustaining operational effectiveness. Imple-
menting defensive strategies reveals inherent trade-offs between protection,
performance, and usability—trade-offs that influence architectural decisions
throughout the AI lifecycle. Understanding these dynamics is essential for
creating trustworthy systems, grounding the principles needed to preserve
privacy and defend against adversarial threats while maintaining functionality
in production environments.
Learning Objectives
15.1 Overview
Machine learning systems, like all computational systems, must be designed
not only for performance and accuracy but also for security and privacy. These
concerns shape the architecture and operation of ML systems across their
lifecycle—from data collection and model training to deployment and user
interaction. While traditional system security focuses on software vulnerabil-
ities, network protocols, and hardware defenses, machine learning systems
introduce additional and unique attack surfaces. These include threats to the
data that fuels learning, the models that encode behavior, and the infrastructure
that serves predictions.
Security and privacy mechanisms in ML systems serve roles analogous to
trust and access control layers in classical computing. Just as operating systems
enforce user permissions and protect resource boundaries, ML systems must
implement controls that safeguard sensitive data, defend proprietary models,
and mitigate adversarial manipulation. These mechanisms span software,
hardware, and organizational layers, forming a critical foundation for system
reliability and trustworthiness.
Although closely related, security and privacy address distinct aspects of
protection. Security focuses on ensuring system integrity and availability in
the presence of adversaries. Privacy, by contrast, emphasizes the control and
protection of sensitive information, even in the absence of active attacks. These
concepts often interact, but they are not interchangeable. To effectively design
and evaluate defenses for ML systems, it is essential to understand how these
goals differ, how they reinforce one another, and what distinct mechanisms
they entail.
Security and privacy often function as complementary forces. Security pre-
vents unauthorized access and protects system behavior, while privacy mea-
sures limit the exposure of sensitive information. Their synergy is essential:
strong security supports privacy by preventing data breaches, while privacy-
preserving techniques reduce the attack surface available to adversaries. How-
ever, achieving robust protection on both fronts often introduces trade-offs.
Defensive mechanisms may incur computational overhead, increase system complexity, or reduce model utility.
Table 15.1: How security and privacy concerns manifest differently in machine learning systems. Security focuses on protecting against active threats that seek to manipulate or disrupt system behavior, while privacy emphasizes safeguarding sensitive information from exposure, even in benign operational contexts.

Aspect | Security | Privacy
Primary Goal | Prevent unauthorized access or disruption | Limit exposure of sensitive information
Threat Model | Adversarial actors (external or internal) | Honest-but-curious observers or passive leaks
Typical Concerns | Model theft, poisoning, evasion attacks | Data leakage, re-identification, memorization
Example Attack | Adversarial inputs cause misclassification | Model inversion reveals training data
Representative Defenses | Access control, adversarial training | Differential privacy, federated learning
Relevance to Regulation | Emphasized in cybersecurity standards | Central to data protection laws (e.g., GDPR)
15.3.1 Stuxnet
In 2010, security researchers discovered a highly sophisticated computer worm
later named Stuxnet, which targeted industrial control systems used in Iran’s
Natanz nuclear facility (Farwell and Rohozinski 2011). Stuxnet exploited four
previously unknown “zero-day” vulnerabilities in Microsoft Windows, allow-
ing it to spread undetected through both networked and isolated systems.
Unlike typical malware designed to steal information or perform espionage,
Stuxnet was engineered to cause physical damage. Its objective was to disrupt
uranium enrichment by sabotaging the centrifuges used in the process. Despite
the facility being air-gapped from external networks, the malware is believed
to have entered the system via an infected USB device, demonstrating how
physical access can compromise even isolated environments.
Stuxnet represents a landmark in cybersecurity, revealing how malicious
software can bridge the digital and physical worlds to manipulate industrial
infrastructure. It specifically targeted programmable logic controllers (PLCs)
responsible for automating electromechanical processes, such as controlling the
speed of centrifuges. By exploiting vulnerabilities in the Windows operating
system and the Siemens Step7 software used to program the PLCs, Stuxnet
achieved highly targeted, real-world disruption.
While Stuxnet did not target machine learning systems directly, its relevance
extends to any system where software interacts with physical processes. Ma-
chine learning is increasingly integrated into industrial control, robotics, and
cyber-physical systems, making these lessons applicable to the security of mod-
ern ML deployments. Figure 15.2 illustrates the operation of Stuxnet in greater
detail.
While the devices exploited by Mirai did not include machine learning components, the architectural patterns exposed by this incident are increasingly rel-
evant as machine learning expands into edge computing and Internet of Things
(IoT) devices. Many ML-enabled products, such as smart cameras, voice assis-
tants, and edge analytics platforms, share similar deployment characteristics—
operating on networked devices with limited hardware resources, often man-
aged at scale.
The Mirai botnet highlights the critical importance of basic security hygiene,
including secure credential management, authenticated software updates, and network-level protections.
Model theft arises at the deployment stage of the ML lifecycle, where trained models are exposed through APIs, on-device engines,
or serialized files. This threat sits alongside others, including data poisoning
during training and adversarial attacks during inference, that together span the
full pipeline from data collection to real-time prediction. Understanding the
lifecycle positioning of each threat helps clarify their distinct attack surfaces
and appropriate defenses.
[Figure: attack surfaces across the ML lifecycle, including label manipulation and backdoors during data collection and training, and adversarial examples and membership inference at inference time.]
Machine learning models are not solely passive targets of attack; in some
cases, they can themselves be employed as components of an attack strategy.
Pretrained models, particularly large generative or discriminative networks,
may be adapted to automate tasks such as adversarial example generation,
phishing content synthesis, or protocol subversion. Furthermore, open-source
or publicly accessible models can be fine-tuned for malicious purposes, in-
cluding impersonation, surveillance, or reverse-engineering of secure systems.
This dual-use potential necessitates a broader security perspective—one that
considers models not only as assets to defend but also as possible instruments
of attack.
High-profile legal cases have highlighted the strategic and economic value
of machine learning models. For example, former Google engineer Anthony
Levandowski was accused of stealing proprietary designs from Waymo, includ-
ing critical components of its autonomous vehicle technology, before founding
a competing startup. Such cases illustrate the potential for insider threats to
bypass technical protections and gain access to sensitive intellectual property.
The consequences of model theft extend beyond economic loss. Stolen mod-
els can be used to extract sensitive information, replicate proprietary algorithms,
or enable further attacks. For instance, a competitor who obtains a stolen rec-
ommendation model from an e-commerce platform might gain insights into
customer behavior, business analytics, and embedded trade secrets. This knowl-
edge can also be used to conduct model inversion attacks, where an attacker
attempts to infer private details about the model’s training data (Fredrikson,
Jha, and Ristenpart 2015).
In a model inversion attack, the adversary queries the model through a legit-
imate interface, such as a public API, and observes its outputs. By analyzing
confidence scores or output probabilities, the attacker can optimize inputs to
reconstruct data resembling the model’s training set. For example, a facial recog-
nition model used for secure access could be manipulated to reveal statistical
properties of the employee photos on which it was trained. Similar vulnera-
bilities have been demonstrated in studies on the Netflix Prize dataset, where
researchers were able to infer individual movie preferences from anonymized
data (A. Narayanan and Shmatikov 2006).
Model theft can target two distinct objectives: extracting exact model prop-
erties, such as architecture and parameters, or replicating approximate model
behavior to produce similar outputs without direct access to internal repre-
sentations. Both forms of theft undermine the security and value of machine
learning systems, as explored in the following subsections.
These two attack paths are illustrated in Figure 15.4. In exact model theft,
the attacker gains access to the model’s internal components, including seri-
alized files, weights, and architecture definitions, and reproduces the model
directly. In contrast, approximate model theft relies on observing the model’s
input-output behavior, typically through a public API. By repeatedly querying
the model and collecting responses, the attacker trains a surrogate that mim-
ics the original model’s functionality. While the first approach compromises
the model’s internal design and training investment, the second threatens its
predictive value and can facilitate further attacks such as adversarial example
transfer or model inversion.
[Figure 15.4: exact model theft reconstructs the original model from its internal components, while approximate model theft records API responses and trains a surrogate model to mimic its behavior.]
These attacks typically seek three types of information. The first is the model’s
learned parameters, such as weights and biases. By extracting these parameters,
attackers can replicate the model’s functionality without incurring the cost of
training. This replication allows them to benefit from the model’s performance
while bypassing the original development effort.
The second target is the model’s fine-tuned hyperparameters, including
training configurations such as learning rate, batch size, and regularization
settings. These hyperparameters significantly influence model performance,
and stealing them enables attackers to reproduce high-quality results with
minimal additional experimentation.
Finally, attackers may seek to reconstruct the model’s architecture. This
includes the sequence and types of layers, activation functions, and connec-
tivity patterns that define the model’s behavior. Architecture theft may be
accomplished through side-channel attacks, reverse engineering, or analysis of
observable model behavior. Revealing the architecture not only compromises
intellectual property but also gives competitors strategic insights into the design
choices that provide competitive advantage.
System designers must account for these risks by securing model serializa-
tion formats, restricting access to runtime APIs, and hardening deployment
pipelines. Protecting models requires a combination of software engineering
practices, including access control, encryption, and obfuscation techniques, to
reduce the risk of unauthorized extraction (Tramèr et al. 2016).
In approximate model theft, by contrast, attackers observe the model's inputs and outputs to build a substitute model that performs
similarly on the same tasks.
This type of theft often targets models deployed as services, where the model
is exposed through an API or embedded in a user-facing application. By re-
peatedly querying the model and recording its responses, an attacker can train
their own model to mimic the behavior of the original. This process, often
called model distillation or knockoff modeling, enables attackers to achieve
comparable functionality without access to the original model’s proprietary
internals (Orekondy, Schiele, and Fritz 2019).
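The sketch below illustrates this knockoff workflow on a toy problem with scikit-learn. The victim model, query budget, and all names are assumptions introduced only to show the attack pattern, not a reproduction of any published attack.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Stand-in for a proprietary model that is only reachable through a query interface.
X_train = rng.normal(size=(2000, 5))
y_train = (X_train[:, 0] + 0.5 * X_train[:, 1] > 0).astype(int)
victim = LogisticRegression().fit(X_train, y_train)

def query_api(x):
    # All the attacker observes: predicted labels for chosen inputs.
    return victim.predict(x)

# Attacker: sample queries, record responses, and fit a surrogate model.
X_query = rng.normal(size=(5000, 5))
surrogate = DecisionTreeClassifier(max_depth=5).fit(X_query, query_api(X_query))

X_test = rng.normal(size=(1000, 5))
agreement = (surrogate.predict(X_test) == query_api(X_test)).mean()
print(f"Surrogate matches the victim on {agreement:.1%} of held-out queries")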
Attackers may evaluate the success of behavior replication in two ways. The
first is by measuring the level of effectiveness of the substitute model. This
involves assessing whether the cloned model achieves similar accuracy, preci-
sion, recall, or other performance metrics on benchmark tasks. By aligning the
substitute’s performance with that of the original, attackers can build a model
that is practically indistinguishable in effectiveness, even if its internal structure
differs.
The second is by testing prediction consistency. This involves checking
whether the substitute model produces the same outputs as the original model
when presented with the same inputs. Matching not only correct predictions
but also the original model’s mistakes can provide attackers with a high-fidelity
reproduction of the target model’s behavior. This is particularly concerning
in applications such as natural language processing, where attackers might
replicate sentiment analysis models to gain competitive insights or bypass
proprietary systems.
Approximate behavior theft is particularly challenging to defend against
in open-access deployment settings, such as public APIs or consumer-facing
applications. Limiting the rate of queries, detecting automated extraction
patterns, and watermarking model outputs are among the techniques that can
help mitigate this risk. However, these defenses must be balanced with usability
and performance considerations, especially in production environments.
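As one example of such a defense, the sketch below implements a simple sliding-window query rate limiter. The class name, budget, and window are illustrative assumptions; production systems combine this with authentication, extraction-pattern detection, and watermarking.

import time
from collections import defaultdict, deque

class QueryRateLimiter:
    # Reject clients that exceed a query budget within a sliding time window.
    def __init__(self, max_queries=100, window_seconds=60.0):
        self.max_queries = max_queries
        self.window = window_seconds
        self.history = defaultdict(deque)  # client_id -> recent query timestamps

    def allow(self, client_id):
        now = time.monotonic()
        timestamps = self.history[client_id]
        while timestamps and now - timestamps[0] > self.window:
            timestamps.popleft()          # drop queries outside the window
        if len(timestamps) >= self.max_queries:
            return False                  # budget exhausted
        timestamps.append(now)
        return True

limiter = QueryRateLimiter(max_queries=3, window_seconds=1.0)
print([limiter.allow("client-a") for _ in range(5)])  # last two requests are rejected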
One notable demonstration of approximate model theft focuses on extracting internal components of black-box language models via public APIs. Carlini et al. (2024) show how to reconstruct the final embedding projection matrix of several OpenAI models, including ada, babbage, and gpt-3.5-turbo, using only public API access. By exploiting the low-rank structure of the output projection layer and making carefully crafted queries, they recover the model's hidden dimensionality and replicate the weight matrix up to affine transformations.
While the attack does not reconstruct the full model, it reveals critical internal
architecture parameters and sets a precedent for future, deeper extractions. This
work demonstrated that even partial model theft poses risks to confidentiality
and competitive advantage, especially when model behavior can be probed
through rich API responses such as logit bias and log-probabilities.
Table 15.2: Model theft results from Carlini et al. (2024). The table summarizes the model sizes, number of queries required for dimension extraction, root mean square errors (RMS) for weight matrix extraction, and estimated costs based on OpenAI's API pricing.

Model | Size (Dimension Extraction) | Number of Queries | RMS (Weight Matrix Extraction) | Cost (USD)
OpenAI ada | 1024 ✓ | < 2 × 10^6 | 5 × 10^-4 | $1 / $4
OpenAI babbage | 2048 ✓ | < 4 × 10^6 | 7 × 10^-4 | $2 / $12
OpenAI babbage-002 | 1536 ✓ | < 4 × 10^6 | Not implemented | $2 / $12
OpenAI gpt-3.5-turbo-instruct | Not disclosed | < 4 × 10^7 | Not implemented | $200 / ~$2,000 (estimated)
OpenAI gpt-3.5-turbo-1106 | Not disclosed | < 4 × 10^7 | Not implemented | $800 / ~$8,000 (estimated)
where $f_{D \cup D_p}$ is the model trained on the combined dataset. For targeted attacks, this objective may focus on specific inputs $x_t$ and target labels $y_t$:

\max_{D_p} \mathcal{L}(f_{D \cup D_p}, x_t, y_t)
Consider, for example, an attacker who injects a small number of stop sign images labeled as speed limit signs into the training data.
The attacker’s goal is to subtly shift the model’s decision boundary so that future
stop signs are misclassified as speed limit signs. In this case, the poisoning
data 𝐷𝑝 consists of mislabeled stop sign images, and the attacker’s objective
is to maximize the misclassification of legitimate stop signs 𝑥𝑡 as speed limit
signs 𝑦𝑡 , following the targeted attack formulation above. Even if the model
performs well on other types of signs, the poisoned training process creates a
predictable and exploitable vulnerability.
Data poisoning attacks can be classified based on their objectives and scope of
impact. Availability attacks degrade overall model performance by introducing
noise or label flips that reduce accuracy across tasks. Targeted attacks manip-
ulate a specific input or class, leaving general performance intact but causing
consistent misclassification in select cases. Backdoor attacks embed hidden trig-
gers, which are often imperceptible patterns, that elicit malicious behavior only
when the trigger is present. Subpopulation attacks degrade performance on a
specific group defined by shared features, making them particularly dangerous
in fairness-sensitive applications.
A notable real-world example of a targeted poisoning attack was demon-
strated against Perspective, an online toxicity detection model (Hosseini et
al. 2017). By injecting synthetically generated toxic comments with subtle
misspellings and grammatical errors into the model’s training set, researchers
degraded its ability to detect harmful content. After retraining, the poisoned
model exhibited a significantly higher false negative rate, allowing offensive
language to bypass filters. This case illustrates how poisoned data can exploit
feedback loops in systems that rely on user-generated input, leading to re-
duced effectiveness over time and creating long-term vulnerabilities in content
moderation pipelines.
Mitigating data poisoning threats requires end-to-end security of the data
pipeline, encompassing collection, storage, labeling, and training. Preventa-
tive measures include input validation checks, integrity verification of training
datasets, and anomaly detection to flag suspicious patterns. In parallel, robust
training algorithms can limit the influence of mislabeled or manipulated data
by down-weighting or filtering out anomalous instances. While no single tech-
nique guarantees immunity, combining proactive data governance, automated
monitoring, and robust learning practices is essential for maintaining model
integrity in real-world deployments.
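One simple instance of such anomaly detection is a neighborhood label-consistency check, sketched below with scikit-learn. The disagreement threshold and the simulated label flips are assumptions meant only to illustrate the idea.

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def flag_suspicious_labels(X, y, k=10, min_support=0.2):
    # Flag points whose label is supported by very few of their nearest neighbors,
    # a crude screen for label-flipping poisoning.
    knn = KNeighborsClassifier(n_neighbors=k).fit(X, y)
    neighbor_vote = knn.predict_proba(X)               # per-class neighbor fractions
    own_label_support = neighbor_vote[np.arange(len(y)), y]
    return own_label_support < min_support

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 2))
y = (X[:, 0] > 0).astype(int)
y[:10] = 1 - y[:10]                                    # simulate a small poisoned batch
print("flagged", int(flag_suspicious_labels(X, y).sum()), "of", len(y), "examples")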
Adversarial attacks exploit models' sensitivity to small, targeted perturbations that can drastically alter output confidence or classification results. A single image, for instance, can be subtly altered by changing only a few pixel values, such that a classifier misidentifies a stop sign as a speed
limit sign. In natural language processing, specially crafted input sequences
may trigger toxic or misleading outputs in a generative model, even when the
prompt appears benign to a human reader (Ramesh et al. 2021; Rombach et al.
2022).
Adversarial attacks pose critical safety and security risks in domains such as
autonomous driving, biometric authentication, and content moderation. Unlike
data poisoning, which corrupts the model during training, adversarial attacks
manipulate the model’s behavior at test time, often without requiring any access
to the training data or model internals. The attack surface thus shifts from
upstream data pipelines to real-time interaction, demanding robust defense
mechanisms capable of detecting or mitigating malicious inputs at the point of
inference.
Adversarial example generation can be formally described as a constrained optimization problem, where the attacker seeks to find a minimally perturbed version of a legitimate input that maximizes the model's prediction error. Given an input $x$ with true label $y$, the attacker's objective is to find a perturbed input $x' = x + \delta$, with the perturbation constrained to be small (e.g., $\|\delta\| \le \epsilon$), that maximizes the model's loss:

\max_{\|\delta\| \le \epsilon} \mathcal{L}(f(x + \delta), y)
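A widely used instance of this formulation is the fast gradient sign method (FGSM), which takes a single step of size epsilon in the direction of the sign of the input gradient. The sketch below applies it to a toy linear model; the model, label, and epsilon are illustrative assumptions.

import numpy as np

def fgsm_perturb(x, grad_loss_wrt_x, epsilon=0.1):
    # One L-infinity bounded step that increases the loss.
    return x + epsilon * np.sign(grad_loss_wrt_x)

# Toy linear classifier: loss = -y * (w . x), so the input gradient is -y * w.
w = np.array([1.0, -2.0, 0.5])
x = np.array([0.2, 0.1, -0.4])
y = 1
grad = -y * w
x_adv = fgsm_perturb(x, grad)
print("score before:", float(w @ x), "score after:", float(w @ x_adv))  # score drops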
These threat types span different stages of the ML lifecycle and demand dis-
tinct defensive strategies. Table 15.4 below summarizes their key characteristics.
Table 15.4: Summary of threat types to ML models by lifecycle stage and attack vector.

Threat Type | Lifecycle Stage | Attack Vector | Example Impact
Model Theft | Deployment | API access, insider leaks | Stolen IP, model inversion, behavioral clone
Data Poisoning | Training | Label flipping, backdoors | Targeted misclassification, degraded accuracy
Adversarial Attacks | Inference | Input perturbation | Real-time misclassification, safety failure
The appropriate defense for a given threat depends on its type, attack vector,
and where it occurs in the ML lifecycle. Figure 15.6 provides a simplified
decision flow that connects common threat categories, such as model theft, data
poisoning, and adversarial examples, to corresponding defensive strategies.
While real-world deployments may require more nuanced or layered defenses,
this flowchart serves as a conceptual guide for aligning threat models with
practical mitigation techniques.
The security of machine learning systems also depends on the hardware platforms they run on. Whether deployed in data centers, on edge devices, or
in embedded systems, machine learning applications rely on a layered stack of
processors, accelerators, memory, and communication interfaces. These hard-
ware components, while essential for enabling efficient computation, introduce
unique security risks that go beyond traditional software-based vulnerabilities.
Unlike general-purpose software systems, machine learning workflows often
process high-value models and sensitive data in performance-constrained envi-
ronments. This makes them attractive targets not only for software attacks but
also for hardware-level exploitation. Vulnerabilities in hardware can expose
models to theft, leak user data, disrupt system reliability, or allow adversaries
to manipulate inference results. Because hardware operates below the software
stack, such attacks can bypass conventional security mechanisms and remain
difficult to detect.
These hardware threats arise from multiple sources, including design flaws in
hardware architectures, physical tampering, side-channel leakage, and supply
chain compromises. Together, they form a critical attack surface that must be
addressed to build trustworthy machine learning systems.
Table 15.5 summarizes the major categories of hardware security threats,
describing their origins, methods, and implications for machine learning system
design and deployment.
While these attacks were first demonstrated on general-purpose CPUs, their implications extend to machine learning accelerators and specialized hardware. (One variant of these attacks compromised Intel's SGX enclaves, allowing data leaks from supposedly secure memory regions.) ML systems often rely on heterogeneous compute platforms that combine CPUs with GPUs and dedicated accelerators.
Figure 15.11 shows the case where the password is entirely incorrect (0x30,
0x30, 0x30, 0x30, 0x30). Here, the device detects the mismatch immediately
after the first byte and halts processing much earlier. This is again visible in
the power profile, where the blue line exhibits a sharp jump following the first
byte, reflecting the device’s early termination of authentication.
These examples demonstrate how attackers can exploit observable power con-
sumption differences to reduce the search space and eventually recover secret
data through brute-force analysis. For a more detailed walkthrough, Video 10
provides a step-by-step demonstration of how these attacks are performed.
Important 10: Power Attack (video available on YouTube).
Failure to secure these access points risks undermining the entire system's trustworthiness. Defenses must therefore span model-level, system-level, and hardware-level security, together with data security, to ensure that ML systems can operate reliably and securely in the real world.
A randomized algorithm $\mathcal{A}$ satisfies $\epsilon$-differential privacy if, for any two datasets $D$ and $D'$ differing in one record, and for all outputs $S \subseteq \text{Range}(\mathcal{A})$, the following holds:

\Pr[\mathcal{A}(D) \in S] \le e^{\epsilon} \, \Pr[\mathcal{A}(D') \in S]
This bound ensures that the algorithm’s behavior remains statistically indis-
tinguishable regardless of whether any individual’s data is present, thereby
limiting the information that can be inferred about that individual. In practice,
DP is implemented by adding calibrated noise to model updates or query re-
sponses, using mechanisms such as the Laplace or Gaussian mechanism. Train-
ing techniques like differentially private stochastic gradient descent (DP-SGD)
integrate noise into the optimization process to ensure per-iteration privacy
guarantees.
While differential privacy offers strong theoretical assurances, it introduces
a trade-off between privacy and utility. Increasing the noise to reduce 𝜖 may
degrade model accuracy, especially in low-data regimes or fine-grained classifi-
cation tasks. Consequently, DP is often applied selectively—either during train-
ing on sensitive datasets or at inference when returning aggregate statistics—to
balance privacy with performance goals (Dwork and Roth 2013).
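As a concrete illustration, the sketch below implements the Laplace mechanism for releasing a single count. The sensitivity bound and epsilon values are assumptions chosen for the example; DP-SGD applies the same principle to clipped, noised gradients at every training step.

import numpy as np

def laplace_mechanism(true_value, sensitivity, epsilon, rng=None):
    # Add Laplace noise scaled to sensitivity / epsilon for an epsilon-DP release.
    rng = rng or np.random.default_rng()
    return true_value + rng.laplace(loc=0.0, scale=sensitivity / epsilon)

# Privately release a count of 1,203 records (one record changes the count by at most 1).
for eps in (0.1, 1.0, 10.0):
    print(eps, round(laplace_mechanism(1203, sensitivity=1.0, epsilon=eps), 1))

Smaller epsilon values add more noise and therefore stronger privacy, directly exposing the privacy-utility tradeoff discussed above.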
In federated learning, the server aggregates client updates into a new global model, weighting each client by its share of the data:

\theta_{t+1} = \sum_{k=1}^{K} \frac{n_k}{n} \, \theta_t^{(k)}

Here, $\theta_t^{(k)}$ represents the model update from client $k$, $n_k$ the number of samples held by that client, and $n$ the total number of samples across all clients.
This weighted aggregation allows the global model to learn from distributed
data without direct access to it. While FL reduces the exposure of raw data,
it still leaks information through gradients, motivating the use of DP, secure
aggregation, and hardware-based protections in federated settings.
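A minimal sketch of this weighted aggregation, assuming clients return full parameter vectors as NumPy arrays, is shown below.

import numpy as np

def federated_average(client_updates, client_sizes):
    # Weight each client's parameters by its share of the total data.
    total = float(sum(client_sizes))
    return sum((n / total) * theta for n, theta in zip(client_sizes, client_updates))

updates = [np.array([1.0, 2.0]), np.array([3.0, 0.0]), np.array([0.0, 1.0])]
sizes = [100, 300, 600]
print(federated_average(updates, sizes))  # [1.0, 0.8]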
To address scenarios requiring computation on encrypted data, homomor-
phic encryption (HE) and secure multiparty computation (SMPC) allow models
to perform inference or training over encrypted inputs. In the case of HE, opera-
tions on ciphertexts correspond to operations on plaintexts, enabling encrypted
inference:
Enc(𝑓(𝑥)) = 𝑓(Enc(𝑥))
This property supports privacy-preserving computation in untrusted envi-
ronments, such as cloud inference over sensitive health or financial records.
However, the computational cost of HE remains high, making it more suitable
for fixed-function models and low-latency batch tasks. SMPC, by contrast,
distributes the computation across multiple parties such that no single party
learns the complete input or output. This is particularly useful in joint training
across institutions with strict data-use policies, such as hospitals or banks.
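To make the SMPC idea concrete, the sketch below implements additive secret sharing over a prime field, allowing two parties to compute a sum without revealing their inputs. The prime, party count, and values are assumptions; real deployments rely on hardened protocols and vetted libraries.

import secrets

PRIME = 2**61 - 1  # all arithmetic is performed modulo a large prime

def share(value, n_parties=3):
    # Split an integer into n additive shares; any n-1 shares reveal nothing.
    shares = [secrets.randbelow(PRIME) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % PRIME)
    return shares

def reconstruct(shares):
    return sum(shares) % PRIME

# Two hospitals jointly compute a sum without pooling their raw values.
a_shares, b_shares = share(1200), share(3400)
sum_shares = [(sa + sb) % PRIME for sa, sb in zip(a_shares, b_shares)]
print(reconstruct(sum_shares))  # 4600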
Careful choices of model architecture can also reduce the risk of reverse engineering or information leakage
through side-channel analysis. In some cases, model designers may embed
imperceptible watermarks, which are unique signatures embedded in the pa-
rameters or behavior of the model, that can later be used to demonstrate own-
ership in cases of misappropriation (Uchida et al. 2017). These design-time
protections are particularly important for commercially valuable models, where
intellectual property rights are at stake.
Once training is complete, the model must be securely packaged for deploy-
ment. Storing models in plaintext formats, including unencrypted ONNX or
PyTorch checkpoint files, can expose internal structures and parameters to
attackers with access to the file system or memory. To mitigate this risk, models
should be encrypted, obfuscated, or wrapped in secure containers. Decryption
keys should be made available only at runtime and only within trusted envi-
ronments. Additional mechanisms, such as quantization-aware encryption or
integrity-checking wrappers, can prevent tampering and offline model theft.
Deployment environments must also enforce strong access control policies
to ensure that only authorized users and services can interact with inference
endpoints. Authentication protocols, including OAuth tokens, mutual TLS,
or API keys, should be combined with role-based access control (RBAC) to
restrict access according to user roles and operational context. For instance,
OpenAI’s hosted model APIs require users to include an OPENAI_API_KEY
when submitting inference requests. This key authenticates the client and
enables the backend to enforce usage policies, monitor for abuse, and log access
patterns. A simplified example of secure usage is shown in Listing 15.1, where
the API key is securely loaded from an environment variable before being used
to authenticate requests.
In this example, the API key is retrieved from an environment variable—
avoiding the security risk of hardcoding it into source code or exposing it
to the client side. Such key-based access control mechanisms are simple to
implement but require careful key management and monitoring to prevent
misuse, unauthorized access, or model extraction.
Beyond endpoint access, the integrity of the deployment pipeline itself must
also be protected. Continuous integration and deployment (CI/CD) workflows
that automate model updates should enforce cryptographic signing of artifacts,
dependency validation, and infrastructure hardening. Without these controls,
adversaries could inject malicious models or alter existing ones during the build
and deployment process. Verifying model signatures and maintaining audit
trails helps ensure that only authorized models are deployed into production.
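A minimal sketch of artifact integrity checking is shown below, using an HMAC tag from the Python standard library. The shared key and function names are illustrative assumptions; real pipelines typically use asymmetric signatures and managed key storage rather than a single shared secret.

import hashlib
import hmac

def sign_artifact(model_bytes, signing_key):
    # Produce an HMAC-SHA256 tag for a serialized model artifact.
    return hmac.new(signing_key, model_bytes, hashlib.sha256).hexdigest()

def verify_artifact(model_bytes, signing_key, expected_tag):
    # Constant-time comparison to reject tampered or unsigned artifacts.
    return hmac.compare_digest(sign_artifact(model_bytes, signing_key), expected_tag)

key = b"ci-signing-key"                      # in practice, held in an HSM or secret store
artifact = b"...serialized model weights..."
tag = sign_artifact(artifact, key)
print(verify_artifact(artifact, key, tag))               # True
print(verify_artifact(artifact + b"tampered", key, tag))  # False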
When applied together, these practices protect against a range of threats—
from model theft and unauthorized inference access to tampering during de-
ployment and output manipulation at runtime. No single mechanism suffices
in isolation, but a layered strategy, beginning at the design phase and extend-
ing through deployment, provides a strong foundation for securing machine
learning systems under real-world conditions.
Listing 15.1: Example of securely loading an API key for OpenAI’s GPT-4 model.
The API key is retrieved from an environment variable to avoid hardcoding
sensitive information in the source code.
import openai
import os

# Load the API key from an environment variable rather than hardcoding it.
openai.api_key = os.environ["OPENAI_API_KEY"]

# Illustrative request (prompt content is a placeholder); uses the legacy
# pre-1.0 openai Python client interface.
response = openai.ChatCompletion.create(
    model="gpt-4", messages=[{"role": "user", "content": "Hello!"}]
)
print(response.choices[0].message["content"])
If a deployed model begins assigning unusually high confidence to rare classes, this may indicate the presence of adversarial inputs or a shift in
the underlying data distribution. Monitoring the entropy of the output dis-
tribution can similarly reveal when the model is overly certain in ambiguous
contexts—an early signal of possible manipulation.
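The sketch below shows one way to compute and threshold prediction entropy. The threshold and function names are assumptions; a deployed monitor would calibrate them against historical traffic.

import numpy as np

def prediction_entropy(probs, eps=1e-12):
    # Shannon entropy (in nats) of a predicted class distribution.
    p = np.clip(np.asarray(probs, dtype=float), eps, 1.0)
    p = p / p.sum()
    return float(-(p * np.log(p)).sum())

def flag_overconfident(probs, threshold=0.05):
    # Flag predictions that are suspiciously certain for human review.
    return prediction_entropy(probs) < threshold

print(prediction_entropy([0.34, 0.33, 0.33]))       # near log(3): the model is unsure
print(flag_overconfident([0.999, 0.0005, 0.0005]))  # True: unusually confident output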
In content moderation systems, a model that normally outputs neutral or
“safe” labels may suddenly begin producing high-confidence “safe” labels for
inputs containing offensive or restricted content. Output monitoring can detect
this mismatch by comparing predictions against auxiliary signals or known-
safe reference sets. When deviations are detected, the system may trigger a
fallback policy—such as escalating the content for human review or switching
to a conservative baseline model.
Time-series models also benefit from output monitoring. For instance, an
anomaly detection model used in fraud detection might track predicted fraud
scores for sequences of financial transactions. A sudden drop in fraud scores,
especially during periods of high transaction volume, may indicate model tam-
pering, label leakage, or evasion attempts. Monitoring the temporal evolution of
predictions provides a broader perspective than static, pointwise classification.
Generative models, such as text-to-image systems, introduce unique output
monitoring challenges. These models can produce high-fidelity imagery that
may inadvertently violate content safety policies, platform guidelines, or user
expectations. To mitigate these risks, post-generation classifiers are commonly
employed to assess generated content for objectionable characteristics such as
violence, nudity, or brand misuse. These classifiers operate downstream of the
generative model and can suppress, blur, or reject outputs based on predefined
thresholds. Some systems also inspect internal representations (e.g., attention
maps or latent embeddings) to anticipate potential misuse before content is
rendered.
However, prompt filtering alone is insufficient for safety. Research has shown
that text-to-image systems can be manipulated through implicitly adversarial
prompts, which are queries that appear benign but lead to policy-violating
outputs. The Adversarial Nibbler project introduces an open red teaming
methodology that identifies such prompts and demonstrates how models like
Stable Diffusion can produce unintended content despite the absence of explicit
trigger phrases (Quaye et al. 2024). These failure cases often bypass prompt
filters because their risk arises from model behavior during generation, not
from syntactic or lexical cues.
By physically separating secure execution and key management from the main
system, this architecture limits the impact of system-level compromises and
forms the foundation of hardware-enforced trust.
Figure 15.14: System-on-chip secure enclave. The Secure Enclave contains its own processor, AES engine, TRNG, PKA, and memory protection engine, separated from the application processor, memory controller, NAND flash controller, DRAM, and NAND flash storage on the same system on chip. Source: Apple.
Integrating TEEs can be complex, however, particularly for legacy infrastructure. Developers must adhere to strict protocols for isolation,
attestation, and secure update management, which can extend development
cycles and complicate testing workflows. TEEs can also introduce performance
overhead, particularly when cryptographic operations are involved, or when
context switching between trusted and untrusted modes is frequent.
Energy efficiency is another consideration, particularly in battery-constrained
devices. TEEs typically consume additional power due to secure memory ac-
cesses, cryptographic computation, and hardware protection logic. In resource-
limited embedded systems, these costs may limit their use. In terms of scala-
bility and flexibility, the secure boundaries enforced by TEEs may complicate
distributed training or federated inference workloads, where secure coordina-
tion between enclaves is required.
Market demand also varies. In some consumer applications, perceived threat
levels may be too low to justify the integration of TEEs. Moreover, systems
with TEEs may be subject to formal security certifications, such as Common
Criteria or evaluation under ENISA, which can introduce additional time and
expense. For this reason, TEEs are typically adopted only when the expected
threat model, including adversarial users, cloud tenants, and malicious insiders,
justifies the investment.
Nonetheless, TEEs remain a powerful hardware primitive in the machine
learning security landscape. When paired with software- and system-level
defenses, they provide a trusted foundation for executing ML models securely,
privately, and verifiably, especially in scenarios where adversarial compromise
of the host environment is a serious concern.
[Figure: secure boot flow from power-up to boot abort on verification failure.]
Integration complexity also grows when HSMs are introduced into existing
ML pipelines. Interfacing between the HSM and the host processor requires
dedicated APIs and often specialized software development. Firmware and
model updates must be routed through secure, signed channels, and update
orchestration must account for device-specific key provisioning. These require-
ments increase the operational burden, especially in large deployments.
Scalability presents its own set of challenges. Managing a distributed fleet of
HSM-equipped devices requires secure provisioning of individual keys, secure
identity binding, and coordinated trust management. In large ML deployments,
including fleets of smart sensors or edge inference nodes, ensuring uniform
security posture across all devices is nontrivial.
Finally, the use of HSMs often requires organizations to engage in certifi-
cation and compliance processes, particularly when handling regulated data.
Meeting standards such as FIPS 140-2 or Common Criteria adds time and cost
to development. Access to the HSM is typically restricted to a small set of
authorized personnel, which can complicate development workflows and slow
iteration cycles.
Despite these operational complexities, HSMs remain a valuable option
for machine learning systems that require high assurance of cryptographic
integrity and access control. When paired with TEEs, secure boot, and software-
based defenses, HSMs contribute to a multilayered security model that spans
hardware, system software, and ML runtime.
Figure 15.17 provides a conceptual framework to guide this process across tech-
nical and deployment dimensions. The design flow begins with a thorough
assessment of the threat model and deployment context, which informs the
selection of appropriate defenses across the system stack. This includes data-
layer protections such as differential privacy (DP), federated learning (FL), and
encryption; model-layer defenses like robustness techniques, watermarking,
and secure deployment practices; runtime-layer measures such as input vali-
dation and output monitoring; and hardware-layer solutions including TEEs,
secure boot, and PUFs.
Each of these scenarios illustrates how machine learning models can serve as
amplifiers of adversarial capability. For example, language models enable more
convincing and adaptable phishing attacks, while clustering and classification
algorithms facilitate reconnaissance by learning system-level behavioral pat-
terns. Similarly, adversarial example generators and inference models system-
atically uncover weaknesses in decision boundaries or data privacy protections,
often requiring only limited external access to deployed systems. In hardware
contexts, as discussed in the next section, deep neural networks trained on side-
channel data can automate the extraction of cryptographic secrets from physical
measurements—transforming an expert-driven process into a learnable pattern
recognition task.
Although these applications differ in technical implementation, they share
a common foundation: the adversary replaces a static exploit with a learned
model capable of approximating or adapting to the target’s vulnerable behav-
ior. This shift increases flexibility, reduces manual overhead, and improves
robustness in the face of evolving or partially obscured defenses.
What makes this class of threats particularly significant is their favorable
scaling behavior. Just as accuracy in computer vision or language modeling
improves with additional data, larger architectures, and greater compute re-
sources, so too does the performance of attack-oriented machine learning mod-
els. A model trained on larger corpora of phishing attempts or power traces, for
instance, may generalize more effectively, evade more detectors, or require fewer
inputs to succeed. The same ecosystem that drives innovation in beneficial AI,
15.8. Offensive Capabilities 842
15.9 Conclusion
Security and privacy are foundational to the deployment of machine learning
systems in real-world environments. As ML moves beyond the lab and into pro-
duction, as it is deployed across cloud services, edge devices, mobile platforms,
and critical infrastructure, the threats it faces become more complex and more
consequential. From model theft and data leakage to adversarial manipulation
and hardware compromise, securing ML systems requires a comprehensive
understanding of the entire software and hardware stack.
This chapter explored these challenges from multiple angles. We began by
examining real-world security incidents and threat models that impact ML
systems, including attacks on training data, inference pipelines, and deployed
models. We then discussed defense strategies that operate at different layers of
the system: from data privacy techniques like differential privacy and federated
learning, to robust model design, secure deployment practices, runtime moni-
toring, and hardware-enforced trust. Each of these layers addresses a distinct
surface of vulnerability, and together they form the basis of a defense-in-depth
approach.
Importantly, security is not a static checklist. It is an evolving process shaped
by the deployment context, the capabilities of adversaries, and the risk tolerance
of stakeholders. What protects a publicly exposed API may not suffice for an
embedded medical device or a distributed fleet of autonomous systems. The
effectiveness of any given defense depends on how well it fits into the larger
system and how it interacts with other components, users, and constraints.
The goal of this chapter was not to catalog every threat or prescribe a fixed set
of solutions. Rather, it was to help build the mindset needed to design secure,
private, and trustworthy ML systems—systems that perform reliably under
pressure, protect the data they rely on, and respond gracefully when things go
wrong.
As we look ahead, security and privacy will remain intertwined with other
system concerns: robustness, fairness, sustainability, and operational scale.
In the chapters that follow, we will explore these additional dimensions and
extend the foundation laid here toward the broader challenge of building ML
systems that are not only performant, but responsible, reliable, and resilient by
design.
15.10 Resources
Slides
• Coming soon.
Videos
• Coming soon.
Exercises
• Coming soon.
Chapter 16
Responsible AI
Purpose
How do human values translate into machine learning systems architecture, and what
principles enable responsible system behavior at scale?
Machine learning systems do not exist in isolation—they operate within
social, economic, and technical environments where their outputs affect people
and institutions. As these systems grow in capability and reach, questions
of responsibility become central to their design. The integration of fairness,
transparency, and accountability is not an afterthought but a systems-level
constraint that shapes data pipelines, model architectures, and deployment
strategies. Recognizing the moral dimension of engineering choices is essential
for building machine learning systems that serve human needs, avoid harm,
and support long-term trust in automation.
Learning Objectives
16.1 Overview
Machine learning systems are increasingly deployed in high-stakes domains
such as healthcare, criminal justice, and employment. As their influence ex-
pands, so do the risks of embedding bias, compromising privacy, and enabling
unintended harms. For example, a loan approval model trained exclusively on
data from high-income neighborhoods may unfairly penalize applicants from
underrepresented communities, reinforcing structural inequities.
Because these systems are trained on historical data, they are susceptible to reproducing and amplifying patterns of systemic bias² embedded in that data. Without careful design, machine learning systems may unintentionally reinforce social inequities rather than mitigate them.
A widely studied example comes from the healthcare domain. An algorithm used to allocate care management resources in U.S. hospitals was found to systematically underestimate the health needs of Black patients (Obermeyer et al. 2019). The model used healthcare expenditures as a proxy for health status, but due to longstanding disparities in access and spending, Black patients were less likely to incur high costs. As a result, the model inferred that they were less sick, despite often having equal or greater medical need. This case illustrates how seemingly neutral design choices such as proxy variable³ selection can yield discriminatory outcomes when historical inequities are not properly accounted for.
² Systemic Bias: Deep-rooted bias in societal structures, often unconsciously perpetuated.
³ Proxy Variable: A variable used to indirectly represent another where direct measures are unavailable.
To evaluate fairness, a range of formal criteria have been developed that quantify how models perform across groups defined by sensitive attributes.
Suppose a model ℎ(𝑥) predicts a binary outcome, such as loan repayment, and
let 𝑆 represent a sensitive attribute with subgroups 𝑎 and 𝑏. Several widely
used fairness definitions are:
𝑃 (ℎ(𝑥) = 1 ∣ 𝑆 = 𝑎) = 𝑃 (ℎ(𝑥) = 1 ∣ 𝑆 = 𝑏)
This means the model assigns favorable outcomes, such as loan approval
or treatment referral, at equal rates across subgroups defined by a sensitive
attribute 𝑆.
In the healthcare example, demographic parity would ask whether Black
and white patients were referred for care at the same rate, regardless of their
underlying health needs. While this might seem fair in terms of equal access, it
ignores real differences in medical status and risk, potentially overcorrecting in
situations where needs are not evenly distributed.
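In practice, demographic parity can be audited directly from logged predictions. The sketch below computes the parity gap for two groups; the group labels and data are illustrative assumptions.

import numpy as np

def demographic_parity_gap(y_pred, group):
    # Absolute difference in positive-prediction rates between groups "a" and "b".
    y_pred, group = np.asarray(y_pred), np.asarray(group)
    rate_a = y_pred[group == "a"].mean()
    rate_b = y_pred[group == "b"].mean()
    return abs(rate_a - rate_b), rate_a, rate_b

y_pred = [1, 0, 1, 1, 0, 1, 0, 0]
group = ["a", "a", "a", "a", "b", "b", "b", "b"]
gap, rate_a, rate_b = demographic_parity_gap(y_pred, group)
print(f"P(h=1|a)={rate_a:.2f}, P(h=1|b)={rate_b:.2f}, gap={gap:.2f}")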
A stricter criterion, equalized odds, requires

P(h(x) = 1 \mid Y = y, S = a) = P(h(x) = 1 \mid Y = y, S = b) \quad \text{for all } y

That is, for each true outcome $Y = y$, the model should produce the same prediction distribution across groups $S = a$ and $S = b$. This means the model should behave similarly across groups for individuals with the same true outcome.
How anomalies are detected, and how updates are governed, all influence whether a system can
respond effectively to changing conditions. Responsible design demands that
robustness be treated not as a property of isolated models but as a constraint
that shapes the overall behavior of machine learning systems.
Stakeholders must be able to understand and evaluate model behavior within the operational limits of the
deployment environment.
Privacy risks also extend to the serving and monitoring layers. A model with
logging enabled, or one that updates through active learning, may inadvertently
expose sensitive information if logging infrastructure is not privacy-aware. For
example, membership inference attacks can reveal whether a user's data was
included in training by analyzing model outputs. Defending against such
attacks requires that privacy-preserving measures extend beyond training and
into interface design, rate limiting, and access control.
Crucially, privacy is not determined solely by technical mechanisms but by
how users experience the system. A model may meet formal privacy definitions
and still violate user expectations if data collection is opaque or explanations are
lacking. Interface design plays a central role: systems must clearly communicate
what data is collected, how it is used, and how users can opt out or revoke
consent. In privacy-sensitive applications, failure to align with user norms can
erode trust even in technically compliant systems.
Architectural decisions thus influence privacy at every stage of the data
lifecycle—from acquisition and preprocessing to inference and monitoring.
Designing for privacy involves not only choosing secure algorithms, but also
making principled tradeoffs based on deployment constraints, user needs,
and legal obligations. In high-resource settings, this may involve centralized
enforcement and policy tooling. In constrained environments, privacy must be
embedded statically in model design and system behavior, often without the
possibility of dynamic oversight.
Privacy is not a feature to be appended after deployment. It is a system-level
property that must be planned, implemented, and validated in concert with
the architectural realities of the deployment environment.
domains may rely more heavily on internal practices, shaped by customer ex-
pectations, reputational concerns, or technical conventions. Regardless of the
setting, governance must be treated as a system-level design property—not an
external policy overlay. It is implemented through the structure of codebases,
deployment pipelines, data flows, and decision interfaces.
Sustaining accountability across diverse deployment environments requires
planning not only for success, but for failure. This includes defining how anoma-
lies are detected, how roles are assigned, how records are maintained, and how
remediation occurs. These processes must be embedded in infrastructure—
traceable in logs, enforceable through interfaces, and resilient to the architec-
tural constraints of the system's deployment context.
Table 16.2: Comparison of key principles across Cloud, Edge, Mobile, and TinyML deployments.

Principle | Cloud ML | Edge ML | Mobile ML | TinyML
Explainability | Supports complex models and methods like SHAP and sampling approaches | Needs lightweight, low-latency methods like saliency maps | Requires interpretable outputs for users, often defers deeper analysis to the cloud | Severely limited due to constrained hardware; mostly static or compile-time only
Fairness | Large datasets enable bias detection and mitigation | Localized biases harder to detect but allows on-device adjustments | High personalization complicates group-level fairness tracking | Minimal data limits bias analysis and mitigation
Privacy | Centralized data at risk of breaches but can utilize strong encryption and differential privacy methods | Sensitive personal data on-device requires on-device protections | Tight coupling to user identity requires consent-aware design and local processing | Distributed data reduces centralized risks but poses challenges for anonymization
Safety | Vulnerable to hacking and large-scale attacks | Real-world interactions make reliability critical | Operates under user supervision, but still requires graceful failure | Needs distributed safety mechanisms due to autonomy
Accountability | Corporate policies and audits enable traceability and oversight | Fragmented supply chains complicate accountability | Requires clear user-facing disclosures and feedback paths | Traceability required across long, complex hardware chains
Governance | External oversight and regulations like GDPR or CCPA are feasible | Requires self-governance by developers and integrators | Balances platform policy with app developer choices | Relies on built-in protocols and cryptographic assurances
methods show how to detect and mitigate bias, preserve user privacy, improve
robustness, and support interpretability—not as abstract ideals, but as system
behaviors that can be engineered, tested, and maintained. Their effectiveness
depends not only on their theoretical properties, but on how well they align with
practical constraints such as data quality, resource availability, user interaction
models, and deployment architecture.
These methods are not interchangeable or universally applicable. Each in-
troduces tradeoffs involving accuracy, latency, scalability, and implementation
complexity. Choosing the right approach requires understanding the method's
purpose, its assumptions, and the demands it places on the surrounding system.
Moreover, technical interventions must be evaluated not just at the model level,
but across the machine learning lifecycle, including data acquisition, training,
deployment, monitoring, and updating.
This section presents representative techniques for operationalizing respon-
sible AI principles in practice. Each method is introduced with attention to its
role within the system, its typical use cases, and the architectural requirements
it imposes. While no single method ensures responsible behavior in isolation,
together these tools form the foundation for building machine learning systems
that perform reliably and uphold societal and ethical expectations.
In such cases, subgroups may require different decision thresholds to achieve equal true positive rates. Using a single
threshold across groups leads to disparate outcomes, potentially disadvantag-
ing Subgroup B. Addressing this imbalance by adjusting thresholds per group
may improve fairness, but doing so requires support for conditional logic in
the model serving stack, access to sensitive attributes at inference time, and
a governance framework for explaining and justifying differential treatment
across groups.
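The sketch below illustrates per-group threshold selection on synthetic scores. The score distributions, target true positive rate, and function name are assumptions, and real systems must also weigh the legal and policy implications of group-conditional decisions.

import numpy as np

def threshold_for_tpr(scores, labels, target_tpr=0.80):
    # Choose the threshold that keeps roughly the top target_tpr fraction
    # of true positives above it.
    pos_scores = np.sort(np.asarray(scores)[np.asarray(labels) == 1])
    idx = int(np.floor((1.0 - target_tpr) * len(pos_scores)))
    return float(pos_scores[idx])

rng = np.random.default_rng(0)
# Subgroup B's scores are systematically lower, even for true positives.
scores_a = rng.normal(0.6, 0.2, 500)
labels_a = (scores_a + rng.normal(0, 0.1, 500) > 0.55).astype(int)
scores_b = rng.normal(0.4, 0.2, 500)
labels_b = (scores_b + rng.normal(0, 0.1, 500) > 0.35).astype(int)

print("threshold for A:", round(threshold_for_tpr(scores_a, labels_a), 3))
print("threshold for B:", round(threshold_for_tpr(scores_b, labels_b), 3))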
Moreover, privacy must be enforced not only during training but throughout
the machine learning lifecycle. Retraining pipelines must account for deleted or
revoked data, especially in jurisdictions with data deletion mandates. Monitor-
ing infrastructure must avoid recording personally identifiable information in
logs or dashboards. Privacy-aware telemetry collection, secure enclave deploy-
ment, and per-user audit trails are increasingly used to support these goals,
particularly in applications with strict legal oversight.
Architectural decisions also vary by deployment context. Cloud-based sys-
tems may rely on centralized enforcement of differential privacy, encryption,
and access control, supported by telemetry and retraining infrastructure. In
contrast, edge and TinyML systems must build privacy constraints into the
deployed model itself, often with no runtime configurability or feedback chan-
nel. In such cases, static analysis, conservative design, and embedded privacy
guarantees must be implemented at compile time, with validation performed
prior to deployment.
Ultimately, privacy is not an attribute of a model in isolation but a system-level
property that emerges from design decisions across the pipeline. Responsible
privacy preservation requires that technical safeguards, interface controls, in-
frastructure policies, and regulatory compliance mechanisms work together to
minimize risk throughout the lifecycle of a deployed machine learning system.
Machine unlearning techniques aim to remove the influence of specific training records, yielding a model that behaves as if it had been learned without the deleted data (Bourtoule et al. 2021). These techniques
are still maturing and may require simplified model architectures, additional
tracking metadata, or compromise on model accuracy and stability. They also
introduce new burdens around verification: how to prove that deletion has
occurred in a meaningful way, especially when internal model state is not fully
interpretable.
The motivation for machine unlearning is reinforced by regulatory frame-
works. Laws such as the General Data Protection Regulation (GDPR), the
California Consumer Privacy Act (CCPA), and similar statutes in Canada and
Japan codify the right to be forgotten, including for data used in model train-
ing. These laws increasingly require not just prevention of unauthorized data
access, but proactive revocation—empowering users to request that their infor-
mation cease to influence downstream system behavior. High-profile incidents
in which generative models have reproduced personal content or copyrighted
data highlight the practical urgency of integrating unlearning mechanisms into
responsible system design.
From a systems perspective, machine unlearning introduces nontrivial ar-
chitectural and operational requirements. Systems must be able to track data
lineage, including which datapoints contributed to a given model version. This
often requires structured metadata capture and training pipeline instrumen-
tation. Additionally, systems must support user-facing deletion workflows,
including authentication, submission, and feedback on deletion status. Verifi-
cation may require maintaining versioned model registries, along with mech-
anisms for confirming that the updated model exhibits no residual influence
from the deleted data. These operations must span data storage, training or-
chestration, model deployment, and auditing infrastructure, and they must be
robust to failure or rollback.
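The lineage bookkeeping described above can be sketched in a few lines; all class and field names here are hypothetical, and a production system would back this with a versioned model registry and authenticated deletion workflows rather than in-memory objects.

```python
from dataclasses import dataclass, field

@dataclass
class ModelVersion:
    version: str
    training_data_ids: set = field(default_factory=set)  # ids of datapoints seen in training

class LineageRegistry:
    """Tracks which datapoints contributed to which model versions (illustrative)."""

    def __init__(self):
        self.versions = []

    def register(self, version, data_ids):
        self.versions.append(ModelVersion(version, set(data_ids)))

    def affected_by_deletion(self, data_id):
        # Versions that must be retrained, patched, or retired to honor the request.
        return [v.version for v in self.versions if data_id in v.training_data_ids]

registry = LineageRegistry()
registry.register("v1.0", {"user-123", "user-456"})
registry.register("v1.1", {"user-456", "user-789"})
print(registry.affected_by_deletion("user-456"))  # ['v1.0', 'v1.1']
```

The registry answers only the first question, which model versions are affected; verifying that a retrained or edited model no longer reflects the deleted data remains the harder problem noted above.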
These challenges are amplified in resource-constrained deployments. TinyML
systems typically run on devices with no persistent storage, no connectivity,
and highly compressed models. Once deployed, they cannot be updated or
retrained in response to deletion requests. In such settings, machine unlearning
is effectively infeasible post-deployment and must be enforced during initial
model development through static data minimization and conservative gen-
eralization strategies. Even in cloud-based systems, where retraining is more
Figure 16.7: Spectrum of model interpretability, from intrinsically interpretable models (decision trees, linear regression, logistic regression) to less interpretable models (random forests, neural networks, convolutional neural networks). Inherently interpretable models are transparent by design, while complex models require post hoc explanation techniques.
Hybrid approaches aim to combine the representational capacity of deep
models with the transparency of interpretable components. Concept bottleneck
models (Koh et al. 2020), for example, first predict intermediate, interpretable
variables and then use a simple classifier to produce the final prediction. Pro-
toPNet models (C. Chen et al. 2019) classify examples by comparing them to
learned prototypes, offering visual analogies for users to understand predic-
tions. These hybrid methods are attractive in domains that demand partial
transparency, but they introduce new system design considerations, such as
the need to store and index learned prototypes and surface them at inference
time.
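The two-stage structure can be illustrated with a short PyTorch sketch (a toy layout, not the reference implementation from Koh et al.): a backbone predicts human-interpretable concepts, and a simple linear head maps those concepts to the final label, which is what must be stored and surfaced at inference time.

```python
import torch
import torch.nn as nn

class ConceptBottleneck(nn.Module):
    """Toy concept bottleneck: inputs -> concept predictions -> label."""

    def __init__(self, in_dim=64, n_concepts=8, n_classes=3):
        super().__init__()
        self.concept_net = nn.Sequential(          # x -> concept logits
            nn.Linear(in_dim, 128), nn.ReLU(),
            nn.Linear(128, n_concepts),
        )
        self.classifier = nn.Linear(n_concepts, n_classes)  # concepts -> label

    def forward(self, x):
        concepts = torch.sigmoid(self.concept_net(x))  # interpretable intermediate values
        return self.classifier(concepts), concepts     # expose concepts for inspection

logits, concepts = ConceptBottleneck()(torch.randn(4, 64))
```

Training such a model typically supervises both the concept predictions and the final label, which is where the additional data and infrastructure requirements come from.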
A more recent research direction is mechanistic interpretability, which seeks
to reverse-engineer the internal operations of neural networks. This line of
work, inspired by program analysis and neuroscience, attempts to map neurons,
layers, or activation patterns to specific computational functions (Olah et al.
2020; Geiger et al. 2021). Although promising, this field remains exploratory
and is currently most relevant to the analysis of large foundation models where
traditional interpretability tools are insufficient.
From a systems perspective, explainability introduces a number of architec-
tural dependencies. Explanations must be generated, stored, surfaced, and
evaluated within system constraints. The required infrastructure may include
explanation APIs, memory for storing attribution maps, visualization libraries,
and logging mechanisms that capture intermediate model behavior. Models
must often be instrumented with hooks or configured to support repeated
evaluations—particularly for explanation methods that require sampling, per-
turbation, or backpropagation.
These requirements interact directly with deployment constraints. For in-
stance, methods like SHAP and LIME involve multiple forward passes or sur-
rogate model fitting and may be impractical in latency-sensitive or resource-
constrained environments such as edge devices or real-time decision systems.
In such settings, systems may rely on approximations such as precomputed
explanations, simplified attribution methods, or fallback rule-based logic. Ex-
plainability must also align with interface capabilities: wearable devices, for
example, may support only brief textual or audio explanations, requiring de-
signers to prioritize clarity and brevity.
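The cost argument is easy to see with a simplified occlusion-style attribution, shown below as a stand-in for SHAP or LIME (the real methods are more sophisticated; `predict_fn` and the baseline value are assumptions): explaining a single prediction over d features already requires d + 1 forward passes, which is often unaffordable on latency-constrained hardware.

```python
import numpy as np

def occlusion_attribution(predict_fn, x, baseline=0.0):
    """Crude per-feature attribution: score drop when each feature is occluded.

    predict_fn is assumed to map a batch of inputs to scores. For d features this
    issues d + 1 forward passes, illustrating why perturbation-based explanations
    are expensive in real-time or edge settings.
    """
    base_score = predict_fn(x[None, :])[0]
    attributions = np.zeros(len(x))
    for i in range(len(x)):
        x_pert = x.copy()
        x_pert[i] = baseline                      # occlude one feature
        attributions[i] = base_score - predict_fn(x_pert[None, :])[0]
    return attributions
```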
Explainability spans the full machine learning lifecycle. During develop-
ment, interpretability tools are used for dataset auditing, concept validation,
and early debugging. At inference time, they support accountability, decision
secure manner. This requires telemetry pipelines that capture model version-
ing, input characteristics, prediction confidence, and post-inference feedback.
These logs support drift detection and provide evidence for retrospective audits
of fairness and robustness. Monitoring systems must also be integrated with
alerting, update scheduling, and policy review processes to support timely and
traceable intervention.
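A hedged sketch of the kind of telemetry record described here (field names are illustrative): the log captures model version, coarse input characteristics, and prediction confidence for drift detection and audits, while hashing the user identifier and omitting raw feature values so that no personally identifiable information lands in logs.

```python
import hashlib
import json
import time

def log_prediction(logger, model_version, features, prediction, confidence, user_id):
    """Emit a structured, privacy-conscious telemetry record for one inference."""
    record = {
        "ts": time.time(),
        "model_version": model_version,
        "user": hashlib.sha256(user_id.encode()).hexdigest()[:16],  # pseudonymized id
        "feature_summary": {                      # coarse stats only, no raw values
            "n_features": len(features),
            "n_missing": sum(v is None for v in features.values()),
        },
        "prediction": prediction,
        "confidence": round(float(confidence), 3),
    }
    logger.info(json.dumps(record))
```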
Monitoring also supports feedback-driven improvement. For example, re-
peated user disagreement, correction requests, or operator overrides can sig-
nal problematic behavior. This feedback must be aggregated, validated, and
translated into updates to training datasets, data labeling processes, or model
architecture. However, such feedback loops carry risks: biased user responses
can introduce new inequities, and excessive logging can compromise privacy.
Designing these loops requires careful coordination between user experience
design, system security, and ethical governance.
Monitoring mechanisms vary by deployment architecture. In cloud-based
systems, rich logging and compute capacity allow for real-time telemetry, sched-
uled fairness audits, and continuous integration of new data into retraining
pipelines. These environments support dynamic reconfiguration and central-
ized policy enforcement. However, the volume of telemetry may introduce its
own challenges in terms of cost, privacy risk, and regulatory compliance.
In mobile systems, connectivity is intermittent and data storage is limited.
Monitoring must be lightweight and resilient to synchronization delays. Local
inference systems may collect performance data asynchronously and transmit
it in aggregate to backend systems. Privacy constraints are often stricter, par-
ticularly when personal data must remain on-device. These systems require
careful data minimization and local aggregation techniques to preserve privacy
while maintaining observability.
Edge deployments, such as those in autonomous vehicles, smart factories,
or real-time control systems, demand low-latency responses and operate with
minimal external supervision. Monitoring in these systems must be embed-
ded within the runtime, with internal checks on sensor integrity, prediction
confidence, and behavior deviation. These checks often require low-overhead
implementations of uncertainty estimation, anomaly detection, or consistency
validation. System designers must anticipate failure conditions and ensure that
anomalous behavior triggers safe fallback procedures or human intervention.
TinyML systems, which operate on deeply embedded hardware with no
connectivity, persistent storage, or dynamic update path, present the most
constrained monitoring scenario. In these environments, monitoring must
be designed and compiled into the system prior to deployment. Common
strategies include input range checking, built-in redundancy, static failover
logic, or conservative validation thresholds. Once deployed, these models
operate independently, and any post-deployment failure may require physical
device replacement or firmware-level reset.
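The compiled-in checks described above amount to a few fixed rules; the sketch below expresses them in Python for readability (the thresholds and names are assumptions), whereas a real TinyML deployment would bake the same logic into firmware, typically in C.

```python
# Compile-time-fixed safety envelope for an embedded model (illustrative values).
SENSOR_MIN, SENSOR_MAX = -40.0, 85.0   # assumed valid sensor range
CONFIDENCE_FLOOR = 0.6                 # assumed abstention threshold
FALLBACK_LABEL = "safe_default"        # static failover decision

def guarded_inference(model_fn, reading):
    """Input range check plus conservative fallback, fixed before deployment."""
    if not (SENSOR_MIN <= reading <= SENSOR_MAX):
        return FALLBACK_LABEL          # out-of-range input: trigger static failover
    label, confidence = model_fn(reading)
    if confidence < CONFIDENCE_FLOOR:
        return FALLBACK_LABEL          # low confidence: conservative validation
    return label
```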
Despite these differences, the core challenge is universal: deployed ML sys-
tems must not only perform well initially, but continue to behave responsibly
as the environment changes. Monitoring provides the observability layer that
links system performance to ethical goals and accountability structures. With-
out monitoring, fairness and robustness become invisible. Without feedback,
umentation tools such as model cards¹⁹ and datasheets for datasets²⁰ support this goal by formalizing system metadata in a structured, reproducible format. These resources can improve governance, support compliance, and inform user expectations. However, transparency as disclosure does not guarantee meaningful control. Even when technical details are available, users may lack the institutional leverage, interface tools, or procedural access to contest a decision that adversely affects them.

¹⁹ Model Cards: Tool that provides essential information about a machine learning model's capabilities and biases.

²⁰ Datasheets for Datasets: Documentation that describes a dataset's creation, composition, and intended use.

To move from transparency to contestability, machine learning systems must be designed with mechanisms for explanation, recourse, and feedback. Expla-
nation refers to the capacity of the system to provide understandable reasons
for its outputs, tailored to the needs and context of the person receiving them.
Recourse refers to the ability of individuals to alter their circumstances and re-
ceive a different outcome. Feedback refers to the ability of users to report errors,
dispute outcomes, or signal concerns—and to have those signals incorporated
into system updates or oversight processes.
These mechanisms are often lacking in practice, particularly in systems de-
ployed at scale or embedded in low-resource devices. For example, in mobile
loan application systems, users may receive a rejection without explanation
and have no opportunity to provide additional information or appeal the de-
cision. The lack of transparency at the interface level, even if documentation
exists elsewhere, makes the system effectively unchallengeable. Similarly, a
predictive model deployed in a clinical setting may generate a risk score that
guides treatment decisions without surfacing the underlying reasoning to the
physician. If the model underperforms for a specific patient subgroup, and this
behavior is not observable or contestable, the result may be unintentional harm
that cannot be easily diagnosed or corrected.
From a systems perspective, enabling contestability requires coordination
across technical and institutional components. Models must expose sufficient
information to support explanation. Interfaces must surface this information
in a usable and timely way. Organizational processes must be in place to re-
view feedback, respond to appeals, and update system behavior. Logging and
auditing infrastructure must track not only model outputs, but user interven-
tions and override decisions. In some cases, technical safeguards, including
human-in-the-loop overrides and decision abstention thresholds, may also
serve contestability by ensuring that ambiguous or high-risk decisions defer to
human judgment.
The degree of contestability that is feasible varies by deployment context. In
centralized cloud platforms, it may be possible to offer full explanation APIs,
user dashboards, and appeal workflows. In contrast, in edge and TinyML
deployments, contestability may be limited to logging and periodic updates
based on batch-synchronized feedback. In all cases, the design of machine
learning systems must acknowledge that transparency is not simply a matter
of technical disclosure. It is a structural property of systems that determines
whether users and institutions can meaningfully question, correct, and govern
the behavior of automated decision-making.
Figure: Nested layers of responsible AI governance. At the team level, reliable systems depend on software engineering and technical practices (audit trails, SE workflows, verification and bias testing, explainable UIs). At the organizational level, safety culture and management strategies matter (organizational design, leadership commitment, hiring and training, tracking failures and near misses, internal reviews, industry standards). Surrounding the organization are independent oversight (auditing firms, insurance companies, NGOs and civil society, professional societies) and government regulation.
parties agree on the need for improvement, the logistical and operational costs
can be prohibitive.
Efforts to collect more representative data may also run into ethical and
political concerns. In some cases, additional data collection could expose
marginalized populations to new risks. This paradox of exposure, in which the
individuals most harmed by exclusion are also those most vulnerable to misuse,
complicates efforts to improve fairness through dataset expansion. For example,
gathering more data on non-binary individuals to support fairness in gender-
sensitive applications may improve model coverage, but it also raises serious
concerns around consent, identifiability, and downstream use. Teams must
navigate these tensions carefully, often without clear institutional guidance.
Even when data is plentiful, upstream biases in data collection systems
can persist unchecked. Many organizations rely on third-party data vendors,
external APIs, or operational databases that were not designed with fairness or
interpretability in mind. For instance, Electronic Health Records²⁴, which are commonly used in clinical machine learning, often reflect systemic disparities in care, as well as documentation habits that encode racial or socioeconomic bias (Himmelstein, Bates, and Zhou 2022). Teams working downstream may have little visibility into how these records were created, and few levers for addressing embedded harms.

²⁴ Electronic Health Records (EHR): Digital versions of patients' medical histories, used extensively in healthcare for data analysis and predictive modeling.
Improving dataset quality is often not the responsibility of any one team.
Data pipelines may be maintained by infrastructure or analytics groups that
operate independently of the ML engineering or model evaluation teams. This
organizational fragmentation makes it difficult to coordinate data audits, track
provenance, or implement feedback loops that connect model behavior to
underlying data issues. In practice, responsibility for dataset quality tends
to fall through the cracks—recognized as important, but rarely prioritized or
resourced.
Addressing these challenges requires long-term investment in infrastructure,
workflows, and cross-functional communication. Technical tools such as data
validation, automated audits, and dataset documentation frameworks (e.g.,
model cards, datasheets, or the Data Nutrition Project) can help, but only when
they are embedded within teams that have the mandate and support to act on
their findings. Ultimately, improving data quality is not just a matter of better
tooling—it is a question of how responsibility for data is assigned, shared, and
sustained across the system lifecycle.
turnover and team restructuring can erode institutional memory. Teams re-
sponsible for maintaining a deployed model may not be the ones who originally
developed or audited it, leading to unintentional misalignment between sys-
tem goals and current implementation. These issues are especially acute in
continual or streaming learning scenarios, where concept drift²⁷ and shifting data distributions demand active monitoring and real-time updates.

²⁷ Concept drift occurs when the statistical properties of the target variable change over time in unforeseen ways.

These challenges are magnified in multi-model systems and cross-platform deployments. A recommendation engine may consist of dozens of interacting models, each optimized for a different subtask or user segment. A voice assistant
deployed across mobile and edge environments may maintain different versions
of the same model, tuned to local hardware constraints. Coordinating updates,
ensuring consistency, and sustaining responsible behavior in such distributed
systems requires infrastructure that tracks not only code and data, but also
values and constraints.
Addressing scalability and maintenance challenges requires treating respon-
sible AI as a lifecycle property, not a one-time evaluation. This means embed-
ding audit hooks, metadata tracking, and monitoring protocols into system
infrastructure. It also means creating documentation that persists across team
transitions, defining accountability structures that survive project handoffs,
and ensuring that system updates do not inadvertently erase hard-won im-
provements in fairness, transparency, or safety. While such practices can be
difficult to implement retroactively, they can be integrated into system design
from the outset through responsible-by-default tooling and workflows.
Ultimately, responsibility must scale with the system. Machine learning mod-
els deployed in real-world environments must not only meet ethical standards
at launch, but continue to do so as they grow in complexity, user reach, and
operational scope. Achieving this requires sustained organizational investment
and architectural planning—not simply technical correctness at a single point
in time.
mographic parity may not meet the requirements of equalized odds in another
domain or jurisdiction. Without shared standards, these evaluations remain ad
hoc, making it difficult to establish confidence in a system's responsible behavior
across contexts.
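For reference, the two criteria contrasted here can be computed directly; the sketch below assumes binary groups and binary predictions held in NumPy arrays, with function names chosen purely for illustration.

```python
import numpy as np

def demographic_parity_gap(y_pred, group):
    """Difference in positive-prediction rates between group 0 and group 1."""
    return abs(y_pred[group == 0].mean() - y_pred[group == 1].mean())

def equalized_odds_gap(y_true, y_pred, group):
    """Largest gap in true-positive or false-positive rate between the two groups."""
    def rates(g):
        yt, yp = y_true[group == g], y_pred[group == g]
        tpr = yp[yt == 1].mean() if (yt == 1).any() else 0.0
        fpr = yp[yt == 0].mean() if (yt == 0).any() else 0.0
        return tpr, fpr
    (tpr0, fpr0), (tpr1, fpr1) = rates(0), rates(1)
    return max(abs(tpr0 - tpr1), abs(fpr0 - fpr1))
```

A classifier can drive the first gap to zero while leaving the second large (or vice versa), which is precisely why a criterion chosen for one domain or jurisdiction may fail to satisfy another.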
Responsible AI evaluation also suffers from a mismatch between the unit of
analysis, which is frequently the individual model or batch job, and the level
of deployment, which includes end-to-end system components such as data
ingestion pipelines, feature transformations, inference APIs, caching layers, and
human-in-the-loop workflows. A system that appears fair or interpretable in
isolation may fail to uphold those properties once integrated into a broader
application. Tools that support holistic, system-level evaluation remain under-
developed, and there is little guidance on how to assess responsibility across
interacting components in modern ML stacks.
Further complicating matters is the lack of lifecycle-aware metrics. Most eval-
uation tools are applied at a single point in time—often just before deployment.
Yet responsible AI properties such as fairness and robustness are dynamic.
They depend on how data distributions evolve, how models are updated, and
how users interact with the system. Without continuous or periodic evaluation,
it is difficult to determine whether a system remains aligned with its intended
ethical goals after deployment. Post-deployment monitoring tools exist, but
they are rarely integrated with the development-time metrics used to assess
initial model quality. This disconnect makes it hard to detect drift in ethical
performance, or to trace observed harms back to their upstream sources.
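One minimal form of the recurring check argued for here is a scheduled two-sample test that compares recent production scores against a development-time reference; the SciPy call is standard, but the alpha threshold and the idea of routing alerts into the same review process that owns pre-deployment metrics are illustrative choices.

```python
from scipy.stats import ks_2samp

def check_drift(reference_scores, live_scores, alpha=0.01):
    """Flag drift between development-time and recent production score distributions."""
    statistic, p_value = ks_2samp(reference_scores, live_scores)
    return p_value < alpha, p_value  # (drifted?, evidence)
```

Run over a rolling window (for example, weekly), this links development-time evaluation to post-deployment behavior and gives observed harms a traceable upstream signal.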
Tool fragmentation further contributes to these challenges. Responsible AI
tooling is often distributed across disconnected packages, dashboards, or in-
ternal systems, each designed for a specific task or metric. A team may use
one tool for explainability, another for bias detection, and a third for compli-
ance reporting—with no unified interface for reasoning about system-level
tradeoffs. The lack of interoperability hinders collaboration between teams,
complicates documentation, and increases the risk that important evaluations
will be skipped or performed inconsistently. These challenges are compounded
by missing hooks for metadata propagation or event logging across components
like feature stores, inference gateways, and model registries.
Addressing these gaps requires progress on multiple fronts. First, shared
evaluation frameworks must be developed that define what it means for a
system to behave responsibly—not just in abstract terms, but in measurable,
auditable criteria that are meaningful across domains. Second, evaluation
must be extended beyond individual models to cover full system pipelines,
including user-facing interfaces, update policies, and feedback mechanisms.
Finally, evaluation must become a recurring lifecycle activity, supported by
infrastructure that tracks system behavior over time and alerts developers when
ethical properties degrade.
Without standardized, system-aware evaluation methods, responsible AI
remains a moving target—described in principles but difficult to verify in
practice. Building confidence in machine learning systems requires not only
better models and tools, but shared norms, durable metrics, and evaluation
practices that reflect the operational realities of deployed AI.
Figure 16.9: Misalignment between a recommender system's true objective (maximize user satisfaction) and its optimized reward function. The model optimizes for click-through rate, a proxy for satisfaction, leading to unintended behaviors such as promoting clickbait or addictive content. These behaviors are reinforced through feedback, illustrating the problem of reward hacking and the challenge of aligning ML systems with human values.
better be quite sure that the purpose put into the machine is the purpose which
we desire” (Wiener 1960).
As the capabilities of deep learning models have increasingly approached,
and, in certain instances, exceeded, human performance, the concern that
such systems may pursue unintended or undesirable goals has become more
pressing (S. Russell 2021). Within the field of AI safety, a central focus is the
problem of value alignment: how to ensure that machine learning systems act
in accordance with broad human intentions, rather than optimizing misaligned
proxies or exhibiting emergent behavior that undermines social goals. As
Russell argues in Human-Compatible Artificial Intelligence, much of current
AI research presumes that the objectives to be optimized are known and fixed,
focusing instead on the effectiveness of optimization rather than the design of
objectives themselves.
Yet defining “the right purpose” for intelligent systems is especially dif-
ficult in real-world deployment settings. ML systems often operate within
dynamic environments, interact with multiple stakeholders, and adapt over
time. These conditions make it challenging to encode human values in static
objective functions or reward signals. Frameworks like Value Sensitive Design
aim to address this challenge by providing formal processes for eliciting and
integrating stakeholder values during system design.
Taking a holistic sociotechnical perspective, which accounts for both the
algorithmic mechanisms and the contexts in which systems operate, is essential
for ensuring alignment. Without this, intelligent systems may pursue narrow
performance objectives (e.g., accuracy, engagement, or throughput) while pro-
ducing socially undesirable outcomes. Achieving robust alignment under such
conditions remains an open and critical area of research in ML systems.
The absence of alignment can give rise to well-documented failure modes,
particularly in systems that optimize complex objectives. In reinforcement
learning (RL), for example, models often learn to exploit unintended aspects
of the reward function—a phenomenon known as specification gaming or
reward hacking. Such failures arise when variables not explicitly included in
the objective are manipulated in ways that maximize reward while violating
human intent.
A particularly influential approach in recent years has been reinforcement
learning from human feedback (RLHF), where large pre-trained models are
fine-tuned using human-provided preference signals (Christiano et al. 2017).
While this method improves alignment over standard RL, it also introduces new
risks. Ngo (Ngo, Chan, and Mindermann 2022) identifies three potential failure
modes introduced by RLHF: (1) situationally aware reward hacking, where
models exploit human fallibility; (2) the emergence of misaligned internal goals
that generalize beyond the training distribution; and (3) the development of
power-seeking behavior that preserves reward maximization capacity, even at
the expense of human oversight.
These concerns are not limited to speculative scenarios. Amodei et al. (2016)
outline six concrete challenges for AI safety: (1) avoiding negative side effects
during policy execution, (2) mitigating reward hacking, (3) ensuring scalable
oversight when ground-truth evaluation is expensive or infeasible, (4) designing
safe exploration strategies that promote creativity without increasing risk, (5)
achieving robustness to distributional shift in testing environments, and (6)
maintaining alignment across task generalization. Each of these challenges
becomes more acute as systems are scaled up, deployed across diverse settings,
and integrated with real-time feedback or continual learning.
ening.” These perceptions, though decades old, remain relevant in the age of
machine learning systems. As the pace of innovation accelerates, responsible
AI development must be accompanied by clear and accurate scientific commu-
nication, especially concerning the capabilities, limitations, and uncertainties
of AI technologies.
As modern AI systems surpass layperson understanding and begin to influ-
ence high-stakes decisions, public narratives tend to polarize between utopian
and dystopian extremes. This is not merely a result of media framing, but of a
more fundamental difficulty: in technologically advanced societies, the outputs
of scientific systems are often perceived as magical—“understandable only in
terms of what it did, not how it worked” (Handlin 1965). Without scaffolding for
technical comprehension, systems like generative models, autonomous agents,
or large-scale recommender platforms can be misunderstood or mistrusted,
impeding informed public discourse.
Tech companies bear responsibility in this landscape. Overstated claims,
anthropomorphic marketing, or opaque product launches contribute to cycles
of hype and disappointment, eroding public trust. But improving AI literacy
requires more than restraint in corporate messaging. It demands systematic
research on scientific communication in the context of AI. Despite the soci-
etal impact of modern machine learning, an analysis of the Scopus scholarly
database found only a small number of papers that intersect the domains of
“artificial intelligence” and “science communication” (Schäfer 2023).
Addressing this gap requires attention to how narratives about AI are shaped—
not just by companies, but also by academic institutions, regulators, journalists,
non-profits, and policy advocates. The frames and metaphors used by these
actors significantly influence how the public perceives agency, risk, and control
in AI systems (Lindgren 2023). These perceptions, in turn, affect adoption, over-
sight, and resistance, particularly in domains such as education, healthcare, and
employment, where AI deployment intersects directly with lived experience.
From a systems perspective, public understanding is not an externality—
it is part of the deployment context. Misinformation about how AI systems
function can lead to overreliance, misplaced blame, or underutilization of safety
mechanisms. Equally, a lack of understanding of model uncertainty, data bias,
or decision boundaries can exacerbate the risks of automation-induced harm.
For individuals whose jobs are impacted by AI, targeted efforts to build domain-
specific literacy can also support reskilling and adaptation (Ng et al. 2021).
Ultimately, AI literacy is not just about technical fluency. It is about building
public confidence that the goals of system designers are aligned with societal
welfare—and that those building AI systems are not removed from public
values, but accountable to them. As Handlin observed in 1965: “Even those who
never acquire that understanding need assurance that there is a connection between the
goals of science and their welfare, and above all, that the scientist is not a man altogether
apart but one who shares some of their value.”
16.9 Conclusion
Responsible artificial intelligence is essential as machine learning systems in-
creasingly shape decisions in healthcare, employment, finance, and the justice
16.10 Resources
Slides
Videos
• Coming soon.
Exercises
• Coming soon.
Chapter 17
Sustainable AI
Purpose
How do environmental considerations influence the design and implementation
of machine learning systems, and what principles emerge from examining AI
through an ecological perspective?
Machine learning systems inherently require significant computational re-
sources, raising critical concerns about their environmental impact. Addressing
these concerns requires a deep understanding of how architectural decisions
affect energy consumption, resource utilization, and ecological sustainability.
Designers and engineers must consider the relationships between computa-
tional demands, resource utilization, and environmental consequences across
various system components. A systematic exploration of these considerations
helps identify key architectural principles and design strategies that harmonize
performance objectives with ecological stewardship.
Learning Objectives
17.1 Overview
Machine learning has become an essential driver of technological progress,
powering advancements across industries and scientific domains. However,
as AI models grow in complexity and scale, the computational demands re-
quired to train and deploy them have increased significantly, raising critical
concerns about sustainability. The environmental impact of AI extends beyond
energy consumption, encompassing carbon emissions, resource extraction, and
electronic waste. As a result, it is imperative to examine AI systems through
the lens of sustainability and assess the trade-offs between performance and
ecological responsibility.
Developing large-scale AI models, such as state-of-the-art language and
vision models, requires substantial computational power. Training a single
large model can consume thousands of megawatt-hours of electricity, equivalent
to powering hundreds of households for a month. Much of this energy is
supplied by data centers, which rely heavily on nonrenewable energy sources,
contributing to global carbon emissions. Estimates indicate that AI-related
emissions are comparable to those of entire industrial sectors, highlighting
the urgency of transitioning to more energy-efficient models and renewable-
powered infrastructure.
Beyond energy consumption, AI systems also impact the environment through
hardware manufacturing and resource utilization. Training and inference work-
loads depend on specialized processors, such as GPUs and TPUs, which require
rare earth metals whose extraction and processing generate significant pollution.
Additionally, the growing demand for AI applications accelerates electronic
waste production, as hardware rapidly becomes obsolete. Even small-scale AI
systems, such as those deployed on edge devices, contribute to sustainability
challenges, necessitating careful consideration of their lifecycle impact.
This chapter examines the sustainability challenges associated with AI sys-
tems and explores emerging solutions to mitigate their environmental footprint.
It discusses strategies for improving algorithmic efficiency, optimizing train-
ing infrastructure, and designing energy-efficient hardware. Additionally, it
considers the role of renewable energy sources, regulatory frameworks, and in-
dustry best practices in promoting sustainable AI development. By addressing
these challenges, the field can advance toward more ecologically responsible
AI systems while maintaining technological progress.
Figure: Data center power demand versus efficiency gains over time, with demand increasing. Source: Masanet et al. (2020), Cisco, IEA, Goldman Sachs Global Investment Research.
These trends underscore the need for more sustainable AI practices to mitigate the industry's
carbon impact.
Figure: Training carbon footprint (training footprint only) of large-scale ML models, including Facebook (Meta) production models (LM, RM-1 through RM-5) and open-source models (BERT-NAS, Evolved Transformer, T5, Meena, GShard-600B, Switch Transformer, GPT-3).
annual AI energy needs up to 1,000 times by 2030. So, while model optimization
tackles one facet, responsible innovation must also consider total lifecycle costs
at global deployment scales that were unfathomable just years ago but now
pose infrastructure and sustainability challenges ahead.
17.3.3.1 Scope 1
Scope 1 emissions refer to direct greenhouse gas emissions produced by AI data
centers and computing facilities. These emissions result primarily from on-site
power generation, including backup diesel generators used to ensure reliability
in large cloud environments, as well as facility cooling systems. Although many
AI data centers predominantly rely on grid electricity, those with their own
power plants or fossil-fuel-dependent backup systems contribute significantly
to direct emissions, especially in regions where renewable energy sources are
less prevalent (Masanet et al. 2020a).
17.3.3.2 Scope 2
Scope 2 emissions encompass indirect emissions from electricity purchased to
power AI infrastructure. The majority of AI’s operational energy consumption
falls under Scope 2, as cloud providers and enterprise computing facilities
require massive electrical inputs for GPUs, TPUs, and high-density servers. The
carbon intensity associated with Scope 2 emissions varies geographically based
on regional energy mixes. Regions dominated by coal and natural gas electricity
generation create significantly higher AI-related emissions compared to regions
utilizing renewable sources such as wind, hydro, or solar. This geographic
variability motivates companies to strategically position data centers in regions with lower-carbon electricity grids.
17.3.3.3 Scope 3
Scope 3 emissions constitute the largest and most complex category, captur-
ing indirect emissions across the entire AI supply chain and lifecycle. These
emissions originate from manufacturing, transportation, and disposal of AI
hardware, particularly semiconductors and memory modules. Semiconductor
manufacturing is particularly energy-intensive, involving complex processes
such as chemical etching, rare-earth metal extraction, and extreme ultraviolet
(EUV) lithography, all of which produce substantial carbon outputs. Indeed,
manufacturing a single high-performance AI accelerator can generate emis-
sions equivalent to several years of operational energy use (U. Gupta, Kim, et
al. 2022).
Beyond manufacturing, Scope 3 emissions include the downstream impact of
AI once deployed. AI services such as search engines, social media platforms,
and cloud-based recommendation systems operate at enormous scale, requiring
continuous inference across millions or even billions of user interactions. The
cumulative electricity demand of inference workloads can ultimately surpass
the energy used for training, further amplifying AI’s carbon impact. End-user
devices, including smartphones, IoT devices, and edge computing platforms,
also contribute to Scope 3 emissions, as their AI-enabled functionality depends
on sustained computation. Companies such as Meta and Google report that
Scope 3 emissions from AI-powered services make up the largest share of their
total environmental footprint, due to the sheer scale at which AI operates.
These massive facilities provide the infrastructure for training complex neural
networks on vast datasets. For instance, based on leaked information, OpenAI’s
language model GPT-4 was trained on Azure data centers packing over 25,000
Nvidia A100 GPUs, used continuously for over 90 to 100 days.
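A rough back-of-the-envelope estimate shows how such a run translates into energy and emissions. The per-GPU power draw, facility overhead (PUE), and grid carbon intensity below are all assumptions for illustration, not reported values.

```python
# Illustrative estimate only; actual figures for this training run were not disclosed.
gpus = 25_000                  # A100 GPUs (leaked figure cited above)
days = 95                      # midpoint of the reported 90 to 100 days
gpu_power_kw = 0.4             # assumed average draw per GPU, in kW
pue = 1.2                      # assumed data center power usage effectiveness
grid_kgco2_per_kwh = 0.4       # assumed grid carbon intensity (varies widely by region)

energy_kwh = gpus * gpu_power_kw * 24 * days * pue
emissions_tonnes = energy_kwh * grid_kgco2_per_kwh / 1000

print(f"~{energy_kwh / 1000:,.0f} MWh, ~{emissions_tonnes:,.0f} tCO2e")
# ~27,360 MWh and ~10,944 tCO2e under these assumptions; siting the same run on a
# low-carbon grid (e.g., 0.05 kgCO2/kWh) would cut the emissions figure roughly 8x.
```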
The GHG Protocol framework, illustrated in Figure 17.7, provides a struc-
tured way to visualize the sources of AI-related carbon emissions. Scope 1
emissions arise from direct company operations, such as data center power
generation and company-owned infrastructure. Scope 2 covers electricity pur-
chased from the grid, the primary source of emissions for cloud computing
workloads. Scope 3 extends beyond an organization’s direct control, including
emissions from hardware manufacturing, transportation, and even the end-user
energy consumption of AI-powered services. Understanding this breakdown
allows for more targeted sustainability strategies, ensuring that efforts to re-
duce AI's environmental impact are not solely focused on energy efficiency but
also address the broader supply chain and lifecycle emissions that contribute
significantly to the industry’s carbon footprint.
However, running inference at the edge does not eliminate energy concerns—
especially when AI is deployed at scale. Autonomous vehicles, for instance,
require millisecond-latency AI inference, meaning cloud processing is imprac-
tical. Instead, vehicles are now being equipped with onboard AI accelerators
that function as “data centers on wheels” (Sudhakar, Sze, and Karaman 2023).
These embedded computing systems process real-time sensor data equivalent
to small data centers, consuming significant power even without relying on
cloud inference.
Similarly, consumer devices such as smartphones, wearables, and IoT sen-
sors individually consume relatively little power but collectively contribute
significantly to global energy use due to their sheer numbers. Therefore, the
efficiency benefits of edge computing must be balanced against the extensive
scale of device deployment.
of its water, significantly reducing its environmental footprint (see Figure 17.9
showing the typical semiconductor fab water cycle).
critical material, with global supplies expected to last fewer than 15 years at
the current rate of consumption (M. Davies 2011).
Another major concern is helium, a noble gas critical for semiconductor
cooling, plasma etching, and EUV lithography⁴ used in next-generation chip production. Helium is unique in that once released into the atmosphere, it escapes Earth's gravity and is lost forever, making it a non-renewable resource (M. Davies 2011). The semiconductor industry is one of the largest consumers of helium, and supply shortages have already led to price spikes and disruptions in fabrication processes. As AI hardware manufacturing scales, the demand for helium will continue to grow, necessitating more sustainable extraction and recycling practices.

⁴ Extreme ultraviolet (EUV) lithography: A cutting-edge semiconductor manufacturing technique that uses EUV light to etch nanoscale features on silicon wafers. EUV lithography is essential for producing advanced AI chips with smaller transistors and higher performance.

Beyond raw material availability, the geopolitical control of rare earth elements poses additional challenges. China currently dominates over 90% of
the world’s rare earth element (REE) refining capacity, including materials
essential for AI chips, such as neodymium (for high-performance magnets in
AI accelerators) and yttrium (for high-temperature superconductors) (A. R. Jha
2014). This concentration of supply creates supply chain vulnerabilities, as
trade restrictions or geopolitical tensions could severely impact AI hardware
production.
Table 17.1 highlights the key materials essential for AI semiconductor manu-
facturing, their applications, and supply concerns.
Table 17.1: Rare materials widely used in the semiconductor industry that are facing resource depletion.

| Material | Application in AI Semiconductor Manufacturing | Supply Concerns |
|---|---|---|
| Silicon (Si) | Primary substrate for chips, wafers, transistors | Processing constraints; geopolitical risks |
| Gallium (Ga) | GaN-based power amplifiers, high-frequency components | Limited availability; byproduct of aluminum and zinc production |
| Germanium (Ge) | High-speed transistors, photodetectors, optical interconnects | Scarcity; geographically concentrated |
| Indium (In) | Indium Tin Oxide (ITO) layers, optoelectronics | Limited global reserves; recycling dependency; geopolitical concentration |
| Tantalum (Ta) | Capacitors, stable integrated components | Conflict mineral; vulnerable supply chains |
| Rare Earth Elements (REEs) | Magnets, sensors, high-performance electronics | High geopolitical risks; environmental extraction concerns |
| Cobalt (Co) | Batteries for edge computing devices | Geographical concentration (Congo); human rights concerns |
| Tungsten (W) | Interconnects, barriers, heat sinks | Limited production sites; geopolitical concerns |
| Copper (Cu) | Conductive pathways, interconnects, wiring | Limited high-purity sources; geopolitical dependencies; limited recycling capacity |
| Helium (He) | Semiconductor cooling, plasma etching, EUV lithography | Non-renewable; irretrievable atmospheric loss; limited extraction capacity |
The rapid growth of AI and semiconductor demand has accelerated the de-
pletion of these critical resources, creating an urgent need for material recycling,
substitution strategies, and more sustainable extraction methods. Some efforts
Current systems capture only 17.4% of global e-waste, leaving the majority to be
discarded in landfills or improperly processed (Singh and Ogunseitan 2022).
Addressing the hazardous waste impact of AI requires advancements in both
semiconductor manufacturing and e-waste recycling. Companies are exploring
closed-loop recycling for rare metals, improved chemical treatment processes,
and alternative materials with lower toxicity. However, as AI models continue
to drive demand for higher-performance chips and larger-scale computing
infrastructure, the industry’s ability to manage its waste footprint will be a key
factor in achieving sustainable AI development.
Figure: AI system life cycle analysis.
The following sections will analyze each lifecycle phase in detail, exploring
its specific environmental impacts and sustainability challenges.
Table 17.2: Estimated carbon emissions associated with training various AI models, based on computational requirements and energy consumption. Source: Adapted from (D. Patterson, Gonzalez, Holzle, et al. 2022; Strubell, Ganesh, and McCallum 2019b). Columns: AI Model; Training Compute (FLOPs); Estimated CO2 Emissions (kg); Equivalent Car Miles Driven.
The demand for water in semiconductor fabs has also raised concerns about
regional water stress. The TSMC⁵ fab in Arizona is projected to consume 8.9 million gallons per day, a figure that accounts for nearly 3% of the city's water supply. While some fabs have begun investing in water recycling systems, these efforts remain insufficient to offset the growing demand.

⁵ Taiwan Semiconductor Manufacturing Company (TSMC) is one of the world's largest semiconductor fabs, consuming millions of gallons of water daily in chip production, raising concerns about water scarcity.

17.5.2.4 Sustainable Initiatives

Recognizing the sustainability challenges of semiconductor manufacturing,
industry leaders have started implementing initiatives to reduce energy con-
sumption, waste generation, and emissions. Companies like Intel, TSMC, and
Samsung have pledged to transition towards carbon-neutral semiconductor
fabrication through several key approaches. Many fabs are incorporating renew-
able energy sources, with facilities in Taiwan and Europe increasingly powered
by hydroelectric and wind energy. Water conservation efforts have expanded
through closed-loop recycling systems that reduce dependence on local water
supplies. Manufacturing processes are being redesigned with eco-friendly
etching and lithography techniques that minimize hazardous waste generation.
Additionally, companies are developing energy-efficient chip architectures,
such as low-power AI accelerators optimized for performance per watt, to re-
duce the environmental impact of both manufacturing and operation. Despite
these efforts, the overall environmental footprint of AI chip manufacturing
continues to grow as demand for AI accelerators escalates. Without significant
improvements in material efficiency, recycling, and fabrication techniques, the
manufacturing phase will remain a major contributor to AI’s sustainability
challenges.
The manufacturing phase of AI hardware represents one of the most resource-
intensive and environmentally impactful aspects of AI’s lifecycle. The extraction
of critical materials, high-energy fabrication processes, and hazardous waste
generation all contribute to AI’s growing carbon footprint. While industry
efforts toward sustainable semiconductor manufacturing are gaining momen-
tum, scaling these initiatives to meet rising AI demand remains a significant
challenge.
Addressing the sustainability of AI hardware will require a combination
of material innovation, supply chain transparency, and greater investment
in circular economy models that emphasize chip recycling and reuse. As AI
systems continue to advance, their long-term viability will depend not only
on computational efÏciency but also on reducing the environmental burden of
their underlying hardware infrastructure.
highlights the core of the Jevons Paradox in AI: efficiency alone is not sufficient to
guarantee sustainability.
Figure 17.12: The Jevons Paradox in AI efficiency improvements and usage. A large drop in the cost of AI services (for example, a 50% reduction driven by efficiency gains) triggers a demand response along the curve for AI usage: consumption more than doubles, so total costs and energy use end up higher.
However, the Jevons Paradox suggests that even highly efficient data centers
could contribute to increased consumption if they enable a massive expansion of
AI-driven services. Optimizing the energy efficiency of data centers is critical to
reducing the environmental impact of AI, but efficiency alone is not enough. We
must also consider strategies for limiting the growth of data center capacity. The
integration of renewable energy, the adoption of advanced cooling solutions,
and the use of AI-driven optimizations can significantly decrease the carbon
footprint of AI infrastructure. As AI continues to scale, these innovations will
play a central role in ensuring that machine learning remains aligned with
sustainability goals.
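A tiny worked example (illustrative numbers only) makes the rebound effect concrete: even if efficiency improvements halve the energy per request, total consumption still grows if cheaper AI services cause usage to triple.

```python
# Rebound-effect arithmetic with assumed numbers.
energy_per_request_before, requests_before = 1.0, 1_000_000
energy_per_request_after, requests_after = 0.5, 3_000_000   # 2x efficiency, 3x demand

total_before = energy_per_request_before * requests_before  # 1,000,000 units
total_after = energy_per_request_after * requests_after     # 1,500,000 units
print(total_after / total_before)                           # 1.5: consumption rises 50%
```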
source that produces no direct carbon emissions but lacks the flexibility to
accommodate renewable energy variability. Tech companies like Microsoft
have shown interest in nuclear energy to power their data centers, as their more
constant demand profile (compared to residential use) aligns well with nuclear
generation characteristics.
Beyond scheduling, optimizing inference sustainability requires complemen-
tary hardware and software innovations. Model quantization techniques enable
lower-precision arithmetic to significantly cut power consumption without sac-
rificing accuracy (A. et al. Gholami 2021). Knowledge distillation methods
allow compact, energy-efficient models to replicate the performance of larger,
resource-intensive networks (Hinton, Vinyals, and Dean 2015b). Coupled with
specialized inference accelerators like Google’s TPUs, these approaches sub-
stantially reduce inference’s environmental impact.
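As a concrete, hedged illustration of the quantization approach mentioned above, PyTorch's dynamic quantization converts the linear layers of a trained network to 8-bit integer arithmetic; the model here is a stand-in, and the actual energy savings depend on the target hardware rather than anything shown in the code.

```python
import torch
import torch.nn as nn

# Stand-in for a trained model that is about to be deployed for inference.
model = nn.Sequential(nn.Linear(256, 128), nn.ReLU(), nn.Linear(128, 10))
model.eval()

# Replace Linear layers with int8 dynamically quantized versions to cut compute
# and memory traffic at inference time.
quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

with torch.no_grad():
    out = quantized(torch.randn(1, 256))   # same interface, lower-precision arithmetic
print(out.shape)
```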
Software frameworks specifically designed for energy efficiency also play
a crucial role. Energy-aware AI frameworks, such as Zeus (You, Chung, and
Chowdhury 2023) and Perseus (Chung et al. 2023), balance computational
speed and power efficiency during both training and inference. These platforms
optimize model execution by analyzing trade-offs between speed and energy
consumption, facilitating widespread adoption of energy-efficient AI strategies,
particularly for inference operations that must run continuously at scale.
Figure 17.14: Number of Internet of Things (IoT) connected devices worldwide, growing from roughly 8.6 billion in 2019 to a projected 29.4 billion by 2030 (values beyond 2023 are projections). Source: Statista.
While AI-powered data centers have been scrutinized for their carbon foot-
print and energy demands, far less attention has been paid to the environmental
cost of embedding AI into billions of short-lived devices. Addressing this chal-
lenge requires rethinking how AI hardware is designed, manufactured, and
disposed of, ensuring that edge AI systems contribute to technological progress
without leaving behind an unsustainable legacy of waste.
can mitigate the negative environmental impacts of AI at the edge while still
enabling technological progress.
This section explores the various policy tools available for mitigating AI’s
environmental impact, analyzing the role of governments, regulatory bod-
ies, and industry-led efforts. By examining both mandatory and voluntary
approaches, we assess how regulations can drive AI sustainability without
impeding technological progress.
data centers and cloud platforms can help ease this burden while still improving
visibility into AI’s environmental footprint.
To be most constructive, measurement and reporting policies should focus on
enabling continuous refinement rather than imposing simplistic restrictions or
rigid caps. Given AI’s rapid evolution, regulations that incorporate flexibility
while embedding sustainability into evaluation metrics will be most effective in
driving meaningful reductions in energy consumption and emissions. Rather
than stifling innovation, well-designed policies can encourage AI developers to
prioritize efficiency from the outset, fostering a culture of responsible AI design
that aligns with long-term sustainability goals.
17.8.4 Self-Regulation
While government policies play a crucial role in shaping sustainable AI prac-
tices, the AI industry itself has the power to drive significant environmental
improvements through self-regulation. Many leading AI companies and re-
search organizations have already adopted voluntary commitments to reduce
their carbon footprints, improve energy efficiency, and promote sustainable
development. These efforts can complement regulatory policies and, in some
cases, even set higher standards than those mandated by governments.
One of the most visible self-regulation strategies is the commitment by major
AI companies to operate on renewable energy. Companies like Google, Mi-
crosoft, Amazon, and Meta have pledged to procure enough clean energy to
match 100% of their electricity consumption. Google has gone further by aiming
for 24/7 Carbon-Free Energy by ensuring that its data centers run exclusively
on renewables every hour of every day. These commitments not only reduce
operational emissions but also create market demand for renewable energy,
accelerating the transition to a greener grid. However, as seen with the use of
Another key area of global concern is AI hardware supply chains and elec-
tronic waste management. The production of AI accelerators, GPUs, and data
center hardware depends on a complex network of raw material extraction,
semiconductor fabrication, and electronic assembly spanning multiple conti-
nents. The environmental impact of this supply chain, which includes rare-earth
mineral mining in Africa, chip manufacturing in Taiwan, and final assembly in
China, often falls outside the jurisdiction of AI companies themselves. This un-
derscores the need for international agreements on sustainable semiconductor
production, responsible mining practices, and e-waste recycling policies.
The Basel Convention¹⁴, which regulates hazardous waste exports, could provide a model for addressing AI-related e-waste challenges at a global scale. The convention restricts the transfer of toxic electronic waste from developed nations to developing countries, where unsafe recycling practices can harm workers and pollute local ecosystems. Expanding such agreements to cover AI-specific hardware components, such as GPUs and inference chips, could ensure that end-of-life disposal is handled responsibly rather than outsourced to regions with weaker environmental protections.

¹⁴ Basel Convention: An international treaty regulating the transboundary movement of hazardous waste to prevent its disposal in countries with weaker environmental protections.
International collaboration in AI sustainability is not just about mitigating
harm but also leveraging AI as a tool for environmental progress. AI models are
already being deployed for climate forecasting, renewable energy optimization,
and precision agriculture, demonstrating their potential to contribute to global
sustainability goals. Governments, research institutions, and industry leaders
must align on best practices for scaling AI solutions that support climate action,
ensuring that AI is not merely a sustainability challenge but also a powerful
tool for global environmental resilience.
Ultimately, sustainable AI requires a coordinated global approach that in-
tegrates regulatory alignment, standardized sustainability reporting, energy
decarbonization, supply chain accountability, and responsible e-waste man-
agement. Without such collaboration, regional disparities in AI governance
could hinder meaningful progress, allowing inefficiencies and externalized
environmental costs to persist. As AI continues to evolve, establishing global
frameworks that balance technological advancement with environmental re-
sponsibility will be critical in shaping an AI-driven future that is not only
intelligent but also sustainable.
17.9.1 AI Awareness
Public understanding of AI and its role in sustainability remains limited, often
shaped by media narratives that highlight either its transformative potential
or its risks. Surveys such as the Pew Research Center poll found that while
a majority of people have heard of AI, their understanding of its specific ap-
plications, especially in the context of sustainability, remains shallow. Many
associate AI with automation, recommendation systems, or chatbots but may
not be aware of its broader implications in climate science, energy optimization,
and environmental monitoring.
A key factor influencing public perception is the framing of AI’s sustainability
contributions. Optimistic portrayals emphasize AI’s ability to enhance renew-
able energy integration, improve climate modeling accuracy, and enable smart
infrastructure for reduced emissions. Organizations such as Climate Change
AI actively promote AI’s potential in environmental applications, fostering a
positive narrative. Conversely, concerns about AI’s energy-intensive training
processes, ethical considerations, and potential biases contribute to skepticism.
Studies analyzing public discourse on AI sustainability reveal an even split
between optimism and caution, with some fearing that AI’s environmental
costs may outweigh its benefits.
In many cases, public attitudes toward AI-driven sustainability efforts are
shaped by trust in institutions. AI systems deployed by reputable environmental
organizations or in collaboration with scientific communities tend to receive
more favorable reception. However, corporate-led AI sustainability initiatives
often face skepticism, particularly if they are perceived as greenwashing—
a practice where companies exaggerate their commitment to environmental
responsibility without substantial action.
To foster informed public engagement, increasing AI literacy is crucial. This
involves education on AI’s actual energy consumption, potential for optimiza-
tion, and real-world applications in sustainability. Universities, research institu-
tions, and industry leaders can play a pivotal role in making AI’s sustainability
impact more accessible to the general public through open reports, interactive
tools, and clear communication strategies.
and the Partnership on AI have taken steps to provide resources and guidance
on using AI for environmental applications, but more widespread efforts are
needed to democratize access.
Funding mechanisms also play a critical role in determining who benefits
from AI-driven sustainability. While large corporations and well-funded re-
search institutions can afford to invest in AI-powered environmental solutions,
smaller organizations often lack the necessary financial resources. Government
grants, philanthropic funding, and international AI-for-good initiatives could
help ensure that grassroots sustainability efforts can leverage AI technologies.
For instance, Spain has allocated 300 million euros specifically for AI and sus-
tainability projects, setting a precedent for public investment in environmentally
responsible AI innovation. Expanding such funding models globally could
foster more inclusive AI adoption.
Beyond technical and financial barriers, policy interventions are necessary to
ensure that AI sustainability efforts are equitably distributed. Without regula-
tory frameworks that prioritize inclusion, AI-driven environmental solutions
may disproportionately benefit regions with existing technological advantages
while neglecting areas with the most pressing sustainability challenges. Gov-
ernments and international bodies should establish policies that encourage
equitable AI adoption, such as requiring AI sustainability projects to consider
social impact assessments or mandating transparent reporting on AI-driven
environmental initiatives.
Ensuring equitable access to AI for sustainability is not merely a technical
challenge but a fundamental issue of environmental justice. As AI continues to
shape global sustainability efforts, proactive measures must be taken to prevent
technology from reinforcing existing inequalities. By investing in AI infras-
tructure, localizing AI applications, supporting capacity-building efforts, and
implementing inclusive policies, AI can become a tool that empowers all com-
munities in the fight against climate change and environmental degradation.
17.10.2 Challenges
Despite these promising directions, significant obstacles must be addressed to
make AI truly sustainable. One of the most pressing challenges is the lack of
standardized measurement and reporting frameworks for evaluating AI’s envi-
ronmental footprint. Unlike traditional industries, where LCA methodologies
are well-established, AI systems require more comprehensive and adaptable
approaches that account for the full environmental impact of both hardware
(compute infrastructure) and software (model training and inference cycles).
While efforts such as MLCommons have begun integrating energy efficiency
into benchmarking practices, a broader, globally recognized standard is neces-
sary to ensure consistency in reporting AI-related emissions.
Another critical challenge is optimizing AI infrastructure for longevity and
sustainability. AI accelerators and data center hardware must be designed with
maximized utilization, extended operational lifespans, and minimal environ-
mental impact in mind. Unlike conventional hardware refresh cycles, which
often prioritize performance gains over sustainability, future AI infrastructure must treat long service lifetimes, high utilization, and low environmental impact as primary design goals.
17.11 Conclusion
The integration of AI into environmental sustainability presents both immense
opportunities and formidable challenges. As AI systems continue to scale in
complexity and influence, their environmental footprint must be addressed
through energy-efficient design, responsible infrastructure deployment, trans-
parent accountability measures, and policy-driven interventions. While AI
offers powerful capabilities for climate modeling, emissions reduction, resource
optimization, and biodiversity conservation, its reliance on compute-intensive
hardware, large-scale data processing, and energy-hungry model training ne-
cessitates a careful balance between progress and sustainability.
This chapter has explored the full lifecycle impact of AI systems, from their
carbon footprint and energy consumption to hardware manufacturing, e-waste
concerns, and the role of embedded AI in the growing “Internet of Trash.” We
have examined strategies for mitigating AI’s environmental impact, includ-
ing advances in green AI infrastructure, energy-aware model optimization,
and lifecycle-aware AI development. Additionally, we have highlighted the
importance of policy and regulatory frameworks in shaping a sustainable AI
ecosystem, emphasizing the need for measurement and reporting mandates,
incentive structures, and governance mechanisms that align AI innovation with
long-term environmental goals.
Public perception and engagement remain central to the discourse on AI and
sustainability. Transparent AI practices, explainable models, and ethical gover-
nance frameworks will be key to fostering trust and ensuring that AI solutions
are inclusive, equitable, and accountable. The responsible deployment of AI
in sustainability efforts must incorporate stakeholder input, interdisciplinary
collaboration, and a commitment to minimizing unintended consequences.
Looking ahead, the path toward sustainable AI requires continuous advance-
ments in hardware efficiency, carbon-aware computing, renewable energy inte-
gration, and equitable access to AI resources. Overcoming challenges such as
data gaps, inconsistent environmental reporting, and planned obsolescence in
AI hardware will require collective efforts from AI researchers, environmental
scientists, policymakers, and industry leaders. By embedding sustainability at
the core of AI development, we can ensure that AI not only accelerates techno-
logical progress but also contributes meaningfully to a more sustainable and
resilient future.
AI has the potential to be a force for good in the fight against climate change
and resource depletion, but its long-term impact depends on the choices we
make today. Through innovation, regulation, and collective responsibility, AI
can evolve as a technology that enhances environmental sustainability rather
than exacerbating ecological strain. The decisions made by AI practitioners,
policymakers, and society at large will shape whether AI serves as a tool for
sustainable progress or an unchecked driver of environmental harm. The
imperative now is to act deliberately, designing AI systems that align with global
sustainability goals and contribute to a future where technological advancement
and ecological well-being coexist harmoniously.
17.12 Resources
Slides
• Coming soon.
Videos
• Coming soon.
Exercises
• Coming soon.
Chapter 18
Robust AI
How do we develop fault-tolerant and resilient machine learning systems for real-world
deployment?
The integration of machine learning systems into real-world applications
demands fault-tolerant execution. However, these systems are inherently vul-
nerable to a spectrum of challenges that can degrade their capabilities. From
subtle hardware anomalies to sophisticated adversarial attacks and the un-
predictable nature of real-world data, the potential for failure is ever-present.
This reality underscores the need to fundamentally rethink how AI systems are
designed and deployed, placing robustness and trustworthiness at the forefront.
Building resilient machine learning systems is not merely a technical objective;
it is a foundational requirement for ensuring their safe and effective operation
in dynamic and uncertain environments.
18.1 Overview
As ML systems become increasingly integrated into various domains, ranging
from cloud-based services to edge devices and embedded systems, the impact of
hardware and software faults on their performance and reliability grows more
pronounced. Looking ahead, as these systems become more complex and are
deployed in safety-critical applications, the need for robust and fault-tolerant
designs becomes paramount.
ML systems are expected to play critical roles in autonomous vehicles, smart
cities, healthcare, and industrial automation. In these domains, the consequences of systemic failures, whether caused by hardware and software faults, malicious inputs such as adversarial attacks and data poisoning, or environmental shifts, can be severe, potentially resulting in loss of life, economic disruption, or environmental harm.
To address these risks, researchers and engineers must develop advanced
techniques for fault detection, isolation, and recovery, ensuring the reliable
operation of future ML systems.
Definition of Robust AI
We focus specifically on categories of faults and errors that can impact the
robustness of ML systems: errors arising from the underlying system, malicious
manipulation, and environmental changes.
18.2 Real-World Applications
18.2.1 Cloud
In February 2017, Amazon Web Services (AWS) experienced a significant outage
due to human error during routine maintenance. An engineer inadvertently
entered an incorrect command, resulting in the shutdown of multiple servers.
This outage disrupted many AWS services, including Amazon’s AI-powered
assistant, Alexa. As a consequence, Alexa-enabled devices, including Amazon
Echo and third-party products that utilize Alexa Voice Service, were unrespon-
sive for several hours. This incident underscores the impact of human error on
cloud-based ML systems and the importance of robust maintenance protocols
and failsafe mechanisms.
In another case (Vangal et al. 2021), Facebook encountered a silent data cor-
ruption (SDC) issue in its distributed querying infrastructure, illustrated in
Figure 18.2. SDC refers to undetected errors during computation or data trans-
fer that propagate silently through system layers. Facebook’s system processed
SQL-like queries across datasets and supported a compression application de-
signed to reduce data storage footprints. Files were compressed when not in
use and decompressed upon read requests. A size check was performed before
decompression to ensure the file was valid. However, an unexpected fault
occasionally returned a file size of zero for valid files, leading to decompres-
sion failures and missing entries in the output database. The issue appeared
sporadically, with some computations returning correct file sizes, making it
particularly difficult to diagnose.
Figure 18.2: A defective CPU in the query pipeline (Spark pre-shuffle data store, a math.pow() scaling step, and the compressed Spark shuffle-and-merge database) silently corrupts computations, producing missing rows in the output database.
This case illustrates how silent data corruption can propagate across multiple
layers of the application stack, resulting in data loss and application failures in
large-scale distributed systems. Left unaddressed, such errors can degrade ML
system performance. For example, corrupted training data or inconsistencies
in data pipelines due to SDC may compromise model accuracy and reliability.
Similar challenges have been reported by other major companies. As shown in
Figure 18.3, Jeff Dean, Chief Scientist at Google DeepMind and Google Research,
highlighted these issues in AI hypercomputers during a keynote at MLSys 2024.
18.2.2 Edge
In the edge computing domain, self-driving vehicles provide prominent exam-
ples of how faults can critically affect ML systems. These vehicles depend on
machine learning for perception, decision-making, and control, making them
particularly vulnerable to both hardware and software faults.
In May 2016, a fatal crash occurred when a Tesla Model S operating in Autopilot mode (Tesla's driver assistance system, which provides semi-autonomous capabilities like steering, braking, and acceleration while requiring active driver supervision) collided with a white semi-trailer truck. The system, relying on computer vision and ML algorithms, failed to distinguish the trailer against a bright sky, leading to a high-speed impact. The driver, reportedly distracted at the time, did not intervene, as shown in Figure 18.4. This incident raised serious concerns about the reliability of AI-based perception systems and emphasized
the need for robust failsafe mechanisms in autonomous vehicles. A similar
case occurred in March 2018, when an Uber self-driving test vehicle struck and
killed a pedestrian in Tempe, Arizona. The accident was attributed to a flaw in
the vehicle’s object recognition software, which failed to classify the pedestrian
as an obstacle requiring avoidance.
18.2.3 Embedded
Embedded systems operate in resource-constrained and often safety-critical
environments. As AI capabilities are increasingly integrated into these systems,
the complexity and consequences of faults grow significantly.
One example comes from space exploration. In 1999, NASA’s Mars Polar
Lander mission experienced a catastrophic failure due to a software error in
its touchdown detection system (Figure 18.5). The lander’s software misinter-
preted the vibrations from the deployment of its landing legs as a successful
touchdown, prematurely shutting off its engines and causing a crash. This
incident underscores the importance of rigorous software validation and robust
system design, particularly for remote missions where recovery is impossi-
ble. As AI becomes more integral to space systems, ensuring robustness and
reliability will be essential to mission success.
A related software-induced risk was identified in the Boeing 787 Dreamliner, prompting a 2015 FAA airworthiness directive:
“If the four main generator control units (associated with the engine-
mounted generators) were powered up at the same time, after 248 days of
continuous power, all four GCUs will go into failsafe mode at the same time,
resulting in a loss of all AC electrical power regardless of flight phase.” —
Federal Aviation Administration directive (2015)
• Permanent Faults: Permanent faults are persistent hardware defects that remain until the affected component is repaired or replaced. Common examples include stuck-at faults, where a signal or memory cell is permanently fixed at a specific value (e.g., always 0 or always 1), and device failures, such as
a malfunctioning processor or a damaged memory module. Permanent
faults can result in complete system failure or significant performance
degradation.
• Intermittent Faults: Intermittent faults are recurring faults that appear
and disappear intermittently. Unstable hardware conditions, such as loose
connections, aging components, or manufacturing defects, often cause
them. Intermittent faults can be challenging to diagnose and reproduce
because they may occur sporadically and under specific conditions. Exam-
ples include intermittent short circuits or contact resistance issues. These
faults can lead to unpredictable system behavior and sporadic errors.
Understanding this fault taxonomy and its relevance to both traditional com-
puting and ML systems provides a foundation for making informed decisions
when designing, implementing, and deploying fault-tolerant solutions. This
knowledge is crucial for improving the reliability and trustworthiness of com-
puting systems and ML applications.
18.3.1 Transient Faults
18.3.1.1 Characteristics
All transient faults are characterized by their short duration and non-permanent
nature. They do not persist or leave any lasting impact on the hardware. How-
ever, they can still lead to incorrect computations, data corruption, or system
misbehavior if not properly handled. A classic example is shown in Figure 18.6,
where a single bit in memory unexpectedly changes state, potentially altering
critical data or computations.
Some of the common types of transient faults include Single Event Upsets
(SEUs) caused by ionizing radiation, voltage fluctuations (Reddi and Gupta
2013) due to power supply noise or electromagnetic interference, Electromag-
netic Interference (EMI) induced by external electromagnetic fields, Electrostatic
Discharge (ESD) resulting from sudden static electricity flow, crosstalk caused
by unintended signal coupling, ground bounce triggered by simultaneous
switching of multiple outputs, timing violations due to signal timing constraint
breaches, and soft errors in combinational logic affecting the output of logic
circuits (Mukherjee, Emer, and Reinhardt, n.d.). Understanding these different
types of transient faults is crucial for designing robust and resilient hardware
systems that can mitigate their impact and ensure reliable operation.
18.3.1.2 Causes
Transient faults can be attributed to various external factors. One common cause
is cosmic rays—high-energy particles originating from outer space. When
these particles strike sensitive areas of the hardware, such as memory cells
or transistors, they can induce charge disturbances that alter the stored or transmitted data.
18.3.1.3 Mechanisms
Transient faults can manifest through different mechanisms depending on
the affected hardware component. In memory devices like DRAM or SRAM,
transient faults often lead to bit flips, where a single bit changes its value
from 0 to 1 or vice versa. This can corrupt the stored data or instructions. In
logic circuits, transient faults can cause glitches (momentary deviations in voltage, current, or signal, often causing incorrect operation) or voltage spikes propagating through the combinational logic (digital logic in which the output depends only on the current input states, not any past states), resulting in incorrect outputs or control signals. Transient faults can also affect communication channels, causing bit errors or packet losses during data transmission.
18.3.1.4 Impact on ML
A common example of a transient fault is a bit flip in the main memory. If an important data structure or critical instruction is stored in the affected memory
location, it can lead to incorrect computations or program misbehavior. For
instance, a bit flip in the memory storing a loop counter can cause the loop to execute an incorrect number of iterations, producing erroneous results.
During the inference phase, transient faults can impact the reliability and
trustworthiness of ML predictions. If a transient fault occurs in the memory
storing the trained model parameters or during the computation of inference
results, it can lead to incorrect or inconsistent predictions. For instance, a bit
flip in the activation values of a neural network can alter the final classification
or regression output (Mahmoud et al. 2020). In safety-critical applications,
such as autonomous vehicles or medical diagnosis, these faults can have severe
consequences, resulting in incorrect decisions or actions that may compromise
safety or lead to system failures (G. Li et al. 2017; S. Jha et al. 2019).
18.3.2 Permanent Faults
18.3.2.1 Characteristics
Permanent faults cause persistent and irreversible malfunctions in hardware
components. The faulty component remains non-operational until it is repaired
or replaced. These faults are consistent and reproducible, meaning the faulty be-
havior is observed every time the affected component is used. They can impact
processors, memory modules, storage devices, or interconnects—potentially
leading to system crashes, data corruption, or complete system failure.
One notable example of a permanent fault is the Intel FDIV bug, discovered in
1994. This flaw affected the floating-point division (FDIV) units of certain Intel
Pentium processors, causing incorrect results for specific division operations
and leading to inaccurate calculations.
The FDIV bug occurred due to an error in the lookup table (a data structure used to replace a runtime computation with a simpler array indexing operation) used by the division unit. In rare cases, the processor would fetch an incorrect value, resulting in a slightly less precise result than expected. For instance, Figure 18.9 shows the fraction 4195835/3145727 plotted on a Pentium processor with the FDIV fault. The triangular regions highlight where erroneous calculations occurred. Ideally, all correct values would round to 1.3338, but the faulty results showed 1.3337, indicating a mistake in the 5th digit.
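As a quick sanity check on these reported numbers, the short sketch below compares the correct quotient with the commonly cited faulty Pentium result (the flawed value is quoted from public accounts of the bug, not recomputed here):

```python
# Numeric check of the documented FDIV example; the flawed output is a
# literal taken from public reports of the bug, included only for comparison.
correct = 4195835 / 3145727
flawed_pentium_result = 1.33373906          # widely reported faulty output
print(round(correct, 4))                    # 1.3338
print(round(flawed_pentium_result, 4))      # 1.3337 -> error in the 5th digit
```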
Although the error was small, it could compound across many operations,
significantly affecting results in precision-critical applications such as scientific
simulations, financial calculations, and computer-aided design. The bug ulti-
mately led to incorrect outcomes in these domains and underscored the severe
consequences permanent faults can have.
The FDIV bug serves as a cautionary tale for ML systems. In such systems,
permanent faults in hardware components can result in incorrect computations,
impacting model accuracy and reliability. For example, if an ML system relies
on a processor with a faulty floating-point unit, similar to the FDIV bug, it could silently produce subtly incorrect results that accumulate and degrade model accuracy.
18.3.2.2 Causes
Permanent faults can arise from two primary sources: manufacturing defects
and wear-out mechanisms. Manufacturing defects are flaws introduced dur-
ing the fabrication process, including improper etching, incorrect doping, or
contamination. These defects may result in non-functional or partially func-
tional components. In contrast, wear-out mechanisms occur over time due to
prolonged use and operational stress. Phenomena like electromigration (the movement of metal atoms in a conductor under the influence of an electric field), oxide breakdown (the failure of an oxide layer in a transistor due to excessive electric field stress), and thermal stress (degradation caused by repeated cycling through high and low temperatures) degrade component integrity, eventually leading to permanent failure.
18.3.2.3 Mechanisms
Permanent faults manifest through several mechanisms, depending on their nature and location. A common example is the stuck-at fault (Seong et al. 2010), where a signal or memory cell becomes permanently fixed at either 0 or 1, regardless of the intended input, as shown in Figure 18.10. This type of fault can occur in logic gates, memory cells, or interconnects and typically results in incorrect computations or persistent data corruption.
Other mechanisms include device failures, in which hardware components
such as transistors or memory cells cease functioning entirely due to manufac-
turing defects or degradation over time. Bridging faults, which occur when
two or more signal lines are unintentionally connected, can introduce short
circuits or incorrect logic behaviors that are difficult to isolate.
Figure 18.10: Stuck-at fault model in digital circuits, marking stuck-at-0 (SA0) and stuck-at-1 (SA1) fault sites. Source: Accendo Reliability
In more subtle cases, delay faults can arise when the propagation time of
a signal exceeds the allowed timing constraints. Although the logical values
may be correct, the violation of timing expectations can still result in erroneous
behavior. Similarly, interconnect faults, including open circuits caused by bro-
ken connections, high-resistance paths that impede current flow, and increased
capacitance that distorts signal transitions, can significantly degrade circuit
performance and reliability.
Memory subsystems are particularly vulnerable to permanent faults. Tran-
sition faults can prevent a memory cell from successfully changing its state,
while coupling faults result from unwanted interference between adjacent cells,
leading to unintentional state changes. Additionally, neighborhood pattern
sensitive faults occur when the state of a memory cell is incorrectly influenced
by the data stored in nearby cells, reflecting a more complex interaction between
circuit layout and logic behavior.
Finally, permanent faults can also occur in critical infrastructure components
such as the power supply network or clock distribution system. Failures in
these subsystems can affect circuit-wide functionality, introduce timing errors,
or cause widespread operational instability.
Taken together, these mechanisms illustrate the varied and often complex
ways in which permanent faults can undermine the behavior of computing
systems. For ML applications in particular, where correctness and consistency
are vital, understanding these fault modes is essential for developing resilient
hardware and software solutions.
18.3.2.4 Impact on ML
Permanent faults can severely disrupt the behavior and reliability of computing
systems. For example, a stuck-at fault in a processor’s arithmetic logic unit
(ALU) can produce persistent computational errors, leading to incorrect pro-
gram behavior or crashes. In memory modules, such faults may corrupt stored
data, while in storage devices, they can result in bad sectors or total data loss.
Interconnect faults may interfere with data transmission, leading to system
hangs or corruption.
For ML systems, these faults pose significant risks in both the training and
inference phases. During training, permanent faults in processors or memory can corrupt computations or stored parameters, undermining convergence and final model accuracy.
18.3.3 Intermittent Faults
18.3.3.1 Characteristics
Intermittent faults are defined by their sporadic and non-deterministic behavior.
They occur irregularly and may manifest for short durations, disappearing
without a consistent pattern. Unlike permanent faults, they do not appear
every time the affected component is used, which makes them particularly
difficult to detect and reproduce. These faults can affect a variety of hardware
components, including processors, memory modules, storage devices, and
interconnects. As a result, they may lead to transient errors, unpredictable
system behavior, or data corruption.
Their impact on system reliability can be significant. For instance, an inter-
mittent fault in a processor’s control logic may disrupt the normal execution
path, causing irregular program flow or unexpected system hangs. In memory
modules, such faults can alter stored values inconsistently, leading to errors
that are difficult to trace. Storage devices affected by intermittent faults may
suffer from sporadic read/write errors or data loss, while intermittent faults
in communication channels can cause data corruption, packet loss, or unsta-
ble connectivity. Over time, these failures can accumulate, degrading system
performance and reliability (Rashid, Pattabiraman, and Gopalakrishnan 2015).
18.3.3.2 Causes
The causes of intermittent faults are diverse, ranging from physical degradation
to environmental influences. One common cause is the aging and wear-out of
electronic components. As hardware endures prolonged operation, thermal
cycling, and mechanical stress, it may develop cracks, fractures, or fatigue that
introduce intermittent faults. For instance, solder joints in ball grid arrays
(BGAs) or flip-chip packages can degrade over time, leading to intermittent
open circuits or short circuits.
Manufacturing defects and process variations can also introduce marginal
components that behave reliably under most circumstances but fail intermit-
tently under stress or extreme conditions. For example, Figure 18.12 shows
a residue-induced intermittent fault in a DRAM chip that leads to sporadic
failures.
18.3.3.3 Mechanisms
Intermittent faults can manifest through various physical and logical mecha-
nisms depending on their root causes. One such mechanism is the intermittent
open or short circuit, where physical discontinuities or partial connections
cause signal paths to behave unpredictably. These faults may momentarily
disrupt signal integrity, leading to glitches or unexpected logic transitions.
Another common mechanism is the intermittent delay fault (J. Zhang et
al. 2018), where signal propagation times fluctuate due to marginal timing
conditions, resulting in synchronization issues and incorrect computations. In
memory cells or registers, intermittent faults can appear as transient bit flips or
soft errors, corrupting data in ways that are difficult to detect or reproduce. Be-
cause these faults are often condition-dependent, they may only emerge under
specific thermal, voltage, or workload conditions, adding further complexity to
their diagnosis.
18.3.3.4 Impact on ML
Intermittent faults pose significant challenges for ML systems by undermining
computational consistency and model reliability. During the training phase,
such faults in processing units or memory can cause sporadic errors in the
computation of gradients, weight updates, or loss values. These errors may not
be persistent but can accumulate across iterations, degrading convergence and
leading to unstable or suboptimal models. Intermittent faults in storage may
corrupt input data or saved model checkpoints, further affecting the training
pipeline (Yi He et al. 2023).
In the inference phase, intermittent faults may result in inconsistent or er-
roneous predictions. Processing errors or memory corruption can distort ac-
tivations, outputs, or intermediate representations of the model, particularly
when faults affect model parameters or input data. Intermittent faults in data
pipelines, such as unreliable sensors or storage systems, can introduce subtle
input errors that degrade model robustness and output accuracy. In high-stakes
applications like autonomous driving or medical diagnosis, these inconsisten-
cies can result in dangerous decisions or failed operations.
Mitigating the effects of intermittent faults in ML systems requires a multi-
layered approach (Rashid, Pattabiraman, and Gopalakrishnan 2012). At the
hardware level, robust design practices, environmental controls, and the use of
higher-quality or more reliable components can reduce susceptibility to fault
conditions. Redundancy and error detection mechanisms can help identify and
recover from transient manifestations of intermittent faults.
At the software level, techniques such as runtime monitoring, anomaly de-
tection, and adaptive control strategies can provide resilience. Data validation and redundant computation can further contain errors before they propagate through the ML pipeline.
Scan chains are dedicated paths incorporated within a processor that allow access to internal registers and logic for testing. During the built-in self-test (BIST) process, predefined test patterns are applied to the processor's internal circuitry, and the responses are compared against expected values. Any discrepancies indicate the presence of faults. Intel's Xeon processors, for instance, include BIST mechanisms to test the CPU cores, cache memory, and other critical components during system startup.
Error Detection Codes. Error detection codes are widely used to detect data storage and transmission errors (Hamming 1950); R. W. Hamming's seminal paper introduced error detection and correction codes, significantly advancing digital communication reliability. These codes add redundant bits to the original data, allowing the detection of bit errors. Parity checks are a simple form of error detection code, shown in Figure 18.13. In a single-bit parity scheme, an extra bit is appended to each data word, making the number of 1s in the word even (even parity) or odd (odd parity); the parity bit accounts for the total number of 1s in a data word, enabling fundamental error detection.
Figure 18.13: Parity bit example, showing a sequence of seven bits with an eighth even parity bit and with an eighth odd parity bit. Source: Computer Hope (ComputerHope.com)
When reading the data, the parity is checked, and if it doesn’t match the
expected value, an error is detected. More advanced error detection codes, such
as cyclic redundancy checks (CRC), calculate a checksum based on the data and
append it to the message. The checksum is recalculated at the receiving end
and compared with the transmitted checksum to detect errors. Error-correcting
code (ECC) memory modules, commonly used in servers and critical systems,
employ advanced error detection and correction codes to detect and correct
single-bit or multi-bit errors in memory.
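To make the parity idea concrete, here is a minimal, illustrative sketch of even parity in Python; the helper names are ours, not from any particular library:

```python
# Minimal even-parity sketch illustrating single-bit error detection.
def add_even_parity(bits):
    """Append a parity bit so the total number of 1s is even."""
    return bits + [sum(bits) % 2]

def parity_ok(word):
    """True if the word (data + parity bit) still has an even number of 1s."""
    return sum(word) % 2 == 0

data = [1, 0, 1, 1, 0, 0, 1]          # seven data bits
word = add_even_parity(data)           # eighth bit makes the 1-count even
assert parity_ok(word)

word[3] ^= 1                           # a single transient bit flip
assert not parity_ok(word)             # detected: parity no longer matches
```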
When the outputs of the two redundant computing units disagree, the system identifies a potential fault in one of the units and takes appropriate action to ensure safe operation.
DMR in Tesla’s self-driving computer provides an extra safety and fault
tolerance layer. By having two independent units performing the same compu-
tations, the system can detect and mitigate faults that may occur in one of the
units. This redundancy helps prevent single points of failure and ensures that
critical functions remain operational despite hardware faults.
The system may employ additional mechanisms to determine which unit is
faulty in a mismatch. This can involve using diagnostic algorithms, comparing
the outputs with data from other sensors or subsystems, or analyzing the
consistency of the outputs over time. Once the faulty unit is identified, the
system can isolate it and continue operating using the output from the non-
faulty unit.
Tesla also incorporates redundancy mechanisms beyond DMR. For example,
they use redundant power supplies, steering and braking systems, and diverse
sensor suites (e.g., cameras, radar, and ultrasonic sensors) to provide multiple
layers of fault tolerance. These redundancies collectively contribute to the
overall safety and reliability of the self-driving system.
It’s important to note that while DMR provides fault detection and some level of fault tolerance, it cannot mask faults the way triple modular redundancy (TMR) can, since two units can reveal a mismatch but not indicate which output is correct. In DMR,
if both units experience simultaneous faults or the fault affects the comparison
mechanism, the system may be unable to identify the fault. Therefore, Tesla’s
SDCs rely on a combination of DMR and other redundancy mechanisms to
achieve a high level of fault tolerance.
The use of DMR in Tesla’s self-driving computer highlights the importance
of hardware redundancy in safety-critical applications. By employing redun-
dant computing units and comparing their outputs, the system can detect and
mitigate faults, enhancing the overall safety and reliability of the self-driving
functionality.
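The comparison-and-voting logic behind DMR and TMR can be sketched in a few lines. The toy example below is illustrative only; outputs are assumed to be directly comparable values (e.g., predicted class IDs) produced by redundant compute units:

```python
# Toy sketch of DMR/TMR output comparison; not tied to any specific hardware.
def dmr_check(out_a, out_b):
    """DMR detects a mismatch but cannot tell which unit is faulty."""
    return out_a if out_a == out_b else None   # None -> escalate or fall back

def tmr_vote(out_a, out_b, out_c):
    """TMR masks a single faulty unit via majority vote."""
    if out_a in (out_b, out_c):
        return out_a
    if out_b == out_c:
        return out_b
    return None   # all three disagree: no majority to mask the fault
```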
Another approach to hardware redundancy is the use of hot spares (in a redundant system design, backup components kept ready to instantaneously replace failing components without disrupting the operation), as employed by Google in its data centers to address SDC during ML training. Unlike DMR and TMR, which rely on parallel processing and voting mechanisms to detect and mask faults, hot spares provide fault tolerance by maintaining backup hardware units that can seamlessly take over computations when a fault is detected. As illustrated in Figure 18.15, during normal ML training, multiple workers operate in parallel while a spare unit remains idle, ready to take over if any worker encounters a fault.
Watchdog timers. Watchdog timers are hardware components that monitor the
execution of critical tasks or processes (Pont and Ong 2002). They are commonly
used to detect and recover from software or hardware faults that cause a system
to become unresponsive or stuck in an infinite loop. In an embedded system,
a watchdog timer can be configured to monitor the execution of the main
control loop, as illustrated in Figure 18.16. The software periodically resets the
watchdog timer to indicate that it functions correctly. Suppose the software fails
to reset the timer within a specified time limit (timeout period). In that case, the
watchdog timer assumes that the system has encountered a fault and triggers
a predefined recovery action, such as resetting the system or switching to a
backup component. Watchdog timers are widely used in automotive electronics,
industrial control systems, and other safety-critical applications to ensure the
timely detection and recovery from faults.
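The following is a minimal software sketch of the watchdog pattern; real deployments typically rely on hardware watchdog peripherals or OS-level services, and the class and callback names here are illustrative:

```python
# Minimal software watchdog sketch: if kick() is not called within timeout_s,
# the recovery action fires.
import threading, time

class Watchdog:
    def __init__(self, timeout_s, on_timeout):
        self.timeout_s = timeout_s
        self.on_timeout = on_timeout
        self._timer = None

    def kick(self):
        # Cancel the pending countdown and start a fresh one.
        if self._timer is not None:
            self._timer.cancel()
        self._timer = threading.Timer(self.timeout_s, self.on_timeout)
        self._timer.start()

def recover():
    print("Watchdog expired: reset the system or switch to a backup component")

wd = Watchdog(timeout_s=1.0, on_timeout=recover)
for _ in range(3):
    wd.kick()          # the main control loop signals liveness each iteration
    time.sleep(0.2)    # normal work, well under the timeout

time.sleep(2.0)        # simulate a hang: no more kicks, so recover() fires
```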
Consistency checks and data validation. Consistency checks and data validation
techniques ensure data integrity and correctness at different processing stages
in an ML system (A. Lindholm et al. 2019). These checks help detect data
corruption, inconsistencies, or errors that may propagate and affect the sys-
tem’s behavior. Example: In a distributed ML system where multiple nodes
collaborate to train a model, consistency checks can be implemented to validate
the integrity of the shared model parameters. Each node can compute a check-
sum or hash of the model parameters before and after the training iteration, as
shown in Figure 18.17. Any inconsistencies or data corruption can be detected
by comparing the checksums across nodes. Additionally, range checks can be
applied to the input data and model outputs to ensure they fall within expected
bounds. For instance, if an autonomous vehicle’s perception system detects an
object with unrealistic dimensions or velocities, it can indicate a fault in the
sensor data or the perception algorithms (Wan et al. 2023).
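One simple way to implement such a parameter consistency check, assuming a PyTorch model and an illustrative hashing scheme, is to hash the state dictionary in a deterministic order and compare digests across nodes:

```python
# Sketch of a parameter checksum for cross-node consistency checks.
import hashlib
import torch

def model_checksum(model: torch.nn.Module) -> str:
    """Hash all parameters and buffers in a deterministic (name-sorted) order."""
    h = hashlib.sha256()
    for name, tensor in sorted(model.state_dict().items()):
        h.update(name.encode())
        h.update(tensor.detach().cpu().numpy().tobytes())
    return h.hexdigest()

# Each worker computes model_checksum() after a training step; mismatched
# digests across nodes indicate corrupted or inconsistent parameters.
```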
Heartbeat and timeout mechanisms. Heartbeat mechanisms and timeouts are
commonly used to detect faults in distributed systems and ensure the liveness
and responsiveness of components (Kawazoe Aguilera, Chen, and Toueg 1997).
These are quite similar to the watchdog timers found in hardware. For example,
in a distributed ML system, where multiple nodes collaborate to perform tasks
such as data preprocessing, model training, or inference, heartbeat mechanisms
can be implemented to monitor the health and availability of each node. Each
node periodically sends a heartbeat message to a central coordinator or its
peer nodes, indicating its status and availability. Suppose a node fails to send
a heartbeat within a specified timeout period, as shown in Figure 18.18. In
that case, it is considered faulty, and appropriate actions can be taken, such as
redistributing the workload or initiating a failover mechanism. Timeouts can
also be used to detect and handle hanging or unresponsive components. For
example, if a data loading process exceeds a predefined timeout threshold, it
may indicate a fault in the data pipeline, and the system can take corrective
measures.
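A minimal sketch of the heartbeat-and-timeout pattern might look like the following; node identifiers and the recovery action are illustrative placeholders:

```python
# Heartbeat-and-timeout sketch for a coordinator monitoring worker nodes.
import time

HEARTBEAT_TIMEOUT_S = 5.0
last_heartbeat = {}            # node_id -> time of most recent heartbeat

def record_heartbeat(node_id):
    """Called whenever the coordinator receives a heartbeat message."""
    last_heartbeat[node_id] = time.monotonic()

def find_unresponsive_nodes():
    """Nodes that missed the timeout are candidates for failover."""
    now = time.monotonic()
    return [n for n, t in last_heartbeat.items()
            if now - t > HEARTBEAT_TIMEOUT_S]

# The coordinator would periodically call find_unresponsive_nodes() and
# redistribute work or trigger failover for any node it returns.
```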
18.3.5 Summary
Table 18.1 provides a comparative analysis of transient, permanent, and in-
termittent faults. It outlines the primary characteristics or dimensions that
distinguish these fault types. Here, we summarize the relevant dimensions we
examined and explore the nuances that differentiate transient, permanent, and
intermittent faults in greater detail.
18.4 Model Robustness
18.4.1 Adversarial Attacks
Adversarial attacks introduce small, carefully crafted perturbations to an input, which can cause the model to misclassify it, as shown in Figure 18.20. In this
section, we will look at the different types of adversarial attacks and their impact
on machine learning models. Understanding these attacks highlights why it
is important to build models that are robust and able to handle these kinds of
challenges.
18.4.1.1 Mechanisms
Gradient-based Attacks. One prominent category of adversarial attacks is
gradient-based attacks. These attacks leverage the gradients of the ML model’s
loss function to craft adversarial examples. The Fast Gradient Sign Method
(FGSM) is a well-known technique in this category. FGSM perturbs the input
data by adding small noise in the direction of the gradient of the loss with
respect to the input. The goal is to maximize the model’s prediction error with
minimal distortion to the original input.
The adversarial example is generated using the following formula:
$$x_{\text{adv}} = x + \epsilon \cdot \text{sign}\left(\nabla_x J(\theta, x, y)\right)$$
where:
• $x$ is the original input,
• $y$ is the true label,
• $\theta$ represents the model parameters,
• $J(\theta, x, y)$ is the loss function,
• $\epsilon$ is a small scalar that controls the magnitude of the perturbation.
This method allows for fast and efficient generation of adversarial examples
by taking a single step in the direction that increases the loss most rapidly, as
shown in Figure 18.21.
Another variant, the Projected Gradient Descent (PGD) attack, extends FGSM
by iteratively applying the gradient update step, allowing for more refined and
powerful adversarial examples. PGD projects each perturbation step back into
a constrained norm ball around the original input, ensuring that the adversarial
example remains within a specified distortion limit. This makes PGD a stronger
white-box attack and a benchmark for evaluating model robustness.
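A minimal PyTorch sketch of these two attacks, assuming a trained classifier `model`, an input batch `x` normalized to [0, 1], and integer labels `y`, might look like the following (hyperparameters are illustrative):

```python
# Illustrative FGSM and L-infinity PGD sketches in PyTorch.
import torch
import torch.nn.functional as F

def fgsm(model, x, y, eps=0.03):
    """Single-step FGSM: x_adv = x + eps * sign(grad_x J(theta, x, y))."""
    x_adv = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x_adv), y)
    loss.backward()
    return (x_adv + eps * x_adv.grad.sign()).clamp(0, 1).detach()

def pgd(model, x, y, eps=0.03, alpha=0.007, steps=10):
    """Iterative FGSM with projection back into the eps-ball around x."""
    x_adv = x.clone().detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        x_adv = x_adv.detach() + alpha * grad.sign()
        x_adv = x + (x_adv - x).clamp(-eps, eps)   # project into the eps-ball
        x_adv = x_adv.clamp(0, 1)
    return x_adv
```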
18.4.1.2 Impact on ML
Adversarial attacks on machine learning systems have emerged as a significant
concern in recent years, highlighting the potential vulnerabilities and risks
associated with the widespread adoption of ML technologies. These attacks
involve carefully crafted perturbations to input data that can deceive or mislead
ML models, leading to incorrect predictions or misclassifications, as shown in
Figure 18.22. The impact of adversarial attacks on ML systems is far-reaching
and can have serious consequences in various domains.
One striking example of the impact of adversarial attacks was demonstrated
by researchers in 2017. They experimented with small black and white stickers
on stop signs (Eykholt et al. 2017). To the human eye, these stickers did not
obscure the sign or prevent its interpretability. However, when images of the
sticker-modified stop signs were fed into standard traffic sign classification ML
models, a shocking result emerged. The models misclassified the stop signs as
speed limit signs over 85% of the time.
This demonstration shed light on the alarming potential of simple adversarial
stickers to trick ML systems into misreading critical road signs. The implications
of such attacks in the real world are significant, particularly in the context of
autonomous vehicles. If deployed on actual roads, these adversarial stickers
could cause self-driving cars to misinterpret stop signs as speed limits, leading
to dangerous situations, as shown in Figure 18.23. Researchers warned that
this could result in rolling stops or unintended acceleration into intersections,
endangering public safety.
The case study of the adversarial stickers on stop signs provides a concrete
illustration of how adversarial examples exploit how ML models recognize
patterns. By subtly manipulating the input data in ways that are invisible to
humans, attackers can induce incorrect predictions and create serious risks,
especially in safety-critical applications like autonomous vehicles. The attack’s
simplicity highlights the vulnerability of ML models to even minor changes in
the input, emphasizing the need for robust defenses against such threats.
The impact of adversarial attacks extends beyond the degradation of model
performance. These attacks raise significant security and safety concerns, partic-
ularly in domains where ML models are relied upon for critical decision-making.
In healthcare applications, adversarial attacks on medical imaging models could lead to misdiagnoses or inappropriate treatment decisions, directly endangering patients.
18.4.2 Data Poisoning
18.4.2.1 Characteristics
Data poisoning is an attack in which the training data is deliberately manipu-
lated to compromise the performance or behavior of a machine learning model,
as described in (Biggio, Nelson, and Laskov 2012) and illustrated in Figure 18.24.
18.4.2.2 Mechanisms
Data poisoning can be implemented through a variety of mechanisms, depend-
ing on the attacker’s access to the system and understanding of the data pipeline.
These mechanisms reflect different strategies for how the training data can be
corrupted to achieve malicious outcomes.
One of the most direct approaches involves modifying the labels of training
data. In this method, an attacker selects a subset of training samples and alters
their labels—flipping 𝑦 = 1 to 𝑦 = 0, or reassigning categories in multi-class
settings. As shown in Figure 18.25, even small-scale label inconsistencies can
lead to significant distributional shifts and learning disruptions.
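As a toy illustration of this mechanism, the sketch below flips a small fraction of binary labels in a NumPy label array; the function and parameter names are ours, not drawn from any attack toolkit:

```python
# Toy label-flipping poisoning sketch; y is assumed to be a NumPy array of 0/1.
import numpy as np

def flip_labels(y, fraction=0.05, seed=0):
    rng = np.random.default_rng(seed)
    y_poisoned = y.copy()
    idx = rng.choice(len(y), size=int(fraction * len(y)), replace=False)
    y_poisoned[idx] = 1 - y_poisoned[idx]      # invert the selected labels
    return y_poisoned
```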
Another mechanism involves modifying the input features of training exam-
ples without changing the labels. This might include imperceptible pixel-level
changes in images, subtle perturbations in structured data, or embedding fixed
patterns that act as triggers for backdoor attacks. These alterations are often
designed using optimization techniques that maximize their influence on the
model while minimizing detectability.
More sophisticated attacks generate entirely new, malicious training exam-
ples. These synthetic samples may be created using adversarial methods, gen-
erative models, or even data synthesis tools. The aim is to carefully craft inputs
that will distort the decision boundary of the model when incorporated into the
training set. Such inputs may appear natural and legitimate but are engineered
to introduce vulnerabilities.
Other attackers focus on weaknesses in data collection and preprocessing. If
the training data is sourced from web scraping, social media, or untrusted user
submissions, poisoned samples can be introduced upstream. These samples
may pass through insufficient cleaning or validation checks, reaching the model
in a “trusted” form. This is particularly dangerous in automated pipelines
where human review is limited or absent.
In physically deployed systems, attackers may manipulate data at the source—
for example, altering the environment captured by a sensor. A self-driving
car might encounter poisoned data if visual markers on a road sign are sub-
tly altered, causing the model to misclassify it during training. This kind of
environmental poisoning blurs the line between adversarial attacks and data
poisoning, but the mechanism, which involves compromising the training data,
is the same.
Online learning systems represent another unique attack surface. These
systems continuously adapt to new data streams, making them particularly
susceptible to gradual poisoning. An attacker may introduce malicious samples
incrementally, causing slow but steady shifts in model behavior. This form of
attack is illustrated in Figure 18.26.
Insider collaboration adds a final layer of complexity. Malicious actors with
legitimate access to training data, including annotators, researchers, or data
vendors, can craft poisoning strategies that are more targeted and subtle than
external attacks. These insiders may have knowledge of the model architecture and data pipeline, making their manipulations especially difficult to detect.
Figure 18.26: Data Poisoning Attack, showing local data from a model owner being aggregated on a server, resulting in a poisoned model on the server.
18.4.2.3 Impact on ML
The effects of data poisoning extend far beyond simple accuracy degradation.
In the most general sense, a poisoned dataset leads to a corrupted model. But
the specific consequences depend on the attack vector and the adversary’s
objective.
One common outcome is the degradation of overall model performance.
When large portions of the training set are poisoned, often through label flipping
or the introduction of noisy features, the model struggles to identify valid
patterns, leading to lower accuracy, recall, or precision. In mission-critical
applications like medical diagnosis or fraud detection, even small performance
losses can result in significant real-world harm.
Targeted poisoning presents a different kind of danger. Rather than under-
mining the model’s general performance, these attacks cause specific misclassi-
fications. A malware detector, for instance, may be engineered to ignore one
particular signature, allowing a single attack to bypass security. Similarly, a
facial recognition model might be manipulated to misidentify a specific indi-
vidual, while functioning normally for others.
Some poisoning attacks introduce hidden vulnerabilities in the form of back-
doors or trojans. These poisoned models behave as expected during evaluation
but respond in a malicious way when presented with specific triggers. In
such cases, attackers can “activate” the exploit on demand, bypassing system
protections without triggering alerts.
Bias is another insidious impact of data poisoning. If an attacker poisons
samples tied to a specific demographic or feature group, they can skew the model's behavior for that group, embedding systematic bias into its predictions.
18.4.3 Distribution Shift
Figure 18.28: The curly brackets enclose the distribution shift between the environments; here, z stands for the spurious feature and y for the label class. Panels: (a) diversity shift, contrasting p(z) across Domain 1 and Domain 2; (b) correlation shift, contrasting p(y = 0 | z) and p(y = 1 | z).
Another major source is temporal drift, where the input distribution evolves
gradually or suddenly over time. In production settings, data changes due to
new trends, seasonal effects, or shifts in user behavior. For instance, in a fraud
detection system, fraud patterns may evolve as adversaries adapt. Without
ongoing monitoring or retraining, models become stale and ineffective. This
form of shift is visualized in Figure 18.29.
Contextual changes arise when deployment environments differ from train-
ing conditions due to external factors such as lighting, sensor variation, or user
behavior. For example, a vision model trained in a lab under controlled lighting
may underperform when deployed in outdoor or dynamic environments.
Another subtle but critical factor is unrepresentative training data. If the
training dataset fails to capture the full variability of the production environ-
ment, the model may generalize poorly. For example, a facial recognition model
trained predominantly on one demographic group may produce biased or inac-
curate predictions when deployed more broadly. In this case, the shift reflects
missing diversity or structure in the training data.
Figure: feature distributions over Feature A and Feature B at T = 0 and T = 1, illustrating how the data shifts over time.
Distribution shifts like these can dramatically reduce the performance and
reliability of ML models in production. Building robust systems requires not
only understanding these shifts, but actively detecting and responding to them
as they emerge.
18.4.3.2 Mechanisms
Distribution shifts arise from a variety of underlying mechanisms—both natural
and system-driven. Understanding these mechanisms helps practitioners detect,
diagnose, and design mitigation strategies.
One common mechanism is a change in data sources. When data collected at
inference time comes from different sensors, APIs, platforms, or hardware than
the training data, even subtle differences in resolution, formatting, or noise can
introduce significant shifts. For example, a speech recognition model trained
on audio from one microphone type may struggle with data from a different
device.
Temporal evolution refers to changes in the underlying data over time. In
recommendation systems, user preferences shift. In finance, market conditions
change. These shifts may be slow and continuous or abrupt and disruptive.
Without temporal awareness or continuous evaluation, models can become progressively stale and miscalibrated as the data drifts away from the training distribution.
18.4.3.3 Impact on ML
Distribution shift can affect nearly every dimension of ML system performance,
from prediction accuracy and latency to user trust and system maintainability.
A common and immediate consequence is degraded predictive performance.
When the data at inference time differs from training data, the model may
produce systematically inaccurate or inconsistent predictions. This erosion
of accuracy is particularly dangerous in high-stakes applications like fraud
detection, autonomous vehicles, or clinical decision support.
Another serious effect is loss of reliability and trustworthiness. As distribu-
tion shifts, users may notice inconsistent or erratic behavior. For example, a
recommendation system might begin suggesting irrelevant or offensive con-
tent. Even if overall accuracy metrics remain acceptable, loss of user trust can
undermine the system’s value.
Distribution shift also amplifies model bias. If certain groups or data seg-
ments are underrepresented in the training data, the model may fail more fre-
quently on those groups. Under shifting conditions, these failures can become
more pronounced, resulting in discriminatory outcomes or fairness violations.
There is also a rise in uncertainty and operational risk. In many production
settings, model decisions feed directly into business operations or automated
actions. Under shift, these decisions become less predictable and harder to
validate, increasing the risk of cascading failures or poor decisions downstream.
From a system maintenance perspective, distribution shifts complicate re-
training and deployment workflows. Without robust mechanisms for drift de-
tection and performance monitoring, shifts may go unnoticed until performance
degrades significantly. Once detected, retraining may be required—raising chal-
lenges related to data collection, labeling, model rollback, and validation. This
creates friction in continuous integration and deployment (CI/CD) workflows
and can significantly slow down iteration cycles.
Moreover, distribution shift increases vulnerability to adversarial attacks. At-
tackers can exploit the model’s poor calibration on unfamiliar data, using slight
perturbations to push inputs outside the training distribution and cause fail-
ures. This is especially concerning when system feedback loops or automated
decisioning pipelines are in place.
From a systems perspective, distribution shift is not just a modeling concern—
it is a core operational challenge. It requires end-to-end system support: mech-
anisms for data logging, drift detection, automated alerts, model versioning,
and scheduled retraining. ML systems must be designed to detect when per-
formance degrades in production, diagnose whether a distribution shift is the
cause, and trigger appropriate mitigation actions. This might include human-
in-the-loop review, fallback strategies, model retraining pipelines, or staged
deployment rollouts.
In mature ML systems, handling distribution shift becomes a matter of infras-
tructure, observability, and automation, not just modeling technique. Failing to
account for it risks silent model failure in dynamic, real-world environments—
precisely where ML systems are expected to deliver the most value.
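As one simple building block for such drift detection infrastructure, a per-feature two-sample Kolmogorov-Smirnov test can compare recent production inputs against a training-time reference. The sketch below is illustrative and assumes both datasets are NumPy arrays of shape (n_samples, n_features):

```python
# Per-feature covariate-drift check using a two-sample KS test.
from scipy.stats import ks_2samp

def drifted_features(reference, live, alpha=0.01):
    flagged = []
    for j in range(reference.shape[1]):
        stat, p_value = ks_2samp(reference[:, j], live[:, j])
        if p_value < alpha:          # distributions differ significantly
            flagged.append((j, stat))
    return flagged                    # feed into alerting / retraining triggers
```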
A summary of common types of distribution shifts, their effects on model
performance, and potential system-level responses is shown in Table 18.3.
Table 18.3: Common types of distribution shift, their effects, and system-level mitigations.

| Type of Shift | Cause or Example | Consequence for Model | System-Level Response |
|---|---|---|---|
| Covariate Shift | Change in input features (e.g., sensor calibration drift) | Model misclassifies new inputs despite consistent labels | Monitor input distributions; retrain with updated features |
| Label Shift | Change in label distribution (e.g., new class frequencies in usage) | Prediction probabilities become skewed | Track label priors; reweight or adapt output calibration |
| Concept Drift | Evolving relationship between inputs and outputs (e.g., fraud tactics) | Model performance degrades over time | Retrain frequently; use continual or online learning |
| Domain Mismatch | Train on reviews, deploy on tweets | Poor generalization due to different vocabularies or styles | Use domain adaptation or fine-tuning |
| Contextual Change | New deployment environment (e.g., lighting, user behavior) | Performance varies by context | Collect contextual data; monitor conditional accuracy |
| Selection Bias | Underrepresentation during training | Biased predictions for unseen groups | Validate dataset balance; augment training data |
| Feedback Loops | Model outputs affect future inputs (e.g., recommender systems) | Reinforced drift, unpredictable patterns | Monitor feedback effects; consider counterfactual logging |
| Adversarial Shift | Attackers introduce OOD inputs or perturbations | Model becomes vulnerable to targeted failures | Use robust training; detect out-of-distribution inputs |
Ensemble methods combine the predictions of multiple independently trained models, so a successful attack must fool most of the members in the ensemble, leading to more reliable and robust predictions. Model di-
versification techniques, such as using different preprocessing techniques or
feature representations for each model in the ensemble, can further enhance
the robustness.
Evaluation and Testing. Conduct thorough evaluation and testing to assess
the effectiveness of adversarial defense techniques and measure the robustness
of ML models.
Adversarial robustness metrics quantify the model’s resilience to adversar-
ial attacks. These metrics can include the model’s accuracy on adversarial
examples, the average distortion required to fool the model, or the model’s per-
formance under different attack strengths. By comparing these metrics across
different models or defense techniques, practitioners can assess and compare
their robustness levels.
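One such metric, robust accuracy, can be computed by evaluating the model on attacked inputs. The sketch below assumes a data loader and an attack function with the signature of the FGSM example sketched earlier:

```python
# Robust-accuracy evaluation sketch: accuracy measured on adversarial inputs.
import torch

def robust_accuracy(model, loader, attack, eps=0.03):
    model.eval()
    correct, total = 0, 0
    for x, y in loader:
        x_adv = attack(model, x, y, eps=eps)      # e.g., FGSM or PGD
        with torch.no_grad():
            preds = model(x_adv).argmax(dim=1)
        correct += (preds == y).sum().item()
        total += y.numel()
    return correct / total
```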
Standardized adversarial attack benchmarks and datasets provide a common
ground for evaluating and comparing the robustness of ML models. These
benchmarks include datasets with pre-generated adversarial examples and
tools and frameworks for generating adversarial attacks. Examples of pop-
ular adversarial attack benchmarks include the MNIST-C, CIFAR-10-C, and
ImageNet-C (Hendrycks and Dietterich 2019) datasets, which contain corrupted
or perturbed versions of the original datasets.
Practitioners can develop more robust and resilient ML systems by lever-
aging these adversarial example detection techniques, defense strategies, and
robustness evaluation methods. However, it is important to note that adversar-
ial robustness is an ongoing research area, and no single technique provides
complete protection against all types of adversarial attacks. A comprehensive
approach that combines multiple defense mechanisms and regular testing is
essential to maintain the security and reliability of ML systems in the face of
evolving adversarial threats.
Clustering-based detection assumes that poisoned samples often form distinct clusters or lie far away from the normal data clusters. By applying clus-
tering algorithms like K-means, DBSCAN, or hierarchical clustering, anomalous
clusters or data points that do not belong to any cluster can be identified. These
anomalous instances are then treated as potentially poisoned data.
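A minimal sketch of this clustering-based screening step is shown below, using scikit-learn's DBSCAN on feature embeddings. The embedding source and the `eps` and `min_samples` values are illustrative assumptions, not recommended settings.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

def flag_suspicious_samples(embeddings: np.ndarray, eps: float = 0.5, min_samples: int = 10) -> np.ndarray:
    """Return indices of training samples whose embeddings fall outside dense clusters.

    DBSCAN labels points that do not belong to any dense cluster as -1 (noise);
    such points are candidates for manual review as potentially poisoned data.
    """
    scaled = StandardScaler().fit_transform(embeddings)
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(scaled)
    return np.where(labels == -1)[0]

# Example usage (embeddings could come from a penultimate-layer feature extractor):
# suspect_idx = flag_suspicious_samples(feature_matrix)
```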
Autoencoders are neural networks trained to reconstruct the input data from
a compressed representation, as shown in Figure 18.32. They can be used for
anomaly detection by learning the normal patterns in the data and identifying
instances that deviate from them. During training, the autoencoder is trained
on clean, unpoisoned data. At inference time, the reconstruction error for
each data point is computed. Data points with high reconstruction errors are
considered abnormal and potentially poisoned, as they do not conform to the
learned normal patterns.
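The sketch below illustrates the reconstruction-error idea for fixed-length feature vectors in PyTorch; the architecture, threshold, and training details are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TinyAutoencoder(nn.Module):
    """A small fully connected autoencoder for fixed-length feature vectors."""
    def __init__(self, in_dim: int, latent_dim: int = 8):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, 32), nn.ReLU(), nn.Linear(32, latent_dim))
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 32), nn.ReLU(), nn.Linear(32, in_dim))

    def forward(self, x):
        return self.decoder(self.encoder(x))

@torch.no_grad()
def reconstruction_errors(model: TinyAutoencoder, x: torch.Tensor) -> torch.Tensor:
    """Per-sample mean squared reconstruction error; high values suggest anomalies."""
    model.eval()
    return ((model(x) - x) ** 2).mean(dim=1)

# After training the autoencoder on clean data, flag points whose error exceeds a
# threshold, e.g. the 99th percentile of errors observed on a held-out clean set:
# errs = reconstruction_errors(autoencoder, candidate_batch)
# suspicious = errs > torch.quantile(clean_errors, 0.99)
```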
Data validation involves verifying the integrity and consistency of the training
data. This can include checking for data type consistency, range validation,
and cross-field dependencies. By defining and enforcing data validation rules,
anomalous or inconsistent data points indicative of data poisoning can be
identified and flagged for further investigation.
Data provenance and lineage tracking involve maintaining a record of data’s
origin, transformations, and movements throughout the ML pipeline. By docu-
menting the data sources, preprocessing steps, and any modifications made to
the data, practitioners can trace anomalies or suspicious patterns back to their
origin. This helps identify potential points of data poisoning and facilitates the
investigation and mitigation process.
Robust Training. Robust optimization techniques can be used to modify the training objective to minimize the impact of outliers or poisoned instances. This can be achieved by using robust loss functions less sensitive to extreme values, such as the Huber loss or the modified Huber loss.¹⁸ Regularization techniques,¹⁹ such as L1 or L2 regularization, can also help in reducing the model's sensitivity to poisoned data by constraining the model's complexity and preventing overfitting.

Robust loss functions are designed to be less sensitive to outliers or noisy data points. Examples include the modified Huber loss, the Tukey loss (Beaton and Tukey 1974), and the trimmed mean loss. These loss functions down-weight or ignore the contribution of abnormal instances during training, reducing their impact on the model's learning process. Robust objective functions, such as the minimax²⁰ or distributionally robust objective, aim to optimize the model's performance under worst-case scenarios or in the presence of adversarial perturbations.

¹⁸ Huber Loss: A loss function used in robust regression that is less sensitive to outliers in data than squared error loss.

¹⁹ Regularization: A method used in neural networks to prevent overfitting in models by adding a cost term to the loss function.

²⁰ Minimax: A decision-making strategy, used in game theory and decision theory, which tries to minimize the maximum possible loss.

Data augmentation techniques involve generating additional training examples by applying random transformations or perturbations to the existing data (Figure 18.33). This helps in increasing the diversity and robustness of the
training dataset. By introducing controlled variations in the data, the model
becomes less sensitive to specific patterns or artifacts that may be present in
poisoned instances. Randomization techniques, such as random subsampling
or bootstrap aggregating, can also help reduce the impact of poisoned data by
training multiple models on different subsets of the data and combining their
predictions.
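As a small illustration of the robust-loss idea described above, the following sketch swaps squared error for PyTorch's built-in Huber loss in an otherwise ordinary training step; the model, data, and `delta` value are placeholders.

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 1)                         # placeholder regression model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = nn.HuberLoss(delta=1.0)              # quadratic near zero, linear for large residuals

def train_step(x: torch.Tensor, y: torch.Tensor) -> float:
    """One robust training step: large residuals (possibly from poisoned points)
    contribute only linearly to the loss, limiting their pull on the gradients."""
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()
    return loss.item()

# x_batch, y_batch = ...  # a training minibatch that may contain outliers
# train_step(x_batch, y_batch)
```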
Secure Data Sourcing. Implementing the best data collection and curation
practices can help mitigate the risk of data poisoning. This includes establish-
ing clear data collection protocols, verifying the authenticity and reliability of
data sources, and conducting regular data quality assessments. Sourcing data
from trusted and reputable providers and following secure data handling prac-
tices can reduce the likelihood of introducing poisoned data into the training
pipeline.
Strong data governance and access control mechanisms are essential to pre-
vent unauthorized modifications or tampering with the training data. This
involves defining clear roles and responsibilities for data access, implementing
access control policies based on the principle of least privilege,²¹ and monitoring and logging data access activities. By restricting access to the training data and maintaining an audit trail, potential data poisoning attempts can be detected and investigated.

Detecting and mitigating data poisoning attacks requires a multifaceted approach that combines anomaly detection, data sanitization,²² robust training techniques, and secure data sourcing practices. By implementing these measures, ML practitioners can improve the resilience of their models against data poisoning and ensure the integrity and trustworthiness of the training data. However, it is important to note that data poisoning is an active area of research, and new attack vectors and defense mechanisms continue to emerge. Staying informed about the latest developments and adopting a proactive and adaptive approach to data security is crucial for maintaining the robustness of ML systems.

²¹ Principle of Least Privilege: A security concept in which a user is given the minimum levels of access necessary to complete his/her job functions.

²² Data Sanitization: The process of deliberately, permanently, and irreversibly removing or destroying the data stored on a memory device to make it unrecoverable.
18.5.1 Characteristics
Software faults in ML frameworks originate from various sources, including
programming errors, architectural misalignments, and version incompatibili-
ties. These faults exhibit several important characteristics that influence how
they arise and propagate in practice.
One defining feature of software faults is their diversity. Faults can range from
syntactic and logical errors to more complex manifestations such as memory
leaks, concurrency bugs, or failures in integration logic. The broad variety of
potential fault types complicates both their identification and resolution, as
they often surface in non-obvious ways.
A second key characteristic is their tendency to propagate across system
boundaries. An error introduced in a low-level module, such as a tensor al-
location routine or a preprocessing function, can produce cascading effects
that disrupt model training, inference, or evaluation. Because ML frameworks
are often composed of interconnected components, a fault in one part of the
pipeline can introduce failures in seemingly unrelated modules.
Some faults are intermittent, manifesting only under specific conditions such
as high system load, particular hardware configurations, or rare data inputs.
These transient faults are notoriously difficult to reproduce and diagnose, as
they may not consistently appear during standard testing procedures.
Furthermore, software faults may subtly interact with ML models themselves.
For example, a bug in a data transformation script might introduce systematic
noise or shift the distribution of inputs, leading to biased or inaccurate predic-
tions. Similarly, faults in the serving infrastructure may result in discrepancies
between training-time and inference-time behaviors, undermining deployment
consistency.
The consequences of software faults extend to a range of system properties.
Faults may impair performance by introducing latency or inefficient memory
usage; they may reduce scalability by limiting parallelism; or they may compro-
mise reliability and security by exposing the system to unexpected behaviors
or malicious exploitation.
Finally, the manifestation of software faults is often shaped by external depen-
dencies, such as hardware platforms, operating systems, or third-party libraries.
Incompatibilities arising from version mismatches or hardware-specific behav-
ior may result in subtle, hard-to-trace bugs that only appear under certain
runtime conditions.
A thorough understanding of these characteristics is essential for developing
robust software engineering practices in ML. It also provides the foundation
for the detection and mitigation strategies described later in this section.
18.5.2 Mechanisms
Software faults in ML frameworks arise through a variety of mechanisms,
reflecting the complexity of modern ML pipelines and the layered architecture of
supporting tools. These mechanisms correspond to specific classes of software
failures that commonly occur in practice.
One prominent class involves resource mismanagement, particularly with re-
spect to memory. Improper memory allocation, including the failure to release
buffers or file handles, can lead to memory leaks and, eventually, to resource
exhaustion. This is especially detrimental in deep learning applications, where
large tensors and GPU memory allocations are common. As shown in Fig-
ure 18.35, inefficient memory usage or the failure to release GPU resources can
cause training procedures to halt or significantly degrade runtime performance.
18.5.3 Impact on ML
The consequences of software faults can be profound, affecting not only the
correctness of model outputs but also the broader usability and reliability of an
ML system in production.
Performance degradation is a common symptom, often resulting from mem-
ory leaks, inefficient resource scheduling, or contention between concurrent
threads. These issues tend to accumulate over time, leading to increased latency,
reduced throughput, or even system crashes. As noted by Maas et al. (2024),
Table 18.4: Summary of detection and mitigation techniques for software faults.

| Category | Technique | Purpose | When to Apply |
|---|---|---|---|
| Testing and Validation | Unit testing, integration testing, regression testing | Verify correctness and identify regressions | During development |
The first line of defense involves systematic testing. Unit testing verifies that
individual components behave as expected under normal and edge-case condi-
tions. Integration testing ensures that modules interact correctly across bound-
aries, while regression testing detects errors introduced by code changes. Con-
tinuous testing is essential in fast-moving ML environments, where pipelines
evolve rapidly and small modifications may have system-wide consequences.
As shown in Figure 18.36, automated regression tests help preserve functional
correctness over time.
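As a hedged illustration, a unit test plus a regression test for a hypothetical preprocessing function might look like the pytest-style sketch below; the `normalize` function and its expected properties are assumptions made for the example.

```python
import numpy as np

def normalize(x: np.ndarray) -> np.ndarray:
    """Hypothetical pipeline step under test: scale each feature to zero mean, unit variance."""
    return (x - x.mean(axis=0)) / (x.std(axis=0) + 1e-8)

def test_normalize_shape_and_statistics():
    x = np.random.default_rng(0).normal(5.0, 2.0, size=(100, 3))
    z = normalize(x)
    assert z.shape == x.shape                            # shape is preserved
    assert np.allclose(z.mean(axis=0), 0.0, atol=1e-6)   # zero mean per feature
    assert np.allclose(z.std(axis=0), 1.0, atol=1e-3)    # unit variance per feature

def test_normalize_handles_constant_feature():
    # Regression-style check: a constant column must not produce NaNs or infinities.
    z = normalize(np.ones((10, 2)))
    assert np.isfinite(z).all()
```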
despite hardware faults. This section provides an overview of widely used fault
models in the literature and the tools and frameworks developed to evaluate
the impact of such faults on ML systems.
[Figure: classification of a soft error outcome. If the fault produces no incorrect output or system crash it is masked at the software level (for example by dynamically dead instructions, logical or compare instructions, or uninfluential branch instructions identified through software-level analysis); otherwise it results in a failure.]
To address these discrepancies, tools like Fidelity (Yi He, Balaprakash, and
Li 2020) have been developed to align fault models across abstraction layers.
By mapping software-observed fault behaviors to corresponding hardware-
level patterns (E. Cheng et al. 2016), Fidelity offers a more accurate means of
simulating hardware faults at the software level. While lower-level tools capture
the true propagation of errors through a hardware system, they are generally
slower and more complex. Software-level tools, such as those implemented in
PyTorch or TensorFlow, are faster and easier to use for large-scale robustness
testing, albeit with less precision.
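For example, a minimal software-level injector might flip a single bit in a model weight to emulate a transient memory fault, as in the PyTorch sketch below. This is an illustrative stand-in, not the interface of PyTorchFI or any specific framework.

```python
import torch

def flip_random_bit(weight: torch.Tensor) -> torch.Tensor:
    """Return a copy of a float32 tensor with one randomly chosen bit flipped,
    emulating a single-bit transient fault at the software level."""
    corrupted = weight.detach().clone().contiguous()
    as_int = corrupted.view(torch.int32)              # reinterpret the same bytes as int32
    idx = torch.randint(as_int.numel(), (1,)).item()  # which element to corrupt
    bit = torch.randint(31, (1,)).item()              # which bit to flip (sign bit skipped for simplicity)
    as_int.view(-1)[idx] ^= (1 << bit)                # flip the chosen bit in place
    return corrupted

# Example: inject a fault into a layer's weights and observe the effect on its outputs.
layer = torch.nn.Linear(16, 4)
with torch.no_grad():
    layer.weight.copy_(flip_random_bit(layer.weight))
```

Repeating such injections over many weights and inputs gives a statistical picture of which parts of a model are most sensitive to bit-level corruption.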
18.6.2.1 Methods
Two of the most common hardware-based fault injection methods are FPGA-
based fault injection and radiation or beam testing.
FPGA-based Fault Injection. Field-Programmable Gate Arrays (FPGAs)
are reconfigurable integrated circuits that can be programmed to implement
various hardware designs. In the context of fault injection, FPGAs offer high
precision and accuracy, as researchers can target specific bits or sets of bits
within the hardware. By modifying the FPGA configuration, faults can be
introduced at specific locations and times during the execution of an ML model.
FPGA-based fault injection allows for fine-grained control over the fault model,
enabling researchers to study the impact of different types of faults, such as
single-bit flips or multi-bit errors. This level of control makes FPGA-based
fault injection a valuable tool for understanding the resilience of ML systems to
hardware faults.
While FPGA-based methods allow precise, controlled fault injection, other
approaches aim to replicate fault conditions found in natural environments.
Radiation or Beam Testing. Radiation or beam testing (Velazco, Foucard, and
Peronnard 2010) exposes hardware running ML models to high-energy particles, replicating the kind of radiation-induced soft errors encountered in natural environments.
18.6.2.2 Limitations
Despite their high accuracy, hardware-based fault injection methods have sev-
eral limitations that can hinder their widespread adoption.
First, cost is a major barrier. Both FPGA-based and beam testing²⁶ approaches require specialized hardware and facilities, which can be expensive to set up and maintain. This makes them less accessible to research groups with limited funding or infrastructure.

Second, these methods face challenges in scalability. Injecting faults and collecting data directly on hardware is time-consuming, which limits the number of experiments that can be run in a reasonable timeframe. This is especially restrictive when analyzing large ML systems or performing statistical evaluations across many fault scenarios.

²⁶ Beam Testing: A testing method that exposes hardware to controlled particle radiation to evaluate its resilience to soft errors. Common in aerospace, medical devices, and high-reliability computing.
18.6.3.2 Limitations
While software-based fault injection tools offer significant advantages in terms
of speed, flexibility, and accessibility, they are not without limitations. These
constraints can impact the accuracy and realism of fault injection experiments,
particularly when assessing the robustness of ML systems to real-world hard-
ware faults.
One major concern is accuracy. Because software-based tools operate at
higher levels of abstraction, they may not always capture the full spectrum
of effects that hardware faults can produce. Low-level hardware interactions,
including subtle timing errors, voltage fluctuations, and architectural side
effects, can be missed entirely in high-level simulations. As a result, fault
injection studies that rely solely on software models may under- or overestimate
a system’s true vulnerability to certain classes of faults.
Closely related is the issue of fidelity. While software-based methods are
often designed to emulate specific fault behaviors, the extent to which they
reflect real-world hardware conditions can vary. For example, simulating a
single-bit flip in the value of a neural network weight may not fully replicate
how that same bit error would propagate through memory hierarchies or affect
computation units on an actual chip. The more abstract the tool, the greater
the risk that the simulated behavior will diverge from physical behavior under
fault conditions.
Moreover, because software-based tools are easier to modify, there is a risk
of unintentionally deviating from realistic fault assumptions. This can occur if
the chosen fault model is overly simplified or not grounded in empirical data
from actual hardware behavior. As discussed later in the section on bridging
the hardware-software gap, tools like Fidelity (Yi He, Balaprakash, and Li
2020) attempt to address these concerns by aligning software-level models with
known hardware-level fault characteristics.
Despite these limitations, software-based fault injection remains a critical part
of the ML robustness research toolkit. When used appropriately, particularly in conjunction with hardware-based validation, these tools provide a scalable and efficient way to explore large design spaces, identify vulnerable
components, and develop mitigation strategies. As fault modeling techniques
continue to evolve, the integration of hardware-aware insights into software-
based tools will be key to improving their realism and impact.
As discussed by Bolchini et al. (2023), hardware faults can exhibit complex spatial distribution patterns that are difficult to replicate using purely
software-based fault models. They identify four characteristic fault propagation
patterns: single point, where the fault corrupts a single value in a feature map;
same row, where a partial or entire row in a feature map is corrupted; bullet
wake, where the same location across multiple feature maps is affected; and
shatter glass, a more complex combination of both same row and bullet wake
behaviors. These diverse patterns, visualized in Figure 18.42, highlight the
limits of simplistic injection strategies and emphasize the need for hardware-
aware modeling when evaluating ML system robustness.
To address this abstraction gap, researchers have developed tools that explic-
itly aim to map low-level hardware error behavior to software-visible effects.
One such tool is Fidelity, which bridges this gap by studying how hardware-
level faults propagate and become observable at higher software layers. The
next section discusses Fidelity in more detail.
18.6.4.1 Fidelity
Fidelity (Yi He, Balaprakash, and Li 2020) is a tool designed to model hardware
faults more accurately within software-based fault injection experiments. Its
core goal is to bridge the gap between low-level hardware fault behavior and
the higher-level effects observed in machine learning systems by simulating
how faults propagate through the compute stack.
The central insight behind Fidelity is that not all faults need to be modeled
individually at the hardware level to yield meaningful results. Instead, Fidelity
focuses on how faults manifest at the software-visible state and identifies equiv-
alence relationships that allow representative modeling of entire fault classes.
To accomplish this, it relies on several key principles:
First, fault propagation is studied to understand how a fault originating in
hardware can move through various layers, including architectural registers,
memory hierarchies, and numerical operations, eventually altering values in the software-visible state of the model.
Tools like Fidelity are central to this effort. By establishing mappings between
low-level hardware behavior and higher-level software effects, Fidelity and
similar tools empower researchers to conduct fault injection experiments that
are not only faster and more scalable, but also grounded in real-world system
behavior.
As ML systems continue to increase in scale and are deployed in increasingly
safety-critical environments, this kind of hardware-aware modeling will become
even more important. Ongoing research in this space aims to further refine the
translation between hardware and software fault models and to develop tools
that offer both efficiency and realism in evaluating ML system resilience. These
advances will provide the community with more powerful, reliable methods
for understanding and defending against the effects of hardware faults.
18.7 Conclusion
The pursuit of robust AI is a multifaceted endeavor that is critical for the reliable
deployment of machine learning systems in real-world environments. As ML systems move from controlled research settings to practical applications, robustness
becomes not just a desirable feature but a foundational requirement. Deploying
AI in practice means engaging directly with the challenges that can compromise
system performance, safety, and reliability.
We examined the broad spectrum of issues that threaten AI robustness,
beginning with hardware-level faults. Transient faults may introduce temporary
computational errors, while permanent faults, including the well-known Intel
FDIV bug, can lead to persistent inaccuracies that affect system behavior over
time.
Beyond hardware, machine learning models themselves are susceptible to a
variety of threats. Adversarial examples, such as the misclassification of mod-
ified stop signs, reveal how subtle input manipulations can cause erroneous
outputs. Likewise, data poisoning techniques, exemplified by the Nightshade
project, illustrate how malicious training data can degrade model performance
or implant hidden backdoors, posing serious security risks in practical deploy-
ments.
The chapter also addressed the impact of distribution shifts, which often
result from temporal evolution or domain mismatches between training and
deployment environments. Such shifts challenge a model’s ability to gener-
alize and perform reliably under changing conditions. Compounding these
issues are faults in the software infrastructure, including frameworks, libraries,
and runtime components, which can propagate unpredictably and undermine
system integrity.
To navigate these risks, the use of robust tools and evaluation frameworks
is essential. Tools such as PyTorchFI and Fidelity enable researchers and prac-
titioners to simulate fault scenarios, assess vulnerabilities, and systematically
improve system resilience. These resources are critical for translating theoretical
robustness principles into operational safeguards.
Ultimately, building robust AI requires a comprehensive and proactive approach. Fault tolerance, security mechanisms, and continuous monitoring must be designed in from the start and sustained throughout deployment.
18.8 Resources
Slides
• Coming soon.
Videos
• Coming soon.
Exercises
• Coming soon.
Chapter 19
AI for Good
Purpose
How can we harness machine learning systems to address critical societal challenges,
and what principles guide the development of solutions that create lasting positive
impact?
The application of AI systems to societal challenges represents the culmi-
nation of technical capability and social responsibility. Impact-driven devel-
opment reveals essential patterns for translating technological potential into
meaningful change, highlighting critical relationships between system design
and societal outcomes. The implementation of solutions for social good show-
cases pathways for addressing complex challenges while maintaining technical
rigor and operational effectiveness. Understanding these impact dynamics
provides insights into creating transformative systems, establishing principles
for designing AI solutions that advance human welfare and promote positive societal transformation.
Learning Objectives
19.1 Overview
Previous chapters examined the fundamental components of machine learning
systems - from neural architectures and training methodologies to accelera-
tion techniques and deployment strategies. These chapters established how
to build, optimize, and operate ML systems at scale. The examples and tech-
niques focused primarily on scenarios where computational resources, reliable
infrastructure, and technical expertise were readily available.
Machine learning systems, however, extend beyond commercial and indus-
trial applications. While recommendation engines, computer vision systems,
and natural language processors drive business value, ML systems also hold
immense potential for addressing pressing societal challenges. This potential
remains largely unrealized due to the distinct challenges of deploying ML systems in resource-constrained environments.⁰

⁰ Resource-constrained environments: Areas with limited computing capabilities, connectivity, and support infrastructure.

Engineering ML systems for social impact differs fundamentally from commercial deployments. These systems must operate in environments with limited computing resources, intermittent connectivity, and minimal technical support
infrastructure. Such constraints reshape every aspect of ML system design—
from model architecture and training approaches to deployment patterns and
maintenance strategies. Success requires rethinking traditional ML system
design patterns to create solutions that are robust, maintainable, and effective
despite these limitations.
Building ML systems for social good is an engineering challenge.
This chapter highlights some AI applications for social good and examines
the unique requirements, constraints, and opportunities in engineering ML
systems for social impact. We analyze how core ML system components adapt
to resource-constrained environments, explore architectural patterns that en-
able robust deployment across the computing spectrum, and study real-world
implementations in healthcare, agriculture, education, and environmental mon-
itoring. Through these examples and the discussions involved, we develop
frameworks for designing ML systems that deliver sustainable social impact.
19.3.1 Agriculture
Important 11: Plant Village Nuru (video available on YouTube).

In Sub-Saharan Africa, cassava farmers have long battled diseases that devastate crops and livelihoods. Now, with the help of mobile ML-powered smartphone apps, as shown in Figure 19.2, they can snap a photo of a leaf and receive
instant feedback on potential diseases. This early detection system has reduced
cassava losses from 40% to just 5%, offering hope to farmers in disconnected
regions where access to agricultural advisors is limited (Ramcharan et al. 2017).
Across Southeast Asia, rice farmers are confronting increasingly unpre-
dictable weather patterns. In Indonesia, Tiny ML sensors are transforming their
ability to adapt by monitoring microclimates across paddies. These low-power
devices process data locally to optimize water usage, enabling precision irriga-
tion even in areas with minimal infrastructure (Tirtalistyani, Murtiningrum,
and Kanwar 2022).
19.3.2 Healthcare
For millions in underserved communities, access to healthcare often means
long waits and travel to distant clinics. Tiny ML is changing that by enabling
diagnostics to occur at the patient’s side. For example, a low-cost wearable
developed by Respira x Colabs uses embedded machine learning to analyze
cough patterns and detect pneumonia. Designed for remote areas, the device
operates independently of internet connectivity and is powered by a simple
microcontroller, making life-saving diagnostics accessible to those who need it
most.
Tiny ML’s potential extends to tackling global health issues like vector-borne
diseases¹ that are spread by mosquitoes. Researchers have developed low-cost devices that use machine learning to identify mosquito species by their wingbeat frequencies (Altayeb, Zennaro, and Rovai 2022). This technology enables real-time monitoring of malaria-carrying mosquitoes. It offers a scalable solution for malaria control in high-risk regions.

¹ Vector-borne diseases: Illnesses caused by pathogens transmitted by vectors like mosquitoes, ticks, or fleas.
In parallel, Cloud ML is advancing healthcare research and diagnostics on a
broader scale. Platforms like Google Genomics analyze vast datasets to identify
disease markers, accelerating breakthroughs in personalized medicine. These
examples show how AI technologies, ranging from the portability of Tiny ML to the scale of Cloud ML, are broadening access to healthcare diagnostics and research.
data on the collar itself, these devices minimize power consumption and re-
duce the need for frequent battery changes (Verma 2022). Meanwhile, Tiny ML
systems are enabling anti-poaching efforts by detecting threats like gunshots
or human activity and relaying alerts to rangers in real time (Bamoumen et al.
2022).
At a global scale, Cloud ML is being used to monitor illegal fishing activities.
Platforms like Global Fishing Watch analyze satellite data to detect anomalies,
helping governments enforce regulations and protect marine ecosystems. These
examples highlight how AI technologies are enabling real-time monitoring and
decision-making, advancing conservation efforts in profound ways.
Table 19.1: Comparison of resource constraints and challenges across rural deployments, urban deployments, and scaling in machine learning systems for social impact contexts.

| Aspect | Rural Deployment | Urban Deployment | Scaling Challenges |
|---|---|---|---|
| Computational Resources | Microcontroller (ESP32: 240 MHz, 520 KB RAM) | Server-grade systems (100-200 W, 32-64 GB RAM) | Aggressive model quantization (e.g., 50 MB to 500 KB) |
| Power Infrastructure | Solar and battery systems (10-20 W, 2000-3000 mAh battery) | Stable grid power | Optimized power usage (for deployment devices) |
| Network Bandwidth | LoRa, NB-IoT (0.3-50 kbps, 60-250 kbps) | High-bandwidth options | Protocol adjustments (LoRa, NB-IoT, Sigfox: 100-600 bps) |
| Data Availability | Sparse, heterogeneous data sources (500 KB/day from rural clinics) | Large volumes of standardized data (gigabytes from urban hospitals) | Specialized pipelines (for privacy-sensitive data) |
| Model Footprint | Highly quantized models (≤ 1 MB) | Cloud/edge systems (supporting larger models) | Model architecture redesign (for size, power, and bandwidth limits) |
[Figure: hierarchical data flow in which edge devices (sensors) send aggregated data to a regional tier, the cloud tier trains or updates the model, and updated models are sent back down toward the edge and end users.]

Google's Flood Forecasting Initiative demonstrates how the Hierarchical
Processing Pattern supports large-scale environmental monitoring. Edge de-
vices along river networks monitor water levels, performing basic anomaly
detection even without cloud connectivity. Regional centers aggregate this data
and ensure localized decision-making, while the cloud tier integrates inputs
from multiple regions for advanced flood prediction and system-wide updates.
This tiered approach balances local autonomy with centralized intelligence,
ensuring functionality across diverse infrastructure conditions.⁸

⁸ Google's Flood Forecasting Initiative has been instrumental in mitigating flood risks in vulnerable regions, including parts of India and Bangladesh. By combining real-time sensor data with machine learning models, the initiative generates precise flood predictions and timely alerts, reducing disaster-related losses and enhancing community preparedness.

At the edge tier, the system likely employs water-level sensors and local processing units distributed along river networks. These devices perform two critical functions: continuous monitoring of water levels at regular intervals (e.g., every 15 minutes) and preliminary time-series analysis to detect significant changes. Constrained by the tight power envelope (a few watts of power), edge devices utilize quantized models for anomaly detection, enabling low-power operation and minimizing the volume of data transmitted to higher tiers. This localized processing ensures that key monitoring tasks can continue independently of network connectivity.

The regional tier operates at district-level processing centers, each responsible for managing data from hundreds of sensors across its jurisdiction. At this tier,
more sophisticated neural network models are employed to combine sensor
data with additional contextual information, such as local terrain features and
historical flood patterns. This tier reduces the data volume transmitted to the
cloud by aggregating and extracting meaningful features while maintaining
critical decision-making capabilities during network disruptions. By operating
independently when required, the regional tier enhances system resilience and
ensures localized monitoring and alerts remain functional.
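To make the edge-tier role concrete, the sketch below shows the kind of lightweight, dependency-free anomaly check a sensor node in such a hierarchy might run locally; the window size, threshold, and reading cadence are illustrative assumptions.

```python
from collections import deque
from statistics import mean, stdev

class WaterLevelMonitor:
    """Rolling z-score detector small enough for a microcontroller-class device."""
    def __init__(self, window: int = 96, threshold: float = 4.0):
        self.readings = deque(maxlen=window)   # e.g., 96 readings = 24 h at 15-minute intervals
        self.threshold = threshold

    def update(self, level_cm: float) -> bool:
        """Record a reading and return True if it deviates sharply from recent history."""
        is_anomaly = False
        if len(self.readings) >= 8:            # wait for a minimal history before judging
            mu, sigma = mean(self.readings), stdev(self.readings)
            if sigma > 0 and abs(level_cm - mu) / sigma > self.threshold:
                is_anomaly = True              # only anomalous readings are reported upstream
        self.readings.append(level_cm)
        return is_anomaly
```

Only readings flagged by such a check, plus periodic summaries, would be transmitted to the regional tier, keeping bandwidth and energy use low.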
At the cloud tier, the system integrates data from regional centers with exter-
nal sources such as satellite imagery and weather data to implement the full
machine learning pipeline. This includes training and running advanced flood
prediction models, generating inundation maps, and distributing predictions
to stakeholders. The cloud tier provides the computational resources needed
for large-scale analysis and system-wide updates. However, the hierarchical
structure ensures that essential monitoring and alerting functions can continue
autonomously at the edge and regional tiers, even when cloud connectivity is
unavailable.
This implementation reveals several key principles of successful Hierarchical
Processing Pattern deployments. First, the careful segmentation of ML tasks
across tiers enables graceful degradation. Each tier maintains critical function-
ality even when isolated. Second, the progressive enhancement of capabilities
as higher tiers become available demonstrates how systems can adapt to vary-
ing resource availability. Finally, the bidirectional flow of information, where
sensor data moves upward and model updates flow downward, creates a robust
feedback loop that improves system performance over time. These principles
extend beyond flood forecasting to inform hierarchical ML deployments across
various social impact domains.
19.6.1.2 Structure
The Hierarchical Processing Pattern implements specific architectural compo-
nents and relationships that enable its distributed operation. Understanding
these structural elements is crucial for effective implementation across different
deployment scenarios.
One of the most significant implications for machine learning is the need to
manage dynamic model behavior across tiers. Unlike static systems, ML models
require regular updates to adapt to new data distributions, prevent model
drift, and maintain accuracy. The hierarchical structure inherently supports
this requirement by allowing the cloud tier to handle centralized training and
model updates while propagating refined models to regional and edge tiers.
However, this introduces challenges in synchronization, as edge and regional
tiers must continue operating with older model versions when updates are
delayed due to connectivity issues. Designing robust versioning systems and
ensuring seamless transitions between model updates is critical to the success
of such systems.
Data flows are another area where machine learning systems impose unique
demands. Unlike traditional hierarchical systems, ML systems must handle
large volumes of data across tiers, ranging from raw inputs at the edge to
aggregated and preprocessed datasets at regional and cloud tiers. Each tier must
be optimized for the specific data-processing tasks it performs. For instance,
edge devices often filter or preprocess raw data to reduce transmission overhead
while retaining information critical for inference. Regional tiers aggregate these
inputs, performing intermediate-level analysis or feature extraction to support
downstream tasks. This multistage data pipeline not only reduces bandwidth
requirements but also ensures that each tier contributes meaningfully to the
overall ML workflow.
The Hierarchical Processing Pattern also enables adaptive inference, a key
consideration for deploying ML models across environments with varying
computational resources. By leveraging the computational capabilities of each
tier, systems can dynamically distribute inference tasks to balance latency,
energy consumption, and accuracy. For example, an edge device might handle
basic anomaly detection to ensure real-time responses, while more sophisticated
inference tasks are offloaded to the cloud when resources and connectivity allow. This dynamic distribution is essential for resource-constrained environments, where energy efficiency and responsiveness are paramount.
Hardware advancements have further shaped the application of the Hierar-
chical Processing Pattern to machine learning. The proliferation of specialized
edge hardware, such as AI accelerators and low-power GPUs, has enabled edge
devices to handle increasingly complex ML tasks, narrowing the performance
gap between tiers. Regional tiers have similarly benefited from innovations such
as federated learning, where models are collaboratively improved across devices
without requiring centralized data collection. These advancements enhance
the autonomy of lower tiers, reducing the dependency on cloud connectivity
and enabling systems to function effectively in decentralized environments.
Finally, machine learning introduces the challenge of balancing local auton-
omy with global coordination. Edge and regional tiers must be able to make
localized decisions based on the data available to them while remaining syn-
chronized with the global state maintained at the cloud tier. This requires
careful design of interfaces between tiers to manage not only data flows but
also model updates, inference results, and feedback loops. For instance, sys-
tems employing federated learning must coordinate the aggregation of locally
trained model updates without requiring the exchange of raw data.
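A minimal sketch of such an aggregation step, in the spirit of federated averaging, is shown below; the client state dictionaries and sample counts are placeholders, and a production system would add secure aggregation, versioning, and handling of stale updates.

```python
import torch

def federated_average(client_states, client_sizes):
    """Aggregate client model state_dicts into a global state, weighting each
    client by the number of local samples it trained on (FedAvg-style)."""
    total = float(sum(client_sizes))
    global_state = {}
    for name, reference in client_states[0].items():
        if not torch.is_floating_point(reference):
            # Copy integer buffers (e.g., step counters) from the first client unchanged.
            global_state[name] = reference.clone()
            continue
        global_state[name] = sum(
            state[name] * (n / total) for state, n in zip(client_states, client_sizes)
        )
    return global_state

# Example usage with clients that share the same architecture (hypothetical):
# new_state = federated_average([m1.state_dict(), m2.state_dict()], [1200, 300])
# global_model.load_state_dict(new_state)
```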
19.6.1.5 Limitations
Despite its strengths, the Hierarchical Processing Pattern encounters several
fundamental constraints in real-world deployments, particularly when applied
to machine learning systems. These limitations arise from the distributed nature
of the architecture, the variability of resource availability across tiers, and the
inherent complexities of maintaining consistency and efficiency at scale.
The distribution of processing capabilities introduces significant complexity
in resource allocation and cost management. Regional processing nodes must
navigate trade-offs between local computational needs, hardware costs, and
energy consumption. In battery-powered deployments, the energy efficiency
of local computation versus data transmission becomes a critical factor. These
constraints directly affect the scalability and operational costs of the system, as
additional nodes or tiers may require significant investment in infrastructure
and hardware.
Time-critical operations present unique challenges in hierarchical systems.
While edge processing reduces latency for local decisions, operations requiring
cross-tier coordination introduce unavoidable delays. For instance, anomaly
detection systems that require consensus across multiple regional nodes face
inherent latency limitations. This coordination overhead can make hierarchical
architectures unsuitable for applications requiring sub-millisecond response
times or strict global consistency.
Training data imbalances across regions create additional complications.
Different deployment environments often generate varying quantities and types
of data, leading to model bias and performance disparities. For example, urban
areas typically generate more training samples than rural regions, potentially
causing models to underperform in less data-rich environments. This imbalance
can be particularly problematic in systems where model performance directly
impacts critical decision-making processes.
System maintenance and debugging introduce practical challenges that grow
with scale. Identifying the root cause of performance degradation becomes
increasingly complex when issues can arise from hardware failures, network
conditions, model drift, or interactions between tiers. Traditional debugging
approaches often prove inadequate, as problems may manifest only under spe-
cific combinations of conditions across multiple tiers. This complexity increases
operational costs and requires specialized expertise for system maintenance.
19.6.2.2 Structure
The progressive enhancement pattern organizes systems into layered func-
tionalities, each designed to operate within specific resource conditions. This
structure begins with a set of capabilities that function under minimal com-
putational or connectivity constraints, progressively incorporating advanced
features as additional resources become available.
Table 19.2 outlines the resource specifications and capabilities across the
pattern’s three primary layers:
[Figure: layered structure of the pattern, from a baseline layer through an intermediate layer; the system falls back to lower layers as resources decrease and moves to higher layers as resources increase.]
19.6.2.5 Limitations
While the progressive enhancement pattern offers significant advantages for ML
system deployment, it introduces several technical challenges that impact im-
plementation feasibility and system performance. These challenges particularly
affect model management, resource optimization, and system reliability.
Model version proliferation presents a fundamental challenge. Each enhance-
ment layer typically requires multiple model variants (often 3-5 per layer) to
handle different resource scenarios, creating a combinatorial explosion in model
management overhead. For example, a computer vision system supporting
three enhancement layers might require up to 15 different model versions,
each needing individual maintenance, testing, and validation. This complexity
increases exponentially when supporting multiple tasks or domains.
Performance consistency across enhancement layers introduces significant
technical hurdles. Models operating at the baseline layer (typically limited
to 100-500 KB size) must maintain at least 85-90% of the accuracy achieved
by advanced models while using only 1-5% of the computational resources.
Achieving this efficiency-accuracy trade-off becomes increasingly difficult as
task complexity increases. Systems often struggle to maintain consistent infer-
ence behavior when transitioning between layers, particularly when handling
edge cases or out-of-distribution inputs.
Resource allocation optimization presents another critical limitation. Systems
must continuously monitor and predict resource availability while managing
the overhead of these monitoring systems themselves. The decision-making pro-
cess for switching between enhancement layers introduces additional latency
(typically 50-200 ms), which can impact real-time applications. This overhead
becomes particularly problematic in environments with rapidly fluctuating
resource availability.
Infrastructure dependencies create fundamental constraints on system ca-
pabilities. While baseline functionality operates within minimal requirements
(50-150 mW power consumption, 2G network speeds), achieving full system
potential requires substantial infrastructure improvements. The gap between
baseline and enhanced capabilities often spans several orders of magnitude in
computational requirements, creating significant disparities in system perfor-
mance across deployment environments.
User experience continuity suffers from the inherent variability in system
behavior across enhancement layers. Output quality and response times can
vary significantly—from basic binary classifications at the baseline layer to
detailed probabilistic predictions with confidence intervals at advanced layers.
These variations can undermine user trust, particularly in critical applications
where consistency is essential.
These limitations necessitate careful consideration during system design
and deployment. Successful implementations require robust monitoring sys-
tems, graceful degradation mechanisms, and clear communication of system
capabilities and limitations to end users.
image data from several megabytes to compact insight vectors of just a few
kilobytes.
The system's distributed knowledge-sharing architecture enables
effective collaboration between nodes despite connectivity limitations. Camera
traps form local mesh networks¹¹ using low-power radio protocols, sharing processed insights rather than raw data. This peer-to-peer communication allows the network to maintain collective awareness of wildlife movements and potential threats across the monitored area. When one node detects significant activity, including the presence of an endangered species or indications of poaching, this information propagates through the network, enabling coordinated responses even in areas with no direct connectivity to central infrastructure.

¹¹ Mesh Network: A network topology in which each node relays data for the network. All nodes cooperate in the distribution of data in the network.
When periodic connectivity becomes available through satellite or cellular
links, nodes synchronize their accumulated knowledge with cloud infrastruc-
ture. This synchronization process carefully balances the need for data sharing
with bandwidth limitations, employing differential updates and compression
techniques. The cloud tier then applies more sophisticated analytical models
to understand population dynamics and movement patterns across the entire
monitored region.
The Wildlife Insights implementation demonstrates how distributed knowledge sharing can maintain system effectiveness even in challenging
environments. By distributing both processing and decision-making capabili-
ties across the network, the system ensures continuous monitoring and rapid
response capabilities while operating within the severe constraints of remote
wilderness deployments. This approach has proven particularly valuable for
conservation efforts, enabling real-time wildlife monitoring and threat detec-
tion across vast areas that would be impractical to monitor through centralized
systems.¹²

¹² Camera traps have been widely used for ecological monitoring since the early 20th century. Initially reliant on physical film, they transitioned to digital and, more recently, AI-enabled systems, enhancing their ability to automate data analysis and extend deployment durations.

19.6.3.2 Structure

The Distributed Knowledge Pattern comprises specific architectural components designed to enable decentralized data collection, processing, and knowledge sharing. The pattern defines three primary structural elements: autonomous nodes, communication networks, and aggregation mechanisms.

Figure 19.6 illustrates the key components and their interactions within the Distributed Knowledge Pattern. Individual nodes (rectangular shapes) operate
autonomously while sharing insights through defined communication channels.
The aggregation layer (diamond shape) combines distributed knowledge, which
feeds into the analysis layer (oval shape) for processing.
Autonomous nodes form the foundation of the pattern’s structure. Each node
implements three essential capabilities: data acquisition, local processing, and
knowledge sharing. The local processing pipeline typically includes feature
extraction, basic inference, and data filtering mechanisms. This architecture
enables nodes to operate independently while contributing to the network’s
collective intelligence.
The communication layer establishes pathways for knowledge exchange be-
tween nodes. This layer implements both peer-to-peer protocols for direct node
communication and hierarchical protocols for aggregation. The communication
often centralized, requiring large amounts of data to be sent to the cloud for
processing. With the advent of smaller, more efficient machine learning models
designed for edge devices, these models can now be deployed directly on the
nodes themselves. For example, low-power devices such as smartphones or IoT
sensors can run lightweight models for tasks like anomaly detection or image
classification. This enables more sophisticated data analysis at the source,
allowing for quicker decision-making and reducing reliance on central cloud
services.
In terms of network communication, modern mesh networks and 5G tech-
nology have significantly improved the efficiency and speed of data sharing
between nodes. Mesh networks allow nodes to communicate with each other di-
rectly, forming a self-healing and scalable network. This decentralized approach
to communication ensures that even if a node or connection fails, the network
can still operate seamlessly. With the advent of 5G, the bandwidth and latency
issues traditionally associated with large-scale data transfer in distributed sys-
tems are mitigated, enabling faster and more reliable communication between
nodes in real-time applications.
19.6.3.5 Limitations
While the Distributed Knowledge Pattern offers many advantages, particularly
in decentralized, resource-constrained environments, it also presents several
challenges, especially when applied to machine learning systems. These chal-
lenges stem from the complexity of managing distributed nodes, ensuring data
consistency, and addressing the constraints of decentralized systems.
One of the primary challenges is model synchronization and consistency. In
distributed systems, each node may operate with its own version of a machine
learning model, which is trained using local data. As these models are up-
dated over time, ensuring consistency across all nodes becomes a difficult task.
Without careful synchronization, nodes may operate using outdated models,
leading to inconsistencies in the system’s overall performance. Furthermore,
when nodes are intermittently connected or have limited bandwidth, synchro-
nizing model updates across all nodes in real-time can be resource-intensive
and prone to delays.
The issue of data fragmentation is another significant challenge. In a dis-
tributed system, data is often scattered across different nodes, and each node
may have access to only a subset of the entire dataset. This fragmentation can
limit the effectiveness of machine learning models, as the models may not be
exposed to the full range of data needed for training. Aggregating data from
multiple sources and ensuring that the data from different nodes is compatible
for analysis is a complex and time-consuming process. Additionally, because
some nodes may operate in offline modes or have intermittent connectivity,
data may be unavailable for periods, further complicating the process.
Scalability also poses a challenge in distributed systems. As the number of
nodes in the network increases, so does the volume of data generated and the
complexity of managing the system. The system must be designed to handle this
growth without overwhelming the infrastructure or degrading performance.
The addition of new nodes often requires rebalancing data, recalibrating models,
or introducing new coordination mechanisms, all of which can increase the
complexity of the system.
Latency is another issue that arises in distributed systems. While data is
processed locally on each node, real-time decision-making often requires the
aggregation of insights from multiple nodes. The time it takes to share data
and updates between nodes, and the time needed to process that data, can
introduce delays in system responsiveness. In applications like autonomous
systems or disaster response, these delays can undermine the effectiveness of
the system, as immediate action is often necessary.
Finally, security and privacy concerns are magnified in distributed systems. Since data is often transmitted between nodes or stored across multiple devices, the attack surface for interception or unauthorized access grows, and protecting sensitive information becomes correspondingly harder.

[Figure: feedback loop of the Adaptive Resource Pattern, in which the system recalibrates between low-, medium-, and high-resource operating modes based on adaptation feedback.]
In the diagram, when the system is operating under low resources, it switches
to simplified operations, ensuring basic functionality with minimal resource
use. As resources become more available, the system adjusts to medium re-
sources, enabling more moderate operations and optimized functionality. When
resources are abundant, the system can leverage high resources, enabling ad-
vanced operations and full capabilities, such as processing complex data or
running resource-intensive tasks.
The feedback loop is an essential part of this pattern, as it ensures continuous
adjustment based on the system’s resource conditions. This feedback allows
the system to recalibrate and adapt in real-time, scaling resources up or down
to maintain optimal performance.
19.6.4.2 Structure
The Adaptive Resource Pattern revolves around dynamically allocating re-
sources in response to changing environmental conditions, such as network
bandwidth, computational power, or storage. This requires the system to moni-
tor available resources continuously and adjust its operations accordingly to
ensure optimal performance and efficiency.
It is structured around several key components. First, the system needs a
monitoring mechanism to constantly evaluate the availability of resources. This
can involve checking network bandwidth, CPU utilization, memory usage, or
other relevant metrics. Once these metrics are gathered, the system can then
determine the appropriate course of action—whether it needs to scale up, down,
or adjust its operations to conserve resources.
Next, the system must include an adaptive decision-making process that
interprets these metrics and decides how to allocate resources dynamically.
In high-resource environments, the system might increase the complexity of
tasks, using more powerful computational models or increasing the number of
concurrent processes. Conversely, in low-resource environments, the system
may scale back operations, reduce the complexity of models, or shift some tasks
to local devices (such as edge processing) to minimize the load on the central
infrastructure.
An important part of this structure is the feedback loop, which allows the
system to adjust its resource allocation over time. After making an initial
decision based on available resources, the system monitors the outcome and
adapts accordingly. This process ensures that the system continues to operate
effectively even as resource conditions change. The feedback loop helps the
system fine-tune its resource usage, leading to more efficient operations as it
learns to optimize resource allocation.
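As an illustrative sketch of such a monitoring-and-selection loop (the memory thresholds, tier names, and use of the psutil library are assumptions rather than a prescribed design):

```python
import psutil

# Hypothetical model variants, ordered from most to least capable.
MODEL_TIERS = [
    {"name": "full_fp32",      "min_free_mb": 4096},
    {"name": "quantized_int8", "min_free_mb": 512},
    {"name": "tiny_distilled", "min_free_mb": 0},     # always-available baseline
]

def select_model_tier() -> str:
    """Pick the most capable model variant that fits the currently free memory."""
    free_mb = psutil.virtual_memory().available / (1024 * 1024)
    for tier in MODEL_TIERS:
        if free_mb >= tier["min_free_mb"]:
            return tier["name"]
    return MODEL_TIERS[-1]["name"]

# A monitoring loop would call select_model_tier() periodically and reload the
# corresponding model only when the selected tier changes, implementing the
# feedback loop described above while avoiding churn during brief fluctuations.
```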
The system can also be organized into different tiers or layers based on the
complexity and resource requirements of specific tasks. For instance, tasks
requiring high computational resources, such as training machine learning
models or processing large datasets, could be handled by a cloud layer, while
simpler tasks, such as data collection or pre-processing, could be delegated
to edge devices or local nodes. The system can then adapt the tiered struc-
ture based on available resources, allocating more tasks to the cloud or edge
depending on the current conditions.
19.6.4.5 Limitations
The Adaptive Resource Pattern faces several fundamental constraints in practical
implementations, particularly when applied to machine learning systems in
resource-variable environments. These limitations arise from the inherent
complexities of real-time adaptation and the technical challenges of maintaining
system performance across varying resource levels.
Performance predictability presents a primary challenge in adaptive systems.
While adaptation enables systems to continue functioning under varying con-
ditions, it can lead to inconsistent performance characteristics. For example,
when a system transitions from high to low resource availability (e.g., from
8 GB to 500 MB RAM), inference latency might increase from 50 ms to 200 ms.
Managing these performance variations while maintaining minimum quality-
of-service requirements becomes increasingly complex as the range of potential
resource states expands.
State synchronization introduces significant technical hurdles in adaptive
systems. As resources fluctuate, maintaining consistent system state across
components becomes challenging. For instance, when adapting to reduced
network bandwidth (from 50 Mbps to 50 Kbps), systems must manage par-
tial updates and ensure that critical state information remains synchronized.
This challenge is particularly acute in distributed ML systems, where model
states and inference results must remain consistent despite varying resource
conditions.
Resource transition overhead poses another fundamental limitation. Adapt-
ing to changing resource conditions incurs computational and time costs. For
example, switching between different model architectures (from a 50 MB full
model to a 5 MB quantized version) typically requires 100-200 ms of transition
time. During these transitions, system performance may temporarily degrade
or become unpredictable. This overhead becomes particularly problematic in
environments where resources fluctuate frequently.
Quality degradation management presents ongoing challenges, especially
in ML applications. As systems adapt to reduced resources, maintaining ac-
ceptable quality metrics becomes increasingly difficult. For instance, model
accuracy might drop from 95% to 85% when switching to lightweight architec-
tures, while energy consumption must stay within strict limits (typically 50-150
mW for edge devices). Finding acceptable trade-offs between resource usage
and output quality requires sophisticated optimization strategies.
These limitations necessitate careful system design and implementation
strategies. Successful deployments often implement robust monitoring systems,
graceful degradation mechanisms, and clear quality thresholds for different
resource states. While these challenges don’t negate the pattern’s utility, they
emphasize the importance of thorough planning and realistic performance
expectations in adaptive system deployments.
Figure 19.8: Quadrant mapping of design patterns for AI for Social Good projects based on resource availability and scalability/adaptability needs. The quadrants range from low to high scalability/adaptability and include projects such as Google's Flood Forecasting, Global Fishing Watch, PlantVillage Nuru, Medic Mobile, Wildlife Insights, the Famine Action Mechanism, and WildEyes AI.
Pattern Core Idea Strengths Challenges Best Use Case
Adap- Dynamically adjusts Resource efÏciency Predicting resource Real-time systems operating
tive operations based on and real-time demand and under fluctuating resource
Resource resource availability. adaptability managing conditions (e.g., disaster
trade-offs between response systems).
performance and
simplicity.
The implementation approach for each pattern should align with both its
position in the resource-adaptability space and its core characteristics. In low-
resource, high-adaptability environments, Progressive Enhancement imple-
mentations focus on establishing reliable baseline capabilities that can scale
smoothly as resources become available. This often involves careful coordi-
nation between local processing and cloud resources, ensuring that systems
maintain functionality even when operating at minimal resource levels.
Hierarchical Processing Pattern implementations, suited for environments
with more stable infrastructure, require careful attention to the interfaces be-
tween tiers. The key challenge lies in managing the flow of data and model
updates across the hierarchy while maintaining system responsiveness. This
becomes particularly critical in social impact applications where real-time re-
sponse capabilities often determine intervention effectiveness.
Distributed Knowledge Pattern implementations emphasize resilient peer-
to-peer operations, particularly important in environments where centralized
coordination isn’t feasible. Success depends on establishing efficient knowledge-
sharing protocols that maintain system effectiveness while operating within
strict resource constraints. This pattern’s implementation often requires careful
balance between local autonomy and network-wide consistency.
The Adaptive Resource Pattern implementations focus on dynamic resource
management, particularly crucial in environments with fluctuating resource
availability. These systems require sophisticated monitoring and control mech-
anisms that can adjust operations in real-time while maintaining essential
functionality. The implementation challenge lies in managing these transitions
smoothly without disrupting critical operations.
19.8 Conclusion
The potential of AI for addressing societal challenges is undeniable. However,
the path to successful deployment is anything but straightforward. ML sys-
tems for social good are not “plug-and-play” solutions, as they are complex
engineering endeavors.
These systems must be tailored to operate under severe constraints, such
as limited power, unreliable connectivity, and sparse data, all while meeting
the needs of underserved communities. Designing for these environments is
as rigorous and demanding as developing systems for urban deployments,
often requiring even more ingenuity to overcome unique challenges. Every
component, from data collection to model deployment, must be reimagined to
suit these constraints and deliver meaningful outcomes.
Machine learning systems for social impact necessitate the systematic appli-
cation of design patterns to address these unique complexities. The patterns
examined in this chapter, including Hierarchical Processing, Progressive En-
hancement, Distributed Knowledge, and Adaptive Resource, establish frame-
works for addressing these challenges while ensuring systems remain effective
and sustainable across diverse deployment contexts.
The implementation of these patterns depends fundamentally on a com-
prehensive understanding of both the operational environment and system
requirements. Resource availability and adaptability requirements typically
determine initial pattern selection, while specific implementation decisions
must account for network reliability, computational constraints, and scalability
requirements. The efficacy of social impact applications depends not only on
pattern selection but on implementation strategies that address local constraints
while maintaining system performance.
These patterns will evolve as technological capabilities advance and deploy-
ment contexts transform. Developments in edge computing, federated learning,
and adaptive ML architectures will expand the potential applications of these
patterns, particularly in resource-constrained environments. However, the core
principles, such as accessibility, reliability, and scalability, remain fundamental
to developing ML systems that generate meaningful social impact.
19.9 Resources
Slides
• Coming soon.
Videos
• Coming soon.
Exercises
• Coming soon.
Chapter 20
Conclusion
20.1 Overview
This book examines the rapidly evolving field of ML systems. We focused on
systems because, while there are many resources on ML models and algorithms,
much less is understood about how to build the systems that run them.
To draw an analogy, consider the process of building a car. While many
resources are available on the various components of a car, such as the engine,
transmission, and suspension, there is often a need for more understanding
about how to assemble these components into a functional vehicle. Just as a car
requires a well-designed and properly integrated system to operate efficiently
and reliably, ML models also require a robust and carefully constructed system
to deliver their full potential. Moreover, there is a lot of nuance in building ML
systems, given their specific use case. For example, a Formula 1 race car must
be assembled differently from an everyday Prius consumer car.
Our journey started by tracing ML’s historical trajectory, from its theoretical
foundations to its current state as a transformative force across industries. We
explored the building blocks of machine learning models and demonstrated
how their architectures, when examined through the lens of computer architec-
ture, reveal structural similarities.
Throughout this book, we have looked into the intricacies of ML systems,
examining the critical components and best practices necessary to create a
seamless and efficient pipeline. From data preprocessing and model training to
deployment and monitoring, we have provided insights and guidance to help
readers navigate the complex landscape of ML system development.
ML systems involve complex workflows, spanning various topics from data
engineering to model deployment on diverse systems. By providing an overview
of these ML system components, we have aimed to showcase the tremendous
depth and breadth of the field and expertise that is needed. Understanding the
intricacies of ML workflows is crucial for practitioners and researchers alike,
as it enables them to navigate the landscape effectively and develop robust,
efficient, and impactful ML solutions.
By focusing on the systems aspect of ML, we aim to bridge the gap between
theoretical knowledge and practical implementation. Just as a healthy human
body system allows the organs to function optimally, a well-designed ML system
enables the models to consistently deliver accurate and reliable results. This
book’s goal is to empower readers with the knowledge and tools necessary to
build ML systems that showcase the underlying models’ power and ensure
smooth integration and operation, much like a well-functioning human body.
We explored machine learning frameworks and the features that define them. We also looked into the specialization of frameworks tailored
to specific needs, such as those designed for embedded AI. We discussed the
criteria for selecting the most suitable framework for a given project.
Our exploration also touched upon the future trends expected to shape the
landscape of ML frameworks in the coming years. As the field continues to
evolve, we can anticipate the emergence of more specialized and optimized
frameworks that cater to the unique requirements of different domains and
deployment scenarios, as we saw with TensorFlow Lite for Microcontrollers. By
staying abreast of these developments and understanding the tradeoffs involved
in framework selection, we can make informed decisions and leverage the most
appropriate tools to build efficient ML systems.
Ethical considerations become paramount as AI systems become more integrated into our lives and decision-making processes.
As AI systems become more pervasive and influential, it is important to
ensure that they are designed and deployed in a manner that upholds ethical
principles. This means actively mitigating biases, promoting fairness, and
preventing discriminatory outcomes. Additionally, ethical AI design ensures
transparency in how AI systems make decisions, enabling users to understand
and trust their outputs.
Accountability is another critical ethical consideration. As AI systems take on
more responsibilities and make decisions that impact individuals and society,
there must be clear mechanisms for holding these systems and their creators
accountable. This includes establishing frameworks for auditing and monitor-
ing AI systems and defining liability and redress mechanisms in case of harm
or unintended consequences.
Ethical frameworks, regulations, and standards will be essential to address
these ethical challenges. These frameworks should guide the responsible de-
velopment and deployment of AI technologies, ensuring that they align with
societal values and promote the well-being of individuals and communities.
Moreover, ongoing discussions and collaborations among researchers, prac-
titioners, policymakers, and society will be important in navigating the ethical
landscape of AI. These conversations should be inclusive and diverse, bringing
together different perspectives and expertise to develop comprehensive and
equitable solutions. As we move forward, it is the collective responsibility of
all stakeholders to prioritize ethical considerations in the development and
deployment of AI systems.
20.12 Sustainability
The increasing computational demands of machine learning, particularly for
training large models, have raised concerns about their environmental impact
due to high energy consumption and carbon emissions. As the scale and com-
plexity of models continue to grow, addressing the sustainability challenges
associated with AI development becomes imperative. To mitigate the envi-
ronmental footprint of AI, the development of energy-efficient algorithms is
necessary. This involves optimizing models and training procedures to mini-
mize computational requirements while maintaining performance. Techniques
such as model compression, quantization, and efficient neural architecture
search can help reduce the energy consumption of AI systems.
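As one concrete example of such techniques, post-training quantization can be applied with a few lines of TensorFlow Lite; this is only a minimal sketch, and "saved_model_dir" is a placeholder path for an already-trained model.

import tensorflow as tf

# Convert a trained model with dynamic-range quantization enabled.
converter = tf.lite.TFLiteConverter.from_saved_model("saved_model_dir")
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_model = converter.convert()

# The resulting file is typically several times smaller than the float model.
with open("model_quant.tflite", "wb") as f:
    f.write(tflite_model)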
Using renewable energy sources to power AI infrastructure is another im-
portant step towards sustainability. By transitioning to clean energy sources
such as solar, wind, and hydropower, the carbon emissions associated with
AI development can be significantly reduced. This requires a concerted effort
from the AI community and support from policymakers and industry leaders
to invest in and adopt renewable energy solutions. In addition, exploring alter-
native computing paradigms, such as neuromorphic and photonic computing,
holds promise for developing more energy-efficient AI systems. By developing
hardware and algorithms that emulate the brain’s processing mechanisms, we
can potentially create AI systems that are both powerful and sustainable.
20.16 Congratulations
Congratulations on coming this far, and best of luck in your future endeavors!
The future of AI is bright and filled with endless possibilities. It will be exciting
to see the incredible contributions you will make to this field.
Feel free to reach out to me anytime at vj at eecs dot harvard dot edu.
– Prof. Vijay Janapa Reddi, Harvard University
LABS
Overview
Welcome to the hands-on labs section, where you’ll explore deploying machine
learning (ML) models onto real embedded devices, offering a practical intro-
duction to ML systems. Unlike traditional approaches with large-scale models,
these labs focus on interacting directly with both hardware and software. They
help us showcase various sensor modalities across different application use
cases. This approach provides valuable insights into the challenges and oppor-
tunities of deploying AI on real physical systems.
Learning Objectives
By completing these labs, we hope learners will:
Tip
Target Audience
These labs are designed for:
Supported Devices
We have included laboratory materials for four key devices that represent
different hardware profiles and capabilities.
• Nicla Vision: Optimized for vision-based applications like image classifi-
cation and object detection, ideal for compact, low-power use cases. It is
also suitable for keyword spotting and motion detection tasks.
• XIAO ESP32S3: A versatile, compact board suitable for vision, keyword
spotting, and motion detection tasks.
• Grove Vision AI V2: Equipped with a dedicated Neural Processing Unit
(NPU), this device enables more advanced machine learning tasks with en-
hanced on-device inference capabilities, making it ideal for sophisticated
computer vision and AI applications.
• Raspberry Pi: A flexible platform for more computationally intensive
tasks, including small language models and various classification and
detection applications.
Lab Structure
Each lab follows a structured approach:
1. Introduction: Explore the application and its significance in real-world
scenarios.
2. Setup: Step-by-step instructions to configure the hardware and software
environment.
3. Deployment: Guidance on training and deploying the pre-trained ML
models on supported devices.
4. Exercises: Hands-on tasks to modify and experiment with model param-
eters.
5. Discussion: Analysis of results, potential improvements, and practical
insights.
Credits
Special credit and thanks to Prof. Marcelo Rovai for his valuable contributions
to the development and continuous refinement of these labs.
Getting Started
Hardware Requirements
To follow along with the hands-on labs, you’ll need the following hardware:
1. Arduino Nicla Vision board
• The Arduino Nicla Vision is a powerful, compact board designed
for professional-grade computer vision and audio applications. It
features a high-quality camera module, a digital microphone, and an
IMU, making it suitable for demanding projects in industries such
as robotics, automation, and surveillance.
• Arduino Nicla Vision specifications
• Arduino Nicla Vision pinout diagram
2. XIAO ESP32S3 Sense board
• The Seeed Studio XIAO ESP32S3 Sense is a tiny, feature-packed board
designed for makers, hobbyists, and students interested in exploring
edge AI applications. It comes equipped with a camera, microphone,
and IMU, making it easy to get started with projects such as image
classification, keyword spotting, and motion detection.
• XIAO ESP32S3 Sense specifications
• XIAO ESP32S3 Sense pinout diagram
3. Grove Vision AI V2 board
• The Seeed Studio Grove Vision AI V2 is a compact, low-power, yet
powerful device. It is an MCU-based system powered by the Arm
Cortex-M55 and vision AI module Ethos-U55. It supports Tensor-
Flow and PyTorch frameworks and is compatible with the Arduino
IDE. With the SenseCraft AI algorithm platform, trained machine
learning (ML) models can be deployed to the sensor without the
need for coding. It features a standard CSI interface.
Software Requirements
To program the boards and develop embedded machine learning projects, you’ll
need the following software:
1. Arduino IDE
• Download and install
– Install Arduino IDE
– Follow the installation guide for your specific OS.
– Arduino CLI
– Configure the Arduino IDE for the Arduino Nicla Vision and
XIAO ESP32S3 Sense boards.
Network Connectivity
Some projects may require internet connectivity for data collection or model de-
ployment. Ensure your development environment connection is stable through
Wi-Fi or Ethernet. For the Raspberry Pi, having a Wi-Fi or Ethernet connection
is necessary for remote operation without needing to plug in a monitor,
keyboard, and mouse.
• For the Arduino Nicla Vision, you can use the onboard Wi-Fi module to
connect to a wireless network.
• For the XIAO ESP32S3 Sense, you can use the onboard Wi-Fi module or
connect an external Wi-Fi or Ethernet module using the available pins.
• For the Grove Vision AI V2, you can use the onboard Wi-Fi module on the
Master Controller (for example the XIAO ESP32S3) or connect an external
Wi-Fi or Ethernet module using the available pins.
• For the Raspberry Pi, you can use the onboard Wi-Fi module or connect an
external Wi-Fi or Ethernet module using the available connector.
Conclusion
With your hardware and software set up, you’re ready to embark on your em-
bedded machine learning journey. The hands-on labs will guide you through
various projects, covering topics like image classification, object detection, key-
word spotting, and motion classification.
If you encounter any issues or have questions, don’t hesitate to consult the
troubleshooting guides or forums or seek support from the community.
Let’s dive in and unlock the potential of ML on real (tiny) systems!
Nicla Vision
Pre-requisites
• Nicla Vision Board: Ensure you have the Nicla Vision board.
• USB Cable: For connecting the board to your computer.
• Network: With internet access for downloading necessary software.
Setup
• Setup Nicla Vision
Exercises
Overview
The Arduino Nicla Vision (sometimes called NiclaV) is a development board
that includes two processors that can run tasks in parallel. It is part of a family
of development boards with the same form factor but designed for specific
tasks, such as the Nicla Sense ME and the Nicla Voice. The Niclas can efficiently
run processes created with TensorFlow Lite. For example, one of the cores
of the NiclaV runs a computer vision algorithm on the fly (inference). At the
same time, the other executes low-level operations like controlling a motor and
communicating or acting as a user interface. The onboard wireless module
allows the simultaneous management of WiFi and Bluetooth Low Energy (BLE)
connectivity.
Hardware
Two Parallel Cores
The central processor is the dual-core STM32H747, including a Cortex M7 at 480
MHz and a Cortex M4 at 240 MHz. The two cores communicate via a Remote
Procedure Call mechanism that seamlessly allows calling functions on the other
processor. Both processors share all the on-chip peripherals and can run:
• Arduino sketches on top of the Arm Mbed OS
• Native Mbed applications
• MicroPython / JavaScript via an interpreter
• TensorFlow Lite
Memory
Memory is crucial for embedded machine learning projects. The NiclaV board
can host up to 16 MB of QSPI flash for storage. However, it is essential to
consider that the MCU SRAM is what is used for machine learning inference;
the STM32H747 provides only 1 MB, shared by both processors. This MCU
also incorporates 2 MB of flash, mainly for code storage.
Sensors
• Camera: A GC2145 2 MP Color CMOS Camera.
• Microphone: The MP34DT05 is an ultra-compact, low-power, omnidirec-
tional, digital MEMS microphone built with a capacitive sensing element
and the IC interface.
• 6-Axis IMU: 3D gyroscope and 3D accelerometer data from the LSM6DSOX
6-axis IMU.
• Time of Flight Sensor: The VL53L1CBV0FY Time-of-Flight sensor adds
accurate and low-power-ranging capabilities to Nicla Vision. The invisible
near-infrared VCSEL laser (including the analog driver) is encapsulated
with receiving optics in an all-in-one small module below the camera.
Arduino IDE Installation
Install the Mbed OS core for Nicla boards in the Arduino IDE. With the IDE
open, navigate to Tools > Board > Board Manager, look for Arduino Nicla
Vision in the search window, and install the board.
Next, go to Tools > Board > Arduino Mbed OS Nicla Boards and select
Arduino Nicla Vision. With your board connected via USB, you should see the
Nicla listed under Port; select it.
Vary the frequency of the sound you generate and confirm that the
mic is working correctly.
On the Serial Monitor, you will see the distance from the camera to an object
in front of it (max of 4 m).
We can also test the camera using, for example, the code provided on Examples
> Camera > CameraCaptureRawBytes. We cannot see the image directly, but
we can get the raw image data generated by the camera.
We can use the Web Serial Camera (API) to see the image generated by the
camera. This web application streams the camera image over Web Serial from
camera-equipped Arduino boards.
The Web Serial Camera example shows you how to send image data over
the wire from your Arduino board and how to unpack the data in JavaScript
for rendering. In addition, in the source code of the web application, we can
find some example image filters that show us how to manipulate pixel data to
achieve visual effects.
The web application for displaying the camera image can be accessed here.
We may also look at this tutorial, which explains the setup in more detail.
The IDE should open, defaulting to the helloworld_1.py code in its Code Area.
If not, you can open it from Files > Examples > HelloWorld > helloworld.py
Any messages sent through a serial connection (using print() or error mes-
sages) will be displayed on the Serial Terminal during run time. The image
captured by a camera will be displayed in the Camera Viewer Area (or Frame
Buffer) and in the Histogram area, immediately below the Camera Viewer.
Before connecting the Nicla to the OpenMV IDE, ensure you have the latest boot-
loader version. Go to your Arduino IDE, select the Nicla board, and open the
sketch in Examples > STM32H747_System > STM32H747_manageBootloader. Up-
load the code to your board. The Serial Monitor will guide you.
After updating the bootloader, put the Nicla Vision in bootloader mode by
double-pressing the reset button on the board. The built-in green LED will start
fading in and out. Now return to the OpenMV IDE and click on the connect
icon (Left ToolBar):
A pop-up will tell you that a board in DFU mode was detected and ask
how you would like to proceed. First, select Install the latest release
firmware (vX.Y.Z). This action will install the latest OpenMV firmware on
the Nicla Vision.
You can leave the option Erase internal file system unselected and click
[OK].
Nicla’s green LED will start flashing while the OpenMV firmware is uploaded
to the board, and a terminal window will then open, showing the flashing
progress.
Wait until the green LED stops flashing and fading. When the process ends,
you will see a message saying, “DFU firmware update complete!”. Press [OK].
A green play button appears in the Tool Bar when the Nicla Vision connects.
Also, note that a drive named “NO NAME” will appear on your computer.
Every time you press the [RESET] button on the board, the main.py script
stored on it automatically executes. You can load the main.py code on the IDE
(File > Open File...).
import sensor
import time

sensor.reset()              # Initialize the camera sensor
clock = time.clock()        # Create an FPS clock

while True:
    clock.tick()             # Update the FPS clock.
    img = sensor.snapshot()  # Take a picture and return the image.
    print(clock.fps())
Note: OpenMV Cam runs about half as fast when connected to the
IDE. The FPS should increase once disconnected.
In the GitHub repository, you can find other Python scripts. Try them to test
the onboard sensors.
• Executing the specific batch code for your OS will upload the binary
arduino-nicla-vision.bin to your board.
You can choose which sensor data to pick in the Collect Data section on the
Data Acquisition tab.
Expanding the Nicla Vision Board (optional)
Or Image (Camera):
You can also test an external sensor connected to the ADC (Nicla pin 0)
and the other onboard sensors, such as the built-in microphone, the ToF
(Proximity) or a combination of sensors (fusion).
Note that all 17 Nicla Vision pins will be connected to the Shield
Groves, but some Grove connections remain disconnected.
This shield is MKR compatible and can be used with the Nicla Vision and
Portenta.
For example, suppose that on a TinyML project, you want to send inference
results using a LoRaWAN device and add information about local luminosity.
Often, with offline operations, a local low-power display such as an OLED is
advised. This setup can be seen here:
The Grove Light Sensor would be connected to one of the single Analog
pins (A0/PC4), the LoRaWAN device to the UART, and the OLED to the I2C
connector.
The Nicla pins 3 (Tx) and 4 (Rx) are connected to the Serial Shield connector.
The UART communication is used with the LoRaWAN device. Here is a simple
code to use the UART:
import time
from pyb import UART
from pyb import LED

redLED = LED(1)        # Built-in red LED
uart = UART(4, 9600)   # UART bus and baud rate; adjust the bus number to match your board's pinout

while True:
    uart.write("Hello World!\r\n")   # Send a message over the UART
    redLED.toggle()                  # Toggle the red LED as a heartbeat
    time.sleep_ms(1000)              # Wait one second
To verify that the UART is working, you can, for example, connect another
device, such as an Arduino UNO, to display "Hello World" on its Serial Monitor.
Here is the code.
Below is the Hello World code to be used with the I2C OLED. The MicroPy-
thon SSD1306 OLED driver (ssd1306.py), created by Adafruit, should also be
uploaded to the Nicla (the ssd1306.py script can be found in GitHub).
from machine import I2C
import ssd1306   # MicroPython SSD1306 driver (the ssd1306.py uploaded to the board)

i2c = I2C(1)     # I2C bus wired to the OLED; adjust the bus number for your setup
oled = ssd1306.SSD1306_I2C(128, 64, i2c)   # 128 x 64 display
oled.text("Hello World", 0, 0)             # Write text to the display buffer
oled.show()                                # Push the buffer to the screen
Finally, here is a simple script to read the ADC value on pin “PC4” (Nicla
pin A0):
import pyb
from time import sleep

adc = pyb.ADC(pyb.Pin("PC4"))   # ADC on pin PC4 (Nicla pin A0)

while True:
    val = adc.read()                 # Read the raw ADC value
    print("Light={}".format(val))
    sleep(1)
The ADC can be used for other sensor variables, such as Temperature.
Conclusion
The Arduino Nicla Vision is an excellent tiny device for industrial and profes-
sional uses! It is powerful, trustworthy, low power, and has suitable sensors
for the most common embedded machine learning applications, such as vision,
movement, sensor fusion, and sound.
On the GitHub repository, you will find the latest version of all the
code used or commented on in this hands-on lab.
Resources
• Micropython codes
• Arduino Codes
Image Classification
Overview
As we initiate our studies into embedded machine learning or TinyML, it’s
impossible to overlook the transformative impact of Computer Vision (CV) and
Artificial Intelligence (AI) in our lives. These two intertwined disciplines redefine
what machines can perceive and accomplish, from autonomous vehicles
and robotics to healthcare and surveillance.
More and more, we are facing an artificial intelligence (AI) revolution where,
as stated by Gartner, Edge AI has very high impact potential, and it is here now!
In the “bullseye” of the Radar is Edge Computer Vision, and when we talk
about Machine Learning (ML) applied to vision, the first thing that comes to
mind is Image Classification, a kind of ML “Hello World”!
This lab will explore a computer vision project utilizing Convolutional Neural
Networks (CNNs) for real-time image classification. Leveraging TensorFlow’s
robust ecosystem, we’ll implement a pre-trained MobileNet model and adapt it
for edge deployment. The focus will be optimizing the model to run efficiently
on resource-constrained hardware without sacrificing accuracy.
We’ll employ techniques like quantization and pruning to reduce the compu-
tational load. By the end of this tutorial, you’ll have a working prototype capable
of classifying images in real-time, all running on a low-power embedded system
based on the Arduino Nicla Vision board.
Computer Vision
At its core, computer vision enables machines to interpret and make decisions
based on visual data from the world, essentially mimicking the capability of the
human optical system. Conversely, AI is a broader field encompassing machine
learning, natural language processing, and robotics, among other technologies.
When you bring AI algorithms into computer vision projects, you supercharge
the system’s ability to understand, interpret, and react to visual stimuli.
When discussing Computer Vision projects applied to embedded devices,
the most common applications that come to mind are Image Classification and
Object Detection.
Both models can be implemented on tiny devices like the Arduino Nicla
Vision and used on real projects. In this chapter, we will cover Image Classifica-
tion.
Data Collection
Once we have defined our Machine Learning project goal, the next and most
crucial step is collecting the dataset. For image capturing, we can use:
• Web Serial Camera tool,
• Edge Impulse Studio,
• OpenMV IDE,
• A smartphone.
Here, we will use the OpenMV IDE.
The IDE will ask us to open the file where the data will be saved. Choose
the “data” folder that was created. Note that new icons will appear on the Left
panel.
Using the upper icon (1), enter the first class name, for example, “periquito”:
The stored images use a QVGA frame size of 320 × 240 and the RGB565 color
pixel format.
After capturing the dataset, close the Dataset Editor Tool on the Tools >
Dataset Editor.
We will end up with a dataset on our computer that contains three classes:
periquito, robot, and background.
We should return to Edge Impulse Studio and upload the dataset to our created
project.
We will use the Edge Impulse Studio to train our model. Enter the account
credentials and create a new project:
Dataset
Using the EI Studio (or Studio), we will go over four main steps to have our
model ready for use on the Nicla Vision board: Dataset, Impulse, Tests, and
Deploy (on the Edge Device, in this case, the NiclaV).
Regarding the Dataset, it is essential to point out that our Original Dataset,
captured with the OpenMV IDE, will be split into Training, Validation, and Test.
The Test Set will be spared from the beginning and reserved for use only in the
Test phase after training. The Validation Set will be used during training.
Let the Studio split the original dataset into train and test, and choose the
label for that specific data:
At the end, you should see your “raw data” in the Studio:
Note that when you start to upload the data, a pop-up window can appear,
asking if you are building an Object Detection project. Select [NO].
We can always change it in the Dashboard section: One label per data
item (Image Classification):
Optionally, the Studio allows us to explore the data, showing a complete view
of all the data in the project. We can clear, inspect, or change labels by clicking
on individual data items. In our case, the data seems OK.
By leveraging these learned features, you can train a new model for your
specific task with fewer data and computational resources and yet achieve
competitive accuracy.
Image Pre-Processing
All the input QVGA/RGB565 images will be converted to 27,648 features (96 ×
96 × 3).
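As a quick sanity check of that number, resizing any RGB image to 96 × 96 pixels and flattening it gives 96 × 96 × 3 = 27,648 values; the file name below is just a placeholder.

import numpy as np
from PIL import Image

img = Image.open("sample_qvga.jpg").convert("RGB").resize((96, 96))
features = np.asarray(img).flatten()
print(features.shape)  # (27648,)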
Model Design
In 2017, Google introduced MobileNetV1, a family of general-purpose com-
puter vision neural networks designed with mobile devices in mind to support
classification, detection, and more. MobileNets are small, low-latency, low-
power models parameterized to meet the resource constraints of various use
cases. In 2018, Google launched MobileNetV2: Inverted Residuals and Linear
Bottlenecks.
MobileNet V1 and MobileNet V2 aim at mobile efficiency and embedded
vision applications but differ in architectural complexity and performance.
While both use depthwise separable convolutions to reduce the computational
cost, MobileNet V2 introduces Inverted Residual Blocks and Linear Bottle-
necks to improve performance. These new features allow V2 to capture more
complex features using fewer parameters, making it computationally more
efficient and generally more accurate than its predecessor. Additionally, V2
employs a non-linear activation in the intermediate expansion layer. It still uses
a linear activation for the bottleneck layer, a design choice found to preserve
important information through the network. MobileNet V2 offers an optimized
architecture for higher accuracy and efficiency and will be used in this project.
Although the base MobileNet architecture is already tiny and has low latency,
many times, a specific use case or application may require the model to be
even smaller and faster. MobileNets introduces a straightforward parameter 𝛼
(alpha) called width multiplier to construct these smaller, less computationally
expensive models. The role of the width multiplier 𝛼 is that of thinning a
network uniformly at each layer.
Edge Impulse Studio can use both MobileNetV1 (96 × 96 images) and V2
(96 × 96 or 160 × 160 images), with several different 𝛼 values (from 0.05 to 1.0).
For example, you will get the highest accuracy with V2, 160 × 160 images, and
𝛼 = 1.0. Of course, there is a trade-off. The higher the accuracy, the more
memory (around 1.3 MB RAM and 2.6 MB ROM) will be needed to run the
model, implying more latency. The smaller footprint will be obtained at the
other extreme with MobileNetV1 and 𝛼 = 0.10 (around 53.2 K RAM and 101 K
ROM).
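For intuition, the sketch below shows how the width multiplier shrinks MobileNetV2 when building the model directly in Keras; Edge Impulse configures this internally, so the snippet is only illustrative.

import tensorflow as tf

# Two MobileNetV2 backbones for 96 x 96 RGB inputs: a thin one (alpha = 0.35)
# and the full-width one (alpha = 1.0). Weights are left uninitialized here.
small = tf.keras.applications.MobileNetV2(
    input_shape=(96, 96, 3), alpha=0.35, weights=None, include_top=False)
large = tf.keras.applications.MobileNetV2(
    input_shape=(96, 96, 3), alpha=1.0, weights=None, include_top=False)

print(small.count_params(), large.count_params())  # far fewer parameters with alpha = 0.35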
We will use MobileNetV2 96x96 0.1 (or 0.05) for this project, with an esti-
mated memory cost of 265.3 KB of RAM. This model should be OK for the Nicla
Vision with its 1 MB of SRAM. On the Transfer Learning tab, select this model:
Model Training
Another valuable technique to be used with Deep Learning is Data Augmen-
tation. Data augmentation is a method to improve the accuracy of machine
learning models by creating additional artificial data. A data augmentation
system makes small, random changes to your training data during the training
process (such as flipping, cropping, or rotating the images).
Looking under the hood, you can see how Edge Impulse implements a
data augmentation policy on your data:
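The exact Edge Impulse implementation is not reproduced here, but the sketch below, using plain TensorFlow image ops, illustrates the kind of policy applied: small random flips, crops, and brightness changes made on the fly during training (it assumes float images scaled to [0, 1]).

import tensorflow as tf

def augment_image(image, label):
    # Random horizontal flip
    image = tf.image.random_flip_left_right(image)
    # Pad slightly, then take a random 96 x 96 crop
    image = tf.image.resize_with_crop_or_pad(image, 100, 100)
    image = tf.image.random_crop(image, size=[96, 96, 3])
    # Small random brightness changes
    image = tf.image.random_brightness(image, max_delta=0.2)
    return image, label

# Applied on the fly while training, for example:
# train_ds = train_ds.map(augment_image)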
Exposure to these variations during training can help prevent your model
from taking shortcuts by “memorizing” superficial clues in your training data,
meaning it may better reflect the deep underlying patterns in your dataset.
The final layer of our model will have 12 neurons with a 15% dropout for
overfitting prevention. Here is the Training result:
Model Testing
Now, we should take the dataset set aside at the start of the project and run
the trained model using it as input:
Arduino Library
First, Let’s deploy it as an Arduino Library:
We should install the library as a .zip file in the Arduino IDE and run the sketch
nicla_vision_camera.ino available in Examples under the library name.
OpenMV
It is possible to deploy the trained model to be used with OpenMV in two ways:
as a library and as a firmware (FW). Choosing FW, the Edge Impulse Studio
generates optimized models, libraries, and frameworks needed to make the
inference. Let’s explore this option.
Select OpenMV Firmware on the Deploy Tab and press [Build].
Use the Bootloader tool on the OpenMV IDE to load the FW on your board
(1):
Run it. Pointing the camera to the objects we want to classify, the inference
result will be displayed on the Serial Terminal.
import time

while True:
    ...
    time.sleep_ms(200)  # Delay for 0.2 second
import sensor
import time
import ml

sensor.reset()                          # Initialize the camera
sensor.set_pixformat(sensor.RGB565)
sensor.set_framesize(sensor.QVGA)
sensor.set_windowing((240, 240))        # Square crop for the model input
sensor.skip_frames(time=2000)

model = ml.Model("trained")             # Model embedded in the Edge Impulse OpenMV firmware
clock = time.clock()

while True:
    clock.tick()
    img = sensor.snapshot()
    fps = clock.fps()
    lat = clock.avg()
    print("**********\nPrediction:")
    # Combine labels & confidence into a list of tuples and then
    # sort that list by the confidence values.
    sorted_list = sorted(
        zip(model.labels, model.predict([img])[0].flatten().tolist()),
        key=lambda x: x[1], reverse=True
    )
    max_lbl, max_val = sorted_list[0]   # Label and score of the top prediction
    # Draw the label with the highest probability to the image viewer
    img.draw_string(
        10, 10,
        max_lbl + "\n{:.2f}".format(max_val),
        mono_space=False,
        scale=3
    )
Note that the latency (136 ms) is almost double what we got directly with
the Arduino IDE. This is because we are using the IDE as an interface and also
the time to wait for the camera to be ready. If we start the clock just before the
inference, the latency should drop to around 70 ms.
The NiclaV runs about half as fast when connected to the IDE. The
FPS should increase once disconnected.
To accomplish that, we should upload the code from GitHub or change the
last code to include the LEDs:
from machine import LED     # RGB LED control on the OpenMV firmware

ledRed = LED("LED_RED")
ledGre = LED("LED_GREEN")
ledBlu = LED("LED_BLUE")
ledRed.off()
ledGre.off()
ledBlu.off()

clock = time.clock()

def setLEDs(max_lbl):
    if max_lbl == 'uncertain':
        ledRed.on()
        ledGre.off()
        ledBlu.off()
    if max_lbl == 'periquito':
        ledRed.off()
        ledGre.on()
        ledBlu.off()
    if max_lbl == 'robot':
        ledRed.off()
        ledGre.off()
        ledBlu.on()
    if max_lbl == 'background':
        ledRed.off()
        ledGre.off()
        ledBlu.off()

while True:
    img = sensor.snapshot()
    clock.tick()
    fps = clock.fps()
    lat = clock.avg()
    print("**********\nPrediction:")
    sorted_list = sorted(
        zip(model.labels, model.predict([img])[0].flatten().tolist()),
        key=lambda x: x[1], reverse=True
    )
    max_lbl, max_val = sorted_list[0]   # Top prediction label and score
    if max_val < 0.8:                   # Below the threshold, treat the result as uncertain
        max_lbl = 'uncertain'
    # Draw the label with the highest probability to the image viewer
    img.draw_string(
        10, 10,
        max_lbl + "\n{:.2f}".format(max_val),
        mono_space=False,
        scale=3
    )
    setLEDs(max_lbl)
    time.sleep_ms(200)  # Delay for 0.2 second
Now, each time a class scores a result greater than 0.8, the corresponding
LED will be lit:
• LED Red On: Uncertain (no class is over 0.8)
• LED Green On: Periquito > 0.8
• LED Blue On: Robot > 0.8
• All LEDs Off: Background > 0.8
Here is the result:
In more detail
Taking the opportunity, the same trained model was deployed on the ESP-
CAM, the XIAO, and the Portenta (in this one, the model was trained again,
using grayscale images to be compatible with its camera). Here is the result of
deploying the models as Arduino libraries:
Conclusion
Before we finish, consider that Computer Vision is more than just image classifi-
cation. For example, you can develop Edge Machine Learning projects around
vision in several areas, such as:
• Autonomous Vehicles: Use sensor fusion, lidar data, and computer vision
algorithms to navigate and make decisions.
• Healthcare: Automated diagnosis of diseases through MRI, X-ray, and
CT scan image analysis.
• Retail: Automated checkout systems that identify products as they pass
through a scanner.
• Security and Surveillance: Facial recognition, anomaly detection, and
object tracking in real-time video feeds.
• Augmented Reality: Object detection and classification to overlay digital
information in the real world.
• Industrial Automation: Visual inspection of products, predictive mainte-
nance, and robot and drone guidance.
• Agriculture: Drone-based crop monitoring and automated harvesting.
• Natural Language Processing: Image captioning and visual question
answering.
• Gesture Recognition: For gaming, sign language translation, and human-
machine interaction.
• Content Recommendation: Image-based recommendation systems in
e-commerce.
Resources
• Micropython codes
• Dataset
• Edge Impulse Project
Object Detection
Overview
The main task of Image Classification models is to produce a list of the most
probable object categories present in an image, for example, to identify a tabby
cat just after its dinner:
But what happens when the cat jumps near the wine glass? The model still
only recognizes the predominant category on the image, the tabby cat:
The model identifies the above image utterly wrong as an “ashcan,” possibly
due to the color tonalities.
To solve this issue, we need another type of model, where not only multiple
categories (or labels) can be found but also where the objects are located on a
given image.
As we can imagine, such models are much more complicated and bigger, for
example, the MobileNetV2 SSD FPN-Lite 320x320, trained with the COCO
dataset. This pre-trained object detection model is designed to locate up to 10
objects within an image, outputting a bounding box for each object detected.
The below image is the result of such a model running on a Raspberry Pi:
Those models used for object detection (such as the MobileNet SSD or YOLO)
are usually several MB in size, which is OK for a Raspberry Pi but unsuitable
for use with embedded devices, where the RAM is usually less than 1 MB.
In 2022, Edge Impulse launched FOMO (Faster Objects, More Objects), a novel
solution for performing object detection on embedded devices, not only on
the Nicla Vision (Cortex-M7) but also on Cortex-M4F CPUs (Arduino Nano 33
and the OpenMV M4 series) and Espressif ESP32 devices (ESP-CAM and XIAO
ESP32S3 Sense).
In this hands-on lab, we will explore using FOMO for Object Detection
without going into many details about the model itself. To understand more
about how the model works, you can read the official FOMO announcement by
Edge Impulse, where Louis Moreau and Mat Kelcey explain in detail how it
works.
The Object Detection Project Goal
All Machine Learning projects need to start with a detailed goal. Let’s assume
we are in an industrial facility and must sort and count wheels and special
boxes.
We are interested in which object is in the image, its location (centroid), and
how many of them we can find in it. The object’s size is not detected with FOMO, as
with MobileNet SSD or YOLO, where the Bounding Box is one of the model
outputs.
We will develop the project using the Nicla Vision for image capture and
model inference. The ML project will be developed using the Edge Impulse
Studio. But before starting the object detection project in the Studio, let’s create
a raw dataset (not labeled) with images that contain the objects to be detected.
Data Collection
For image capturing, we can use:
• Web Serial Camera tool,
• Edge Impulse Studio,
• OpenMV IDE,
• A smartphone.
Here, we will use the OpenMV IDE.
Edge Impulse suggests that the objects should be similar in size and not
overlap for better performance. This is OK in an industrial facility, where
the camera should be fixed, keeping the same distance from the objects to be
detected. Despite that, we will also try using mixed sizes and positions to see
the result.
We will not create separate folders for our images because each
contains multiple labels.
Connect the Nicla Vision to the OpenMV IDE and run the dataset_capture_-
script.py. Clicking on the Capture Image button will start capturing images:
We suggest using around 50 images to mix the objects and vary the number
of each appearing on the scene. Try to capture different angles, backgrounds,
and light conditions.
The stored images use a QVGA frame size of 320 × 240 and the RGB565
color pixel format.
After capturing your dataset, close the Dataset Editor Tool on the Tools >
Dataset Editor.
Go to Edge Impulse Studio, enter your credentials at Login (or create an ac-
count), and start a new project.
Here, you can clone the project developed for this hands-on: NICLA_-
Vision_Object_Detection.
In the Studio, go to the Data acquisition tab, and in the UPLOAD DATA section,
upload the captured files from your computer.
You can leave for the Studio to split your data automatically between
Train and Test or do it manually.
All the unlabeled images (51) were uploaded, but they still need to be labeled
appropriately before being used as a dataset in the project. The Studio has a
tool for that purpose, which you can find in the link Labeling queue (51).
There are two ways to perform AI-assisted labeling in the Edge Impulse
Studio (free version):
• Using yolov5
Continue with this process until the queue is empty. At the end, all images
should have the objects labeled as those samples below:
The Impulse Design
Next, review the labeled samples on the Data acquisition tab. If one of the
labels is wrong, it can be edited using the three dots menu after the sample
name:
We will be guided to replace the wrong label and correct the dataset.
The feature explorer shows that all samples evidence a good separation after
the feature generation.
One of the samples (46) is apparently in the wrong space, but click-
ing on it confirms that the labeling is correct.
For training, we should select a pre-trained model. Let’s use the FOMO (Faster
Objects, More Objects) MobileNetV2 0.35. This model uses around 250
KB of RAM and 80 KB of ROM (flash), which suits our board well, since it
has 1 MB of SRAM and 2 MB of flash.
Once connected, you can use the Nicla to capture actual images to be tested
by the trained model on Edge Impulse Studio.
One thing to note is that the model can produce false positives and negatives.
This can be minimized by defining a proper Confidence Threshold (use the
three dots menu for the setup). Try with 0.8 or more.
When you try to connect the Nicla to the OpenMV IDE again, it will try to
update its FW. Choose the option Load a specific firmware instead, or go
to Tools > Run Bootloader (Load Firmware).
You will find a ZIP file on your computer from the Studio. Open it:
Before running the script, let’s change a few lines. Note that you can leave
the window definition as 240 × 240 and the camera capturing images as QV-
GA/RGB. The captured image will be pre-processed by the FW deployed from
Edge Impulse.
import sensor
import time
import ml
from ml.utils import NMS
import math
import image
min_confidence = 0.8
Change, if necessary, the color of the circles that will be used to display the
detected objects’ centroids, for better contrast.
# FOMO outputs an image per class, where each pixel in the image is the centroid of the trained
# object. So, we will get those output images and then run find_blobs() on them to extract the
# centroids. We will also run get_stats() on the detected blobs to determine their score.
# The Non-Max-Suppression (NMS) object then filters out overlapping detections and maps their
# position in the output image back to the original input image. The function then returns a
# list per class, which each contain a list of (rect, score) tuples representing the detected
# objects.
clock = time.clock()
while True:
    clock.tick()
    img = sensor.snapshot()
From the camera’s view, we can see the objects with their centroids marked
with 12 pixel-fixed circles (each circle has a distinct color, depending on its
class). On the Serial Terminal, the model shows the labels detected and their
position on the image window (240 × 240).
Note that the frames per second rate is around 8 fps (similar to what we
got with the Image Classification project). This happens because FOMO is
cleverly built on top of a CNN model, rather than an object detection model
such as SSD MobileNet or YOLO. For example, when running a MobileNetV2
SSD FPN-Lite 320 × 320 model on a Raspberry Pi 4, the latency is around 5
times higher (around 1.5 fps).
Here is a short video showing the inference results: https://youtu.be/JbpoqRp3BbM
Conclusion
FOMO is a significant leap in the image processing space, as Louis Moreau and
Mat Kelcey put it during its launch in 2022:
Multiple possibilities exist for exploring object detection (and, more precisely,
counting them) on embedded devices. This can be very useful on projects
counting bees, for example.
Resources
• Edge Impulse Project
Keyword Spotting (KWS)
Overview
Having already explored the Nicla Vision board in the Image Classification and
Object Detection applications, we are now shifting our focus to voice-activated
applications with a project on Keyword Spotting (KWS).
How does a voice assistant work?
Stage 1: A small microprocessor inside the Echo Dot or Google Home con-
tinuously listens, waiting for the keyword to be spotted, using a TinyML model
at the edge (KWS application).
Stage 2: Only when triggered by the KWS application on Stage 1 is the data
sent to the cloud and processed on a larger model.
The video below shows an example of a Google Assistant being programmed
on a Raspberry Pi (Stage 2), with an Arduino Nano 33 BLE as the TinyML device
(Stage 1).
https://youtu.be/e_OPgcnsyvM
To explore the above Google Assistant project, please see the tutorial:
Building an Intelligent Voice Assistant From Scratch.
Dataset
The critical component of any Machine Learning Workflow is the dataset. Once
we have decided on specific keywords, in our case (YES and NO), we can
take advantage of the dataset developed by Pete Warden, “Speech Commands:
A Dataset for Limited-Vocabulary Speech Recognition.” This dataset has 35
keywords (with over 1,000 samples each), such as yes, no, stop, and go. For
words such as yes and no, we can get around 1,500 samples each.
You can download a small portion of the dataset from the Edge Impulse Studio (Keyword
spotting pre-built dataset), which includes samples from the four classes we
will use in this project: yes, no, noise, and background. For this, follow the
steps below:
• Download the keywords dataset.
• Unzip the file to a location of your choice.
Initiate a new project at Edge Impulse Studio (EIS) and select the Upload
Existing Data tool in the Data Acquisition section. Choose the files to be
uploaded:
Define the Label, select Automatically split between train and test,
and Upload data to the EIS. Repeat for all classes.
The dataset will now appear in the Data acquisition section. Note that the
approximately 6,000 samples (1,500 for each class) are split into Train (4,800)
and Test (1,200) sets.
The key difference between sound and audio is the type of energy.
Sound is mechanical perturbation (longitudinal sound waves) that
propagate through a medium, causing variations of pressure in it.
Audio is an electrical (analog or digital) signal representing sound.
When we pronounce a keyword, the sound waves should be converted to
audio data. The conversion should be done by sampling the signal generated
by the microphone at a 16 KHz frequency with 16 bits per sample.
So, any device that can generate audio data with this basic specification (16
KHz/16 bits) will work fine. As a device, we can use the NiclaV, a computer, or
even your mobile phone.
• Put the NiclaV in Boot Mode by pressing the reset button twice.
• Upload the binary arduino-nicla-vision.bin to your board by running the
batch code corresponding to your OS.
Go to your project on EIS, and on the Data Acquisition tab, select WebUSB.
A window will pop up; choose the option that shows that the Nicla is paired
and press [Connect].
You can choose which sensor data to pick in the Collect Data section on
the Data Acquisition tab. Select: Built-in microphone, define your label
(for example, yes), the sampling Frequency[16000Hz], and the Sample length
(in milliseconds), for example [10s]. Start sampling.
Data in Pete’s dataset have a length of 1s, but the recorded samples are 10s
long and must be split into 1s samples. Click on three dots after the sample
name and select Split sample.
A window will pop up with the Split tool.
Once inside the tool, split the data into 1-second (1000 ms) records. If neces-
sary, add or remove segments. This procedure should be repeated for all new
samples.
Go to Devices, scan the QR Code using your phone, and click on the link.
A data Collection app will appear in your browser. Select Collecting Audio,
and define your Label, data capture Length, and Category.
Impulse Design
First, we will take the data points with a 1-second window, augmenting the
data and sliding that window in 500 ms intervals. Note that the zero-pad
data option is set. It is essential to pad with zeros any samples smaller than
1 second (in some cases, samples can end up smaller than the 1000 ms window
in the split tool, which is used to avoid noise and spikes).
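A small illustration of what zero-padding does: any clip shorter than one second (16,000 points at 16 KHz) is filled with zeros up to the full window length. The clip below is randomly generated just for the example.

import numpy as np

def zero_pad(audio, target_len=16000):
    # Append zeros so every sample spans exactly one second at 16 KHz.
    return np.pad(audio, (0, max(0, target_len - len(audio))))

clip = np.random.randn(14500)   # e.g., a 0.9-second clip from the split tool
print(zero_pad(clip).shape)     # (16000,)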
Each 1-second audio sample should be pre-processed and converted to an
image (for example, 13 × 49 × 1). As discussed in the Feature Engineering for Au-
dio Classification Hands-On tutorial, we will use Audio (MFCC), which extracts
features from audio signals using Mel-Frequency Cepstral Coefficients, which
are well suited for the human voice, as in our case here.
Next, we select the Classification block to build our model from scratch
using a Convolution Neural Network (CNN).
Pre-Processing (MFCC)
The next step is to create the features that will be used to train the model in the following phase:
We could keep the default parameter values, but we will use the DSP Autotune
parameters option.
We will take the Raw features (our 1-second, 16 KHz sampled audio data)
and use the MFCC processing block to calculate the Processed features. For
every 16,000 raw features (16,000 × 1 second), we will get 637 processed features
(13 × 49).
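For readers who want to reproduce this step outside the Studio, the sketch below computes a comparable MFCC feature matrix with librosa; the file name, FFT size, and hop size are assumptions chosen to roughly match the Studio's 13 × 49 output, not its exact settings.

import librosa

# Load a 1-second, 16 KHz mono sample (the path is illustrative).
audio, sr = librosa.load("yes_sample.wav", sr=16000, duration=1.0)

# 13 MFCCs per frame; with a 512-point FFT and a 320-sample hop,
# one second of audio yields roughly 50 frames.
mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=13, n_fft=512, hop_length=320)
print(mfcc.shape)  # about (13, 51), close to the Studio's 13 x 49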
The result shows that we only used a small amount of memory to pre-process
data (16 KB) and a latency of 34 ms, which is excellent. For example, on an
Arduino Nano (Cortex-M4f @ 64 MHz), the same pre-process will take around
480 ms. The parameters chosen, such as the FFT length [512], will significantly
impact the latency.
Now, let’s Save parameters and move to the Generated features tab, where
the actual features will be generated. Using UMAP, a dimension reduction
technique, the Feature explorer shows how the features are distributed on a
two-dimensional plot.
The result seems OK, with a visually clear separation between yes features
(in red) and no features (in blue). The unknown features seem nearer to the no
space than the yes. This suggests that the keyword no has more propensity to
false positives.
Listen to the samples that went wrong. For example, for yes, most of the
mistakes were related to a yes pronounced as “yeh”. You can acquire additional
samples and then retrain your model.
Testing
Testing the model with the data reserved for testing (Test Data), we got an
accuracy of approximately 76%.
Inspecting the F1 score, we can see that for YES, we got 0.90, an excellent
result since we expect to use this keyword as the primary “trigger” for our KWS
project. The worst result (0.70) is for UNKNOWN, which is OK.
For NO, we got 0.72, which was expected, but to improve this result, we can
move the samples that were not correctly classified to the training dataset and
then repeat the training process.
Live Classification
We can proceed to the project’s next step but also consider that it is possible to
perform Live Classification using the NiclaV or a smartphone to capture
live samples, testing the trained model before deployment on our device.
When the Build button is selected, a zip file will be created and downloaded
to your computer. On your Arduino IDE, go to the Sketch tab, select the option
Add .ZIP Library, and Choose the .zip file downloaded by EIS:
Now, it is time for a real test. We will make inferences while completely
disconnected from the EIS. Let’s use the NiclaV code example created when
we deployed the Arduino Library.
In your Arduino IDE, go to the File/Examples tab, look for your project, and
select nicla-vision/nicla-vision_microphone (or nicla-vision_microphone_-
continuous)
Press the reset button twice to put the NiclaV in boot mode, upload the sketch
to your board, and test some real inferences:
Post-processing
Now that we know the model is working since it detects our keywords, let’s
modify the code to see the result with the NiclaV completely offline (disconnected from the computer).
...
void setup()
{
// Once you finish debugging your code, you can
// comment or delete the Serial part of the code
Serial.begin(115200);
while (!Serial);
Serial.println("Inferencing - Nicla Vision KWS with LEDs");
Create two functions: turn_off_leds(), to turn off all RGB LEDs, and turn_on_leds(), to light the LED associated with the predicted class:
/*
 * @brief turn_off_leds function - turn-off all RGB LEDs
 */
void turn_off_leds(){
  digitalWrite(LEDR, HIGH);
  digitalWrite(LEDG, HIGH);
  digitalWrite(LEDB, HIGH);
}

/*
 * @brief turn_on_leds function used to turn on the RGB LEDs
 * @param[in] pred_index
 *            no:      [0] ==> Red ON
 *            noise:   [1] ==> ALL OFF
 *            unknown: [2] ==> Blue ON
 *            yes:     [3] ==> Green ON
 */
void turn_on_leds(int pred_index) {
  switch (pred_index)
  {
    case 0:
      turn_off_leds();
      digitalWrite(LEDR, LOW);
      break;

    case 1:
      turn_off_leds();
      break;

    case 2:
      turn_off_leds();
      digitalWrite(LEDB, LOW);
      break;

    case 3:
      turn_off_leds();
      digitalWrite(LEDG, LOW);
      break;
  }
}
And change the // print the predictions portion of the code in loop():
...
#if EI_CLASSIFIER_HAS_ANOMALY == 1
ei_printf(" anomaly score: ");
ei_printf_float(result.anomaly);
ei_printf("\n");
#endif
print_results = 0;
}
}
...
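The prediction-handling part of the modified loop() is omitted above. A minimal sketch of what it could look like, assuming the result struct, the ix index, and the EI_CLASSIFIER_LABEL_COUNT constant from the Edge Impulse example, is:

// Find the class with the highest score and light the matching LED
// (no: red, noise: all off, unknown: blue, yes: green)
int pred_index = 0;
float pred_value = 0.0f;
for (size_t ix = 0; ix < EI_CLASSIFIER_LABEL_COUNT; ix++) {
    ei_printf("    %s: %.5f\n", result.classification[ix].label,
              result.classification[ix].value);
    if (result.classification[ix].value > pred_value) {
        pred_index = (int)ix;
        pred_value = result.classification[ix].value;
    }
}
turn_on_leds(pred_index);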
Conclusion
You will find the notebooks and code used in this hands-on tutorial on the GitHub repository.
Before we finish, consider that Sound Classification is more than just voice.
For example, you can develop TinyML projects around sound in several areas,
such as:
• Security (Broken Glass detection, Gunshot)
• Industry (Anomaly Detection)
• Medical (Snore, Cough, Pulmonary diseases)
Resources
• Subset of Google Speech Commands Dataset
• KWS MFCC Analysis Colab Notebook
• KWS_CNN_training Colab Notebook
• Arduino Post-processing Code
• Edge Impulse Project
Motion Classification and Anomaly Detection
Overview
Transportation is the backbone of global commerce. Millions of containers are transported daily via various means, such as ships, trucks, and trains, to destinations worldwide.
Learning Objectives
By the end of this tutorial, you’ll have a working prototype that can classify
different types of motion and detect anomalies during the transportation of
containers. This knowledge can be a stepping stone to more advanced projects
in the burgeoning field of TinyML involving vibration.
Let’s define a sketch that will allow us to capture our data with a defined
sampling frequency (for example, 50 Hz):
/*
 * Based on the Edge Impulse Data Forwarder Example (Arduino)
 * - https://docs.edgeimpulse.com/docs/cli-data-forwarder
 * Developed by M.Rovai @11May23
 */

/* Include ------------------------------------------- */
#include <Arduino_LSM6DSOX.h>

/* Constants and variables (missing in the original fragment,
   added here so the sketch compiles) ------------------ */
#define CONVERT_G_TO_MS2 9.80665f   // LSM6DSOX reports acceleration in g
#define FREQUENCY_HZ     50
#define INTERVAL_MS      (1000 / (FREQUENCY_HZ + 1))

static unsigned long last_interval_ms = 0;
float x, y, z;

void setup() {
    Serial.begin(9600);
    while (!Serial);

    if (!IMU.begin()) {
        Serial.println("Failed to initialize IMU!");
        while (1);
    }
}

void loop() {
    if (millis() > last_interval_ms + INTERVAL_MS) {
        last_interval_ms = millis();

        if (IMU.accelerationAvailable()) {
            // Read raw acceleration measurements from the device (in g)
            IMU.readAcceleration(x, y, z);

            // converting to m/s2
            float ax_m_s2 = x * CONVERT_G_TO_MS2;
            float ay_m_s2 = y * CONVERT_G_TO_MS2;
            float az_m_s2 = z * CONVERT_G_TO_MS2;

            Serial.print(ax_m_s2);
            Serial.print("\t");
            Serial.print(ay_m_s2);
            Serial.print("\t");
            Serial.println(az_m_s2);
        }
    }
}
Uploading the sketch and inspecting the Serial Monitor, we can see that we
are capturing 50 samples per second.
Note that with the Nicla board resting on a table (with the camera
facing down), the 𝑧-axis measures around 9.8 m/s2 , the expected
earth acceleration.
From the above images, we can define for our simulation that primarily horizontal movements (x- or y-axis) should be associated with the "Terrestrial" class, vertical movements (z-axis) with the "Lift" class, no activity with the "Idle" class, and movement on all three axes with the "Maritime" class.
Data Collection
For data collection, we have several options. In a real case, the device could be connected directly to a container, with the data collected in a file (for example, .CSV) and stored on an SD card (via SPI connection) or in an offline repository on your computer. Data can also be sent remotely to a nearby repository, such as a mobile phone, using Bluetooth (as done in this project: Sensor DataLogger). Once your dataset is collected and stored as a .CSV file, it can be uploaded to the Studio using the CSV Wizard tool.
In this video, you can learn alternative ways to send data to the
Edge Impulse Studio.
Please create a new project on the Edge Impulse Studio (EIS) and connect
the Nicla to it, following these steps:
1. Install the Edge Impulse CLI and Node.js on your computer.
2. Upload a sketch for data capture (the one discussed previously in this
tutorial).
3. Use the CLI Data Forwarder to capture data from the Nicla’s accelerometer
and send it to the Studio, as shown in this diagram:
Start the CLI Data Forwarder on your terminal, entering (if it is the first time)
the following command:
$ edge-impulse-data-forwarder --clean
Next, enter your EI credentials and choose your project, variables (for example, accX, accY, and accZ), and device name (for example, NiclaV):
Go to the Devices section on your EI Project and verify if the device is con-
nected (the dot should be green):
You can clone the project developed for this hands-on: NICLA
Vision Movement Classification.
Data Collection
On the Data Acquisition section, you should see that your board [NiclaV]
is connected. The sensor is available: [sensor with 3 axes (accX, accY,
accZ)] with a sampling frequency of [50 Hz]. The Studio suggests a sample
length of [10000] ms (10 s). The last thing left is defining the sample label. Let's start with [terrestrial]:
Press [Start sampling] and move the device to simulate the movement (for example, sliding it over a table). After 10 s, your data will be uploaded to the Studio. Here is how the sample was collected:
As expected, the movement was captured mainly on the y-axis (green). In blue, we see the z-axis at around -10 m/s2 (the Nicla has the camera facing up).
As discussed before, we should capture data from all four Transportation
Classes. So, imagine that you have a container with a built-in accelerometer
facing the following situations:
Maritime (pallets on a boat in an angry ocean). The movement is captured on all three axes:
You can capture, for example, 2 minutes (twelve samples of 10 seconds) for
each of the four classes (a total of 8 minutes of data). Using the three dots
menu after each one of the samples, select 2 of them, reserving them for the
Test set. Alternatively, you can use the automatic Train/Test Split tool on
the Danger Zone of Dashboard tab. Below, you can see the resulting dataset:
Once you have captured your dataset, you can explore it in more detail using
the Data Explorer, a visual tool to find outliers or mislabeled data (helping to
correct them). The data explorer first tries to extract meaningful features from
your data (by applying signal processing and neural network embeddings)
and then uses a dimensionality reduction algorithm such as PCA or t-SNE to
map these features to a 2D space. This gives you a one-look overview of your
complete dataset.
In our case, the dataset seems OK (good separation). But the PCA shows we could have issues between maritime (green) and lift (orange). This is expected since, on a boat, the movement can sometimes be purely vertical.
Impulse Design
The next step is the definition of our Impulse, which takes the raw data and
uses signal processing to extract features, passing them as the input tensor of a
learning block to classify new data. Go to Impulse Design and Create Impulse.
The Studio will suggest the basic design. Let’s also add a second Learning Block
for Anomaly Detection.
This second block uses a K-means model. If we imagine our known classes as clusters, any sample that does not fit into one of them could be an outlier, an anomaly, such as a container rolling off a ship on the ocean or falling from a forklift.
Let’s dig into those steps and parameters to understand better what we are
doing here.
Once the data is preprocessed and segmented, you can extract features that describe the motion's characteristics (a short sketch of the time-domain case appears after this list). Some typical features extracted from accelerometer data include:
• Time-domain features describe the data’s statistical properties within
each segment, such as mean, median, standard deviation, skewness, kur-
tosis, and zero-crossing rate.
• Frequency-domain features are obtained by transforming the data into the
frequency domain using techniques like the Fast Fourier Transform (FFT).
Some typical frequency-domain features include the power spectrum,
spectral energy, dominant frequencies (amplitude and frequency), and
spectral entropy.
• Time-frequency domain features combine the time and frequency do-
main information, such as the Short-Time Fourier Transform (STFT) or
the Discrete Wavelet Transform (DWT). They can provide a more detailed
understanding of how the signal’s frequency content changes over time.
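As a rough illustration of the time-domain features listed above (a sketch only, not the Studio's actual implementation; the function name and signature are assumptions), the snippet below computes the mean, RMS, standard deviation, and zero-crossing count over one window of samples:

#include <math.h>

// Illustrative only: a few time-domain features over one window of samples
void time_domain_features(const float *window, int n,
                          float *mean, float *rms, float *std_dev,
                          int *zero_crossings) {
    float sum = 0.0f, sum_sq = 0.0f;
    *zero_crossings = 0;
    for (int i = 0; i < n; i++) {
        sum += window[i];
        sum_sq += window[i] * window[i];
        // a sign change between consecutive samples counts as a zero crossing
        if (i > 0 && (window[i - 1] * window[i]) < 0.0f) {
            (*zero_crossings)++;
        }
    }
    *mean = sum / n;
    *rms = sqrtf(sum_sq / n);
    // variance = E[x^2] - (E[x])^2; standard deviation is its square root
    float var = (sum_sq / n) - (*mean) * (*mean);
    *std_dev = sqrtf(var > 0.0f ? var : 0.0f);
}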
In many cases, the number of extracted features can be large, which may lead
to overfitting or increased computational complexity. Feature selection tech-
niques, such as mutual information, correlation-based methods, or principal
component analysis (PCA), can help identify the most relevant features for a
given application and reduce the dimensionality of the dataset. The Studio can
help with such feature importance calculations.
So, for an FFT length of 32 points, the resulting output of the Spectral Analysis
Block will be 21 features per axis (a total of 63 features).
You can learn more about how each feature is calculated by down-
loading the notebook Edge Impulse - Spectral Features Block Analy-
sis TinyML under the hood: Spectral Analysis or opening it directly
on Google CoLab.
Generating features
Once we understand what the pre-processing does, it is time to finish the job.
So, let’s take the raw data (time-series type) and convert it to tabular data. For
that, go to the Spectral Features section on the Parameters tab, define the main parameters as discussed in the previous section ([FFT] with [32] points), and select [Save Parameters]:
At the top menu, select the Generate Features option and the Generate Features button. Each 2-second data window will be converted into one data point with 63 features.
The Feature Explorer will show those data in 2D using UMAP. Uni-
form Manifold Approximation and Projection (UMAP) is a dimen-
sion reduction technique that can be used for visualization similarly
to t-SNE but is also applicable for general non-linear dimension
reduction.
The visualization makes it possible to verify that after the feature generation,
the classes present keep their excellent separation, which indicates that the
classifier should work well. Optionally, you can analyze how important each
one of the features is for one class compared with others.
Models Training
Our classifier will be a Dense Neural Network (DNN) with 63 neurons in its input layer, two hidden layers with 20 and 10 neurons, and an output layer with four neurons (one per class), as shown here:
For Anomaly Detection, we will choose the suggested features, which are precisely the most important ones found during Feature Extraction, plus the accZ RMS.
The number of clusters will be [32], as suggested by the Studio:
Testing
We can verify how our model behaves with unknown data using the 20% of the data left aside during the data capture phase. The resulting accuracy was almost 95%, which is good. You can always work to improve the results, for example, by investigating what went wrong with the misclassified samples. If they represent a situation not covered by the dataset, you can add them to the training dataset and repeat the training process.
The default minimum threshold for a result to be considered uncertain is [0.6] for classification and [0.3] for anomaly. Since we have four classes (their output sum should be 1.0), you can also set up a lower threshold for a class to be considered valid (for example, 0.4). You can Set confidence thresholds on the three dots menu, next to the Classify all button.
You can also perform Live Classification with your device (which should still
be connected to the Studio).
Be aware that here you will capture real data with your device and upload it to the Studio, where the inference will be performed using the trained model (but the model is NOT running on your device).
Deploy
It is time to deploy the preprocessing block and the trained model to the Nicla.
The Studio will package all the needed libraries, preprocessing functions, and
trained models, downloading them to your computer. You should select the op-
tion Arduino Library, and at the bottom, you can choose Quantized (Int8)
or Unoptimized (float32) and [Build]. A Zip file will be created and down-
loaded to your computer.
On your Arduino IDE, go to the Sketch tab, select Add .ZIP Library, and choose the .zip file downloaded by the Studio. A message will appear in the IDE Terminal: Library installed.
Inference
Now, it is time for a real test. We will make inferences wholly disconnected
from the Studio. Let’s change one of the code examples created when you
deploy the Arduino Library.
In your Arduino IDE, go to the File/Examples tab and look for your project,
and on examples, select Nicla_vision_fusion:
Note that the code created by Edge Impulse considers a sensor fusion approach
where the IMU (Accelerometer and Gyroscope) and the ToF are used. At the
beginning of the code, you have the libraries related to our project, IMU and
ToF:
/* Includes ---------------------------------------------- */
#include <NICLA_Vision_Movement_Classification_inferencing.h>
#include <Arduino_LSM6DSOX.h> //IMU
#include "VL53L1X.h" // ToF
You can keep the code this way for testing because the trained model will use only features pre-processed from the accelerometer. However, for a real project, consider writing your code with only the libraries that are actually needed.
Note that in all situations above, the value of the anomaly score was smaller than 0.0. Try a new movement that was not part of the original dataset, for example, "rolling" the Nicla with the camera facing upside-down, simulating a container falling from a boat or even a boat accident:
Post-processing
Now that we know the model is working, since it detects the movements, we suggest that you modify the code to see the result with the NiclaV completely offline (disconnected from the PC and powered by a battery, a power bank, or an independent 5 V power supply).
The idea is to do the same as with the KWS project: if one specific movement is detected, a specific LED could be lit. For example, if terrestrial is detected, the green LED will light; if maritime, the red LED; if lift, the blue LED; and if no movement is detected (idle), the LEDs will be OFF. You can also add a condition for when an anomaly is detected; in this case, for example, white can be used (all LEDs lit simultaneously).
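A minimal sketch of such a mapping is shown below; it reuses the inverted-logic RGB LED writes from the KWS lab and assumes Edge Impulse's alphabetical class order (idle, lift, maritime, terrestrial), so check the order in your own project:

// Light the LED associated with the predicted class; white for anomalies
void show_class_on_leds(int pred_index, bool anomaly) {
    digitalWrite(LEDR, HIGH);   // start with all LEDs off (inverted logic)
    digitalWrite(LEDG, HIGH);
    digitalWrite(LEDB, HIGH);
    if (anomaly) {              // anomaly: white (all LEDs on)
        digitalWrite(LEDR, LOW);
        digitalWrite(LEDG, LOW);
        digitalWrite(LEDB, LOW);
        return;
    }
    switch (pred_index) {
        case 0: break;                           // idle: all off
        case 1: digitalWrite(LEDB, LOW); break;  // lift: blue
        case 2: digitalWrite(LEDR, LOW); break;  // maritime: red
        case 3: digitalWrite(LEDG, LOW); break;  // terrestrial: green
    }
}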
Conclusion
The notebooks and code used in this hands-on tutorial can be found on the GitHub repository.
Case Applications
Industrial and Manufacturing
Healthcare
• Patient Monitoring: Detecting falls or abnormal movements in the elderly
or those with mobility issues.
• Rehabilitation: Monitoring the progress of patients recovering from
injuries by classifying motion patterns during physical therapy sessions.
• Activity Recognition: Classifying types of physical activity for fitness
applications or patient monitoring.
Consumer Electronics
• Gesture Control: Interpreting specific motions to control devices, such
as turning on lights with a hand wave.
• Gaming: Enhancing gaming experiences with motion-controlled inputs.
Agriculture
• Equipment Monitoring: Tracking the performance and usage of agricul-
tural machinery.
• Animal Behavior Analysis: Monitoring livestock movements to detect
behaviors indicating health issues or stress.
Environmental Monitoring
• Seismic Activity: Detecting irregular motion patterns that precede earth-
quakes or other geologically relevant events.
• Oceanography: Studying wave patterns or marine movements for re-
search and safety purposes.
Nicla 3D case
For real applications, such as some of those described before, we can add a case to our device. Eoin Jordan, from Edge Impulse, developed a great wearable and machine health case for the Nicla range of boards. It works with a 10 mm magnet, M2 screws, and a 16 mm strap for human and machine health use case scenarios.
Here is the link: Arduino Nicla Voice and Vision Wearable Case.
The applications for motion classification and anomaly detection are exten-
sive, and the Arduino Nicla Vision is well-suited for scenarios where low power
consumption and edge processing are advantageous. Its small form factor and
efficiency in processing make it an ideal choice for deploying portable and
remote applications where real-time processing is crucial and connectivity may
be limited.
Resources
• Arduino Code
• Edge Impulse Spectral Features Block Colab Notebook
• Edge Impulse Project
XIAO ESP32S3
Pre-requisites
• XIAO ESP32S3 Sense Board: Ensure you have the XIAO ESP32S3 Sense
Board.
• USB-C Cable: This is for connecting the board to your computer.
• Network: With internet access for downloading necessary software.
• SD Card and an SD card Adapter: This saves audio and images (optional).
Setup
• Setup XIAO ESP32S3
Exercises
Overview
The XIAO ESP32S3 Sense is Seeed Studio’s affordable development board,
which integrates a camera sensor, digital microphone, and SD card support.
Combining embedded ML computing power and photography capability, this
development board is a great tool to start with TinyML (intelligent voice and
vision AI).
For more details, please refer to the Seeed Studio WiKi page: https://wiki.seeedstudio.com/xiao_esp32s3_getting_started/
Next, open the Boards Manager. Go to Tools > Board > Boards Manager…, search for esp32, and select and install the most updated and stable package (avoid alpha versions):
Attention
Alpha versions (for example, 3.x-alpha) do not work correctly with
the XIAO and Edge Impulse. Use the last stable version (for example,
2.0.11) instead.
Last but not least, choose the Port where the ESP32S3 is connected.
That is it! The device should be OK. Let’s do some tests.
#define LED_BUILT_IN 21
void setup() {
pinMode(LED_BUILT_IN, OUTPUT); // Set the pin as output
}
Note that the pins work with inverted logic: LOW to Turn on and
HIGH to turn off.
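A minimal loop() to complete the Blink test could look like the snippet below (the 1-second delay is arbitrary):

void loop() {
    digitalWrite(LED_BUILT_IN, LOW);   // LOW turns the built-in LED on
    delay(1000);
    digitalWrite(LED_BUILT_IN, HIGH);  // HIGH turns it off
    delay(1000);
}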
Microphone Test
Let’s start with sound detection. Go to the GitHub project and download the
sketch: XIAOEsp2s3_Mic_Test and run it on the Arduino IDE:
For a start, Insert the SD Card on the XIAO as shown in the photo below (the
SD Card should be formatted to FAT32).
Testing WiFi
One of the XIAO ESP32S3’s differentiators is its WiFi capability. So, let’s test its
radio by scanning the Wi-Fi networks around it. You can do this by running
one of the code examples on the board.
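One such example is the WiFiScan sketch shipped with the ESP32 Arduino core; a minimal sketch of the same idea is shown below:

#include "WiFi.h"

void setup() {
    Serial.begin(115200);
    WiFi.mode(WIFI_STA);    // station mode, not connected to any access point
    WiFi.disconnect();
    delay(100);
}

void loop() {
    int n = WiFi.scanNetworks();             // blocking scan
    Serial.printf("%d networks found\n", n);
    for (int i = 0; i < n; i++) {
        Serial.printf("%d: %s (%d dBm)\n", i + 1,
                      WiFi.SSID(i).c_str(), (int)WiFi.RSSI(i));
    }
    delay(5000);                             // wait before scanning again
}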
Next, you can run a web server code example on the board and monitor how your server is working with the Serial Monitor. You will see a page with links that can turn the built-in LED of your XIAO ON and OFF.
Streaming video to Web
Now that you know that you can send commands from the webpage to your
device, let’s do the reverse. Let’s take the image captured by the camera and
stream it to a webpage:
Download from GitHub the folder that contains the code: XIAO-ESP32S3-
Streeming_Video.ino.
Enter your credentials and run the sketch. On the Serial monitor, you can
find the page address to enter in your browser:
Open the page on your browser (wait a few seconds to start the streaming).
That’s it.
Streaming what your camera is "seeing" can be important when you position it to capture a dataset for an ML project (for example, using the code "take_phots_commands.ino").
Of course, we can do both things simultaneously: show what the camera
sees on the page and send a command to capture and save the image on the SD
card. For that, you can use the code Camera_HTTP_Server_STA, which can be
downloaded from GitHub.
Inspect the code; it will be easier to understand how the camera works. This
code was developed based on the great Rui Santos Tutorial ESP32-CAM Take
Photo and Display in Web Server, which I invite all of you to visit.
Using the CameraWebServer
In the Arduino IDE, go to File > Examples > ESP32 > Camera, and select
CameraWebServer
You should also comment out all the camera models, except the XIAO model pins:
#define CAMERA_MODEL_XIAO_ESP32S3 // Has PSRAM
Do not forget to enable the PSRAM on the Tools menu. Enter your Wi-Fi credentials and upload the code to the device:
If the code is executed correctly, you should see the address on the Serial
Monitor:
Copy the address on your browser and wait for the page to be uploaded.
Select the camera resolution (for example, QVGA) and select [START STREAM].
Wait for a few seconds/minutes, depending on your connection. Using the
[Save] button, you can save an image to your computer download area.
That’s it! You can save the images directly on your computer for use on
projects.
Conclusion
The XIAO ESP32S3 Sense is flexible, inexpensive, and easy to program. With
8 MB of RAM, memory is not an issue, and the device can handle many post-
processing tasks, including communication.
You will find the last version of the code on the GitHub repository: XIAO-ESP32S3-Sense.
Resources
• XIAO ESP32S3 Code
Image Classification
Overview
More and more, we are facing an artificial intelligence (AI) revolution where, as stated by Gartner, Edge AI has a very high impact potential, and it is happening now!
A TinyML Image Classification Project – Fruits versus Veggies
The whole idea of our project will be to train a model and proceed with
inference on the XIAO ESP32S3 Sense. For training, we should find some data
(in fact, tons of data!).
But first of all, we need a goal! What do we want to classify?
So, let’s find a specific dataset that includes images from those categories.
Kaggle is a good start:
https://www.kaggle.com/kritikseth/fruit-and-vegetable-image-recognition
Each category is split into the train (100 images), test (10 images), and vali-
dation (10 images).
• Download the dataset from the Kaggle website and put it on your com-
puter.
Optionally, you can add some fresh photos of bananas, apples, and
potatoes from your home kitchen, using, for example, the code
discussed in the next setup lab.
We will use the Edge Impulse Studio to train our model. As you may know,
Edge Impulse is a leading development platform for machine learning on edge
devices.
Enter your account credentials (or create a free account) at Edge Impulse.
Next, create a new project:
Data Acquisition
Next, on the UPLOAD DATA section, upload from your computer the files from
chosen categories:
You should now have your training dataset split into three classes of data:
You can upload extra data for further model testing or split the
training data. I will leave it as it is to use the most data possible.
Impulse Design
An impulse takes raw data (in this case, images), extracts features
(resize pictures), and then uses a learning block to classify new data.
Classifying images is the most common use of deep learning, but a lot of data is needed to accomplish this task. We have around 90 images for each category. Is this number enough? Not at all! We would need thousands of images to "teach our model" to differentiate an apple from a banana. But we can solve this issue by re-training a model that was previously trained with thousands of images. We call this technique "Transfer Learning" (TL).
Besides resizing the images, we can change them to Grayscale or keep the actual RGB color depth. Let's start by selecting Grayscale. Doing that, each of our data samples will have 9,216 features (96 × 96 × 1). Keeping RGB, this dimension would be three times bigger. Working with Grayscale helps to reduce the amount of memory needed for inference.
Model Design
Transfer Learning
In 2017, Google introduced MobileNetV1, a family of general-purpose com-
puter vision neural networks designed with mobile devices in mind to support
classification, detection, and more. MobileNets are small, low-latency, low-
power models parameterized to meet the resource constraints of various use
cases.
Although the base MobileNet architecture is already tiny and has low latency,
many times, a specific use case or application may require the model to be
smaller and faster. MobileNet introduces a straightforward parameter 𝛼 (alpha)
called width multiplier to construct these smaller, less computationally expen-
sive models. The role of the width multiplier 𝛼 is to thin a network uniformly
at each layer.
Edge Impulse Studio has MobileNet V1 (96x96 images) and V2 (96x96 and 160x160 images) available, with several different 𝛼 values (from 0.05 to 1.0). For example, you will get the highest accuracy with V2, 160 × 160 images, and 𝛼 = 1.0. Of course, there is a trade-off: the higher the accuracy, the more memory (around 1.3 MB of RAM and 2.6 MB of ROM) will be needed to run the model, implying more latency.
The smallest footprint will be obtained at the other extreme with MobileNet V1 and 𝛼 = 0.10 (around 53.2 KB of RAM and 101 KB of ROM).
For this first pass, we will use MobileNet V1 and 𝛼 = 0.10.
Training
Data Augmentation
Another necessary technique to use with deep learning is data augmentation.
Data augmentation is a method that can help improve the accuracy of machine
learning models, creating additional artificial data. A data augmentation system
makes small, random changes to your training data during the training process
(such as flipping, cropping, or rotating the images).
Under the hood, Edge Impulse implements this data augmentation policy on your data during training.
Exposure to these variations during training can help prevent your model
from taking shortcuts by “memorizing” superficial clues in your training data,
meaning it may better reflect the deep underlying patterns in your dataset.
The final layer of our model will have 16 neurons with a 10% dropout for
overfitting prevention. Here is the Training output:
The result could be better. The model reached around 77% accuracy, but
the amount of RAM expected to be used during the inference is relatively tiny
(about 60 KBytes), which is very good.
Deployment
The trained model will be deployed as a .zip Arduino library:
Open your Arduino IDE, and under Sketch, go to Include Library and Add .ZIP Library. Please select the file you downloaded from Edge Impulse Studio, and that's it!
Under the Examples tab on Arduino IDE, you should find a sketch code
under your project name.
You can see that the first line of code is exactly the calling of a library with
all the necessary stuff for running inference on your device.
#include <XIAO-ESP32S3-CAM-Fruits-vs-Veggies_inferencing.h>
Of course, this is generic code (a "template") that only gets one sample of raw data (stored in the variable features[] = {}) and runs the classifier, doing the inference. The result is shown on the Serial Monitor.
We should get the sample (image) from the camera and pre-process it (resizing it to 96 × 96, converting it to grayscale, and flattening it). This will be the input tensor of our model. The output tensor will be a vector with three values (labels), showing the probabilities of each of the classes.
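The deployed Edge Impulse library handles this pre-processing for you; the sketch below only illustrates the grayscale conversion and flattening step, assuming an RGB888 buffer already resized to 96 × 96 (buffer names are hypothetical):

void rgb_to_gray_flat(const uint8_t *rgb, float *input_tensor,
                      int width, int height) {
    for (int i = 0; i < width * height; i++) {
        uint8_t r = rgb[3 * i + 0];
        uint8_t g = rgb[3 * i + 1];
        uint8_t b = rgb[3 * i + 2];
        // standard luminance weights for grayscale conversion
        float gray = 0.299f * r + 0.587f * g + 0.114f * b;
        input_tensor[i] = gray;   // one feature per pixel: 96 * 96 = 9,216
    }
}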
Returning to your project (Tab Image), copy one of the Raw Data Samples: 9,216 features will be copied to the clipboard. This is the input tensor (a flattened image of 96 × 96 × 1), in this case, a banana. Paste this input tensor on features[]:
features[] = {0xb2d77b, 0xb5d687, 0xd8e8c0, 0xeaecba, 0xc2cf67, ...}
Edge Impulse included the ESP NN library in its SDK, which contains optimized NN (Neural Network) functions for various Espressif chips, including the ESP32S3 (running under the Arduino IDE).
When running the inference, you should get the highest score for “banana.”
Great news! Our device handles an inference, discovering that the input
image is a banana. Also, note that the inference time was around 317 ms,
resulting in a maximum of 3 fps if you tried to classify images from a video.
Now, we should incorporate the camera and classify images in real time.
Go to the Arduino IDE Examples and download from your project the sketch
esp32_camera:
You should change lines 32 to 75, which define the camera model and pins, using the data related to our board. Copy and paste the lines below, replacing lines 32-75:
#define PWDN_GPIO_NUM -1
#define RESET_GPIO_NUM -1
#define XCLK_GPIO_NUM 10
#define SIOD_GPIO_NUM 40
#define SIOC_GPIO_NUM 39
#define Y9_GPIO_NUM 48
#define Y8_GPIO_NUM 11
#define Y7_GPIO_NUM 12
#define Y6_GPIO_NUM 14
#define Y5_GPIO_NUM 16
#define Y4_GPIO_NUM 18
#define Y3_GPIO_NUM 17
#define Y2_GPIO_NUM 15
#define VSYNC_GPIO_NUM 38
#define HREF_GPIO_NUM 47
#define PCLK_GPIO_NUM 13
Getting a photo with the camera, the classification result will appear on the
Serial Monitor:
Other tests:
Even with a bigger model, the accuracy is not much better, while the amount of memory necessary to run the model increases five times and latency increases seven times.
Note that the performance here is estimated with a smaller device,
the ESP-EYE. The actual inference with the ESP32S3 should be better.
For the test, we can train the model again, using the smallest version of MobileNet V2, with an alpha of 0.05. Interestingly, the resulting accuracy was higher.
Note that the estimated latency for an Arduino Portenta (or Nicla),
running with a clock of 480 MHz is 45 ms.
Deploying the model, we got an inference of only 135 ms, remembering that
the XIAO runs with half of the clock used by the Portenta/Nicla (240 MHz):
In our case, we will use the SenseCraft Web Toolkit and its blue button at the bottom of the page: [Upload Custom AI Model]. But first, we must download our quantized .tflite model from Edge Impulse Studio.
3. Go to your project at Edge Impulse Studio, or clone this one:
• XIAO-ESP32S3-CAM-Fruits-vs-Veggies-v1-ESP-NN
4. On the Dashboard, download the model (“block output”): Transfer
learning model - TensorFlow Lite (int8 quantized).
Note that you should use the labels trained on EI Studio, entering
them in alphabetic order (in our case: apple, banana, potato).
After a few seconds (or minutes), the model will be uploaded to your device,
and the camera image will appear in real-time on the Preview Sector:
The Classification result will be at the top of the image. You can also adjust the Confidence threshold of your inference using the Confidence cursor.
Clicking on the top button (Device Log), you can open a Serial Monitor to follow the inference, just as we did with the Arduino IDE:
Conclusion
The XIAO ESP32S3 Sense is very flexible, inexpensive, and easy to program.
The project proves the potential of TinyML. Memory is not an issue; the device
can handle many post-processing tasks, including communication.
You will find the last version of the code on the GitHub repository: XIAO-ESP32S3-Sense.
Resources
• XIAO ESP32S3 Codes
• Dataset
• Edge Impulse Project
Object Detection
Overview
In the last section regarding Computer Vision (CV) and the XIAO ESP32S3,
Image Classification, we learned how to set up and classify images with this
remarkable development board. Continuing our CV journey, we will explore
Object Detection on microcontrollers.
To solve this issue, we need another type of model, where not only multiple
categories (or labels) can be found but also where the objects are located on a
given image.
As we can imagine, such models are much more complex and larger, for example, the MobileNetV2 SSD FPN-Lite 320x320, trained with the COCO dataset. This pre-trained object detection model is designed to locate up to 10 objects within an image, outputting a bounding box for each object detected. The image below shows the result of such a model running on a Raspberry Pi:
The models used for object detection (such as MobileNet SSD or YOLO) are usually several MB in size, which is fine for use with a Raspberry Pi but unsuitable for embedded devices, where the RAM is usually limited to, at most, a few MB, as in the case of the XIAO ESP32S3.
To understand more about FOMO, you can go into the official FOMO announcement by Edge Impulse, where Louis Moreau and Mat Kelcey explain in detail how it works.
We are interested in which object is in the image, its location (centroid), and
how many we can find on it. The object’s size is not detected with FOMO, as
with MobileNet SSD or YOLO, where the Bounding Box is one of the model
outputs.
We will develop the project using the XIAO ESP32S3 for image capture and
model inference. The ML project will be developed using the Edge Impulse
Studio. But before starting the object detection project in the Studio, let’s create
a raw dataset (not labeled) with images that contain the objects to be detected.
Data Collection
You can capture images using the XIAO, your phone, or other devices. Here,
we will use the XIAO with code from the Arduino IDE ESP32 library.
Attention
Alpha versions (for example, 3.x-alpha) do not work correctly with
the XIAO and Edge Impulse. Use the last stable version (for example,
2.0.11) instead.
You should also comment out all the camera models, except the XIAO model pins:
#define CAMERA_MODEL_XIAO_ESP32S3 // Has PSRAM
And on Tools, enable the PSRAM. Enter your Wi-Fi credentials and upload the code to the device:
If the code is executed correctly, you should see the address on the Serial
Monitor:
Copy the address on your browser and wait for the page to be uploaded.
Select the camera resolution (for example, QVGA) and select [START STREAM].
Wait for a few seconds/minutes, depending on your connection. You can save
an image on your computer download area using the [Save] button.
Edge Impulse suggests that the objects should be similar in size and not overlapping for better performance. This is fine in an industrial facility, where the camera is usually fixed, keeping the same distance from the objects to be detected. Despite that, we will also try using mixed sizes and positions to see the result.
We suggest using around 50 images to mix the objects and vary the number
of each appearing on the scene. Try to capture different angles, backgrounds,
and light conditions.
The stored images use a QVGA frame size of 320 × 240 and RGB565
(color pixel format).
After capturing your dataset, [Stop Stream] and move your images to a
folder.
Here, you can clone the project developed for this hands-on: XIAO-
ESP32S3-Sense-Object_Detection
On your Project Dashboard, go down to Project info and select Bounding boxes (object detection) as the labeling method and Espressif ESP-EYE (the most similar to our board) as your Target Device:
You can let the Studio split your data automatically between Train and Test, or do it manually. We will upload all of them as training data.
All the not-labeled images (47) were uploaded but must be labeled appro-
priately before being used as a project dataset. The Studio has a tool for that
purpose, which you can find in the link Labeling queue (47).
There are two ways you can use to perform AI-assisted labeling on the Edge
Impulse Studio (free version):
• Using yolov5
• Tracking objects between frames
You can use the EI uploader to import your data if you already have
a labeled dataset containing bounding boxes.
Continue with this process until the queue is empty. At the end, all images
should have the objects labeled as those samples below:
Next, review the labeled samples on the Data acquisition tab. If one of the
labels is wrong, you can edit it using the three dots menu after the sample name:
You will be guided to replace the wrong label and correct the dataset (which, in our case, ended up with 58 images). After labeling them, it is time to select some images and move them to the test dataset. You can do this using the three-dot menu after the image name. I selected six images, representing 13% of the total dataset.
The Studio moves automatically to the next section, Generate features, where
all samples will be pre-processed, resulting in a dataset with individual 96 ×
96 × 1 images or 9,216 features.
The feature explorer shows that all samples present a good separation after the feature generation.
Some samples seem to be in the wrong space, but clicking on them
confirms the correct labeling.
For training, we should select a pre-trained model. Let’s use the FOMO
(Faster Objects, More Objects) MobileNetV2 0.35. This model uses around
250 KB of RAM and 80 KB of ROM (Flash), which suits well with our board.
Once connected, you can use the smartphone to capture actual images to be
tested by the trained model on Edge Impulse Studio.
One thing to be noted is that the model can produce false positives and
negatives. This can be minimized by defining a proper Confidence Threshold
(use the Three dots menu for the setup). Try with 0.8 or more.
Open your Arduino IDE, and under Sketch, go to Include Library and Add .ZIP Library. Select the file you downloaded from Edge Impulse Studio, and that's it!
Under the Examples tab on Arduino IDE, you should find a sketch code
(esp32 > esp32_camera) under your project name.
You should change lines 32 to 75, which define the camera model and pins, using the data related to our board. Copy and paste the lines below, replacing lines 32-75:
#define PWDN_GPIO_NUM -1
#define RESET_GPIO_NUM -1
#define XCLK_GPIO_NUM 10
#define SIOD_GPIO_NUM 40
#define SIOC_GPIO_NUM 39
#define Y9_GPIO_NUM 48
#define Y8_GPIO_NUM 11
#define Y7_GPIO_NUM 12
#define Y6_GPIO_NUM 14
#define Y5_GPIO_NUM 16
#define Y4_GPIO_NUM 18
#define Y3_GPIO_NUM 17
#define Y2_GPIO_NUM 15
#define VSYNC_GPIO_NUM 38
#define HREF_GPIO_NUM 47
#define PCLK_GPIO_NUM 13
Upload the code to your XIAO ESP32S3 Sense, and you should be OK to start
detecting fruits and bugs. You can check the result on Serial Monitor.
Background
Fruits
Bugs
Note that the model latency is 143 ms, and the frame rate is around 7 fps (similar to what we got with the Image Classification project). This happens because FOMO is cleverly built over a CNN model, not over an object detection model like SSD MobileNet. For example, when running a MobileNetV2 SSD FPN-Lite 320 × 320 model on a Raspberry Pi 4, the latency is around five times higher (around 1.5 fps).
In our case, we will again use the SenseCraft Web Toolkit and its blue button at the bottom of the page: [Upload Custom AI Model]. But first, we must download our quantized .tflite model from Edge Impulse Studio.
3. Go to your project at Edge Impulse Studio, or clone this one:
• XIAO-ESP32S3-CAM-Fruits-vs-Veggies-v1-ESP-NN
4. On Dashboard, download the model (“block output”): Object Detection
model - TensorFlow Lite (int8 quantized)
Note that you should use the labels trained on EI Studio and enter
them in alphabetic order (in our case, background, bug, fruit).
After a few seconds (or minutes), the model will be uploaded to your device,
and the camera image will appear in real-time on the Preview Sector:
The detected objects will be marked (the centroid). You can adjust the Confidence of your inference and the IoU, which is used to assess the accuracy of predicted bounding boxes compared to the ground-truth bounding boxes.
Clicking on the top button (Device Log), you can open a Serial Monitor to
follow the inference, as we did with the Arduino IDE.
Note that in the above example, we got 5 boxes because one of the fruits got 3 centroids. One solution is post-processing, where we can aggregate close centroids into one.
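A simple sketch of such an aggregation step is shown below; the Centroid struct, the function name, and the distance threshold are assumptions made only for illustration:

#include <math.h>

struct Centroid {
    float x, y;
    int   label;
    bool  merged;
};

// Merge centroids of the same label that are closer than min_dist
int aggregate_centroids(Centroid *c, int n, float min_dist) {
    int kept = 0;
    for (int i = 0; i < n; i++) {
        if (c[i].merged) continue;
        for (int j = i + 1; j < n; j++) {
            if (c[j].merged || c[j].label != c[i].label) continue;
            float dx = c[i].x - c[j].x;
            float dy = c[i].y - c[j].y;
            if (sqrtf(dx * dx + dy * dy) < min_dist) {
                c[i].x = (c[i].x + c[j].x) / 2.0f;   // average the positions
                c[i].y = (c[i].y + c[j].y) / 2.0f;
                c[j].merged = true;                  // drop the second centroid
            }
        }
        kept++;
    }
    return kept;   // number of centroids remaining after aggregation
}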
Here are other screenshots:
Conclusion
FOMO is a significant leap in the image processing space, as Louis Moreau and Mat Kelcey described during its launch in 2022.
Multiple possibilities exist for exploring object detection (and, more precisely,
counting them) on embedded devices.
Resources
• Edge Impulse Project
Keyword Spotting (KWS)
Overview
Keyword Spotting (KWS) is integral to many voice recognition systems, en-
abling devices to respond to specific words or phrases. While this technology
underpins popular devices like Google Assistant or Amazon Alexa, it’s equally
applicable and achievable on smaller, low-power devices. This lab will guide
you through implementing a KWS system using TinyML on the XIAO ESP32S3
microcontroller board.
The XIAO ESP32S3, equipped with Espressif’s ESP32-S3 chip, is a compact
and potent microcontroller offering a dual-core Xtensa LX7 processor, inte-
grated Wi-Fi, and Bluetooth. Its balance of computational power, energy ef-
ficiency, and versatile connectivity make it a fantastic platform for TinyML
applications. Also, with its expansion board, we will have access to the “sense”
part of the device, which has a 1600 × 1200 OV2640 camera, an SD card slot,
and a digital microphone. The integrated microphone and the SD card will be
essential in this project.
We will use the Edge Impulse Studio, a powerful, user-friendly platform that
simplifies creating and deploying machine learning models onto edge devices.
We’ll train a KWS model step-by-step, optimizing and deploying it onto the
XIAO ESP32S3 Sense.
Our model will be designed to recognize keywords that can trigger device
wake-up or specific actions (in the case of “YES”), bringing your projects to life
with voice-activated commands.
Leveraging our experience with TensorFlow Lite for Microcontrollers (the
engine “under the hood” on the EI Studio), we’ll create a KWS system capable
of real-time machine learning on the device.
As we progress through the lab, we’ll break down each process stage –
from data collection and preparation to model training and deployment – to
provide a comprehensive understanding of implementing a KWS system on a
microcontroller.
The four classes used in this project are:
• YES (Keyword 1)
• NO (Keyword 2)
• NOISE (no keywords spoken, only background noise is present)
• UNKNOWN (a mix of different words than YES and NO)
Dataset
The critical component of Machine Learning Workflow is the dataset. Once
we have decided on specific keywords (YES and NO), we can take advantage
of the dataset developed by Pete Warden, “Speech Commands: A Dataset for
Limited-Vocabulary Speech Recognition.” This dataset has 35 keywords (with
+1,000 samples each), such as yes, no, stop, and go. In other words, we can get
1,500 samples of yes and no.
You can download a small portion of the dataset from Edge Studio (Keyword
spotting pre-built dataset), which includes samples from the four classes we will use in this project: yes, no, noise, and unknown. For this, follow the steps below:
• Download the keywords dataset.
• Unzip the file in a location of your choice.
Although we have a lot of data from Pete’s dataset, collecting some words
spoken by us is advised. When working with accelerometers, creating a dataset
with data captured by the same type of sensor was essential. In the case of
sound, it is different because what we will classify is, in reality, audio data.
The key difference between sound and audio is their form of energy.
Sound is mechanical wave energy (longitudinal sound waves) that
propagate through a medium causing variations in pressure within
the medium. Audio is made of electrical energy (analog or digital
signals) that represent sound electrically.
The sound waves should be converted to audio data when we speak a key-
word. The conversion should be done by sampling the signal generated by the
microphone in 16 kHz with a 16-bit depth.
So, any device that can generate audio data with this basic specification (16
kHz/16 bits) will work fine. As a device, we can use the proper XIAO ESP32S3
Sense, a computer, or even your mobile phone.
What is I2S?
I2S, or Inter-IC Sound, is a standard protocol for transmitting digital audio
from one device to another. It was initially developed by Philips Semicon-
ductor (now NXP Semiconductors). It is commonly used in audio devices
such as digital signal processors, digital audio processors, and, more recently,
microcontrollers with digital audio capabilities (our case here).
1. Bit (or Serial) clock line (BCLK or CLK): This line toggles to indicate the
start of a new bit of data (pin IO42).
2. Word select line (WS): This line toggles to indicate the start of a new word
(left channel or right channel). The Word select clock (WS) frequency defines
the sample rate. In our case, L/R on the microphone is set to ground, meaning
that we will use only the left channel (mono).
3. Data line (SD): This line carries the audio data (pin IO41)
In an I2S data stream, the data is sent as a sequence of frames, each containing
a left-channel word and a right-channel word. This makes I2S particularly suited
for transmitting stereo audio data. However, it can also be used for mono or
multichannel audio with additional data lines.
Let's start understanding how to capture raw data using the microphone. Go to the GitHub project and download the sketch: XIAOEsp2s3_Mic_Test:
/*
  XIAO ESP32S3 Simple Mic Test
*/
#include <I2S.h>

void setup() {
    Serial.begin(115200);
    while (!Serial) {
    }
    // Set up the I2S pins (CLK on IO42, DATA on IO41) and start the
    // interface in PDM mono mode, 16 kHz, 16 bits per sample
    I2S.setAllPins(-1, 42, 41, -1, -1);
    if (!I2S.begin(PDM_MONO_MODE, 16000, 16)) {
        Serial.println("Failed to initialize I2S!");
        while (1);
    }
}

void loop() {
    // read a sample and print it to the Serial Monitor
    int sample = I2S.read();
    Serial.println(sample);
}
This code is a simple microphone test for the XIAO ESP32S3 using the I2S
(Inter-IC Sound) interface. It sets up the I2S interface to capture audio data at
a sample rate of 16 kHz with 16 bits per sample and then continuously reads
samples from the microphone and prints them to the serial monitor.
Let’s dig into the code’s main parts:
• Include the I2S library: This library provides functions to configure and
use the I2S interface, which is a standard for connecting digital audio
devices.
• I2S.setAllPins(-1, 42, 41, -1, -1): This sets up the I2S pins. The parameters are (-1, 42, 41, -1, -1), where the second parameter (42) is the PIN for the I2S clock (CLK), and the third parameter (41) is the PIN for the I2S data (DATA) line. The other parameters are set to -1, meaning those pins are not used.
• I2S.begin(PDM_MONO_MODE, 16000, 16): This initializes the I2S in-
terface in Pulse Density Modulation (PDM) mono mode, with a sample
rate of 16 kHz and 16 bits per sample. If the initialization fails, an error
message is printed, and the program halts.
• int sample = I2S.read(): This reads an audio sample from the I2S interface.
For a start, Insert the SD Card on the XIAO as shown in the photo below (the
SD Card should be formatted to FAT32).
Turn the PSRAM function of the ESP-32 chip on (Arduino IDE): Tools>PSRAM:
“OPI PSRAM”>OPI PSRAM
#include <I2S.h>
#include "FS.h"
#include "SD.h"
#include "SPI.h"
Those are the necessary libraries for the program. I2S.h allows for audio
input, FS.h provides file system handling capabilities, SD.h enables the program
to interact with an SD card, and SPI.h handles the SPI communication with the
SD card.
#define RECORD_TIME 10
#define SAMPLE_RATE 16000U
#define SAMPLE_BITS 16
#define WAV_HEADER_SIZE 44
#define VOLUME_GAIN 2
int fileNumber = 1;
String baseFileName;
bool isRecording = false;
These variables keep track of the current file number (to create unique file
names), the base file name, and whether the system is currently recording.
void setup() {
    Serial.begin(115200);
    while (!Serial);

    // Initialize the I2S interface (PDM mono, 16 kHz, 16 bits per sample)
    I2S.setAllPins(-1, 42, 41, -1, -1);
    if (!I2S.begin(PDM_MONO_MODE, SAMPLE_RATE, SAMPLE_BITS)) {
        Serial.println("Failed to initialize I2S!");
        while (1);
    }

    // Mount the SD card (CS on pin 21)
    if (!SD.begin(21)) {
        Serial.println("Failed to mount SD Card!");
        while (1);
    }
    Serial.printf("Enter with the label name\n");
}
The setup function initializes the serial communication, I2S interface for
audio input, and SD card interface. If the I2S did not initialize or the SD card
fails to mount, it will print an error message and halt execution.
void loop() {
    if (Serial.available() > 0) {
        String command = Serial.readStringUntil('\n');
        command.trim();
        if (command == "rec") {
            isRecording = true;
        } else {
            baseFileName = command;
            fileNumber = 1; // reset file number each time a new base file name is set
            Serial.printf("Send rec for starting recording label \n");
        }
    }
    if (isRecording && baseFileName != "") {
        String fileName = "/" + baseFileName + "." + String(fileNumber) + ".wav";
        fileNumber++;
        record_wav(fileName);
        delay(1000); // delay to avoid recording multiple files at once
        isRecording = false;
    }
}
In the main loop, the program waits for a command from the serial monitor.
If the command is rec, the program starts recording. Otherwise, the command
is assumed to be the base name for the .wav files. If it’s currently recording and
a base file name is set, it records the audio and saves it as a .wav file. The file names are generated by appending the file number to the base file name.
...
rec_buffer = (uint8_t *)ps_malloc(record_size);
...
esp_i2s::i2s_read(esp_i2s::I2S_NUM_0,
rec_buffer,
record_size,
&sample_size,
portMAX_DELAY);
...
}
This function records audio and saves it as a .wav file with the given name. It starts by initializing the sample_size and record_size variables. record_size is calculated based on the sample rate, bit depth, and desired recording time. Let's dig into the essential sections:
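As a back-of-the-envelope sketch (not code copied from the original), the buffer size implied by the defines above is:

// 16,000 samples/s * 2 bytes/sample * 10 s = 320,000 bytes (~313 KB),
// which is why the buffer is allocated in PSRAM rather than internal RAM
uint32_t record_size = (SAMPLE_RATE * SAMPLE_BITS / 8) * RECORD_TIME;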
File file = SD.open(fileName.c_str(), FILE_WRITE);
// Write the header to the WAV file
uint8_t wav_header[WAV_HEADER_SIZE];
generate_wav_header(wav_header, record_size, SAMPLE_RATE);
file.write(wav_header, WAV_HEADER_SIZE);
This section of the code opens the file on the SD card for writing and then
generates the .wav file header using the generate_wav_header function. It then
writes the header to the file.
// PSRAM malloc for recording
rec_buffer = (uint8_t *)ps_malloc(record_size);
if (rec_buffer == NULL) {
Serial.printf("malloc failed!\n");
while(1) ;
}
Serial.printf("Buffer: %d bytes\n", ESP.getPsramSize()
- ESP.getFreePsram());
The ps_malloc function allocates memory in the PSRAM for the recording.
If the allocation fails (i.e., rec_buffer is NULL), it prints an error message and
halts execution.
// Start recording
esp_i2s::i2s_read(esp_i2s::I2S_NUM_0,
rec_buffer,
record_size,
&sample_size,
portMAX_DELAY);
if (sample_size == 0) {
Serial.printf("Record Failed!\n");
} else {
Serial.printf("Record %d bytes\n", sample_size);
}
The i2s_read function reads audio data from the microphone into rec_buffer.
It prints an error message if no data is read (sample_size is 0).
// Increase volume
for (uint32_t i = 0; i < sample_size; i += SAMPLE_BITS/8) {
(*(uint16_t *)(rec_buffer+i)) <<= VOLUME_GAIN;
}
This section of the code increases the recording volume by shifting the sample
values by VOLUME_GAIN.
free(rec_buffer);
file.close();
Serial.printf("Recording complete: \n");
Serial.printf("Send rec for a new sample or enter a new label\n\n");
Finally, the audio data is written to the .wav file. If the write operation fails,
it prints an error message. After writing, the memory allocated for rec_buffer
is freed, and the file is closed. The function finishes by printing a completion
message and prompting the user to send a new command.
Send the label (for example, yes). The program will wait for another com-
mand: rec
And the program will start recording new samples every time a command
rec is sent. The files will be saved as yes.1.wav, yes.2.wav, yes.3.wav, etc., until a
new label (for example, no) is sent. In this case, you should send the command
rec for each new sample, which will be saved as no.1.wav, no.2.wav, no.3.wav,
etc.
Note that any app, such as Audacity, can be used for audio recording
or even your computer.
Once the project is created, select the Upload Existing Data tool in the Data
Acquisition section. Choose the files to be uploaded:
And upload them to the Studio (You can automatically split data in train/test).
Repeat to all classes and all raw data.
All data in Pete's dataset have a 1 s length, but the samples recorded in the previous section are 10 s long and must be split into 1 s samples to be compatible. Click on the three dots after the sample name and select Split sample.
Once inside the tool, split the data into 1-second records. If necessary, add or
remove segments:
We can optionally check all datasets using the tab Data Explorer.
First, we will take the data points with a 1-second window, augmenting the
data, sliding that window each 500 ms. Note that the option zero-pad data is
set. It is essential to fill with zeros samples smaller than 1 second (in some cases,
I reduced the 1000 ms window on the split tool to avoid noises and spikes).
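As a quick, illustrative sanity check, the number of 1-second windows obtained from each 10-second sample with a 500 ms slide is:

int sample_ms = 10000, window_ms = 1000, stride_ms = 500;
int n_windows = (sample_ms - window_ms) / stride_ms + 1;   // 19 windows per sample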
Each 1-second audio sample should be pre-processed and converted to an image (for example, 13 × 49 × 1). We will use MFCC, which extracts features from audio signals using Mel Frequency Cepstral Coefficients, which are great for the human voice.
Next, we select KERAS for classification and build our model from scratch by doing Image Classification using a Convolutional Neural Network (CNN).
Pre-Processing (MFCC)
The next step is to create the images to be used for training in the next phase. We can keep the default parameter values or take advantage of the DSP Autotune parameters option, which we will do.
The result will not use much memory to pre-process the data (only 16 KB). Still, the estimated processing time is high: 675 ms for an Espressif ESP-EYE (the closest reference available), with a 240 MHz clock (same as our device), but with a smaller CPU (Xtensa LX6, versus the LX7 on the ESP32S3). The real inference time should be smaller.
Suppose we need to reduce the inference time later. In that case, we should return to the pre-processing stage and, for example, reduce the FFT length to 256, change the Number of coefficients, or adjust another parameter.
For now, let’s keep the parameters defined by the Autotuning tool. Save
parameters and generate the features.
If you want to understand what is happening “under the hood,” you can
download the dataset and run a Jupyter Notebook playing with the code. For
example, you can analyze the accuracy by each epoch:
This Colab Notebook can explain how you can go further: KWS Classifier Project - Looking "Under the hood" (Training/xiao_esp32s3_keyword_spotting_project_nn_classifier.ipynb).
Testing
Testing the model with the data set apart before training (Test Data), we got an accuracy of approximately 87%.
Inspecting the F1 score, we can see that for YES, we got 0.95, an excellent result since we use this keyword to "trigger" our postprocessing stage (turning on the built-in LED). Even for NO, we got 0.90. The worst result is for UNKNOWN, which is OK.
We can proceed with the project, but it is possible to perform Live Classifi-
cation using a smartphone before deployment on our device. Go to the Live
Classification section and click on Connect a Development board:
Your phone will be connected to the Studio. Select the option Classification
on the app, and when it is running, start testing your keywords, confirming
that the model is working with live and real data:
Deploy and Inference
Now it is time for a real test. We will make inferences wholly disconnected
from the Studio. Let’s change one of the ESP32 code examples created when
you deploy the Arduino Library.
In your Arduino IDE, go to the File/Examples tab, look for your project, and select esp32/esp32_microphone:
This code was created for the ESP-EYE built-in microphone, which should
be adapted for our device.
Start by changing the libraries that handle the I2S bus, replacing the original includes with:
#include <I2S.h>
#define SAMPLE_RATE 16000U
#define SAMPLE_BITS 16
void setup()
{
...
I2S.setAllPins(-1, 42, 41, -1, -1);
if (!I2S.begin(PDM_MONO_MODE, SAMPLE_RATE, SAMPLE_BITS)) {
Serial.println("Failed to initialize I2S!");
while (1) ;
...
}
In the static void capture_samples(void* arg) function, replace line 153, which reads data from the I2S mic, with the equivalent read call from the XIAO's I2S library.
You can find the complete code on the project’s GitHub. Upload the sketch
to your board and test some real inferences:
Postprocessing
Now that we know the model is working by detecting our keywords, let’s
modify the code to see the internal LED going on every time a YES is detected.
You should initialize the LED:
#define LED_BUILT_IN 21
...
void setup()
{
...
pinMode(LED_BUILT_IN, OUTPUT); // Set the pin as output
digitalWrite(LED_BUILT_IN, HIGH); //Turn off
...
}
And change the // print the predictions portion of the previous code (in loop()):
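The modified block is not reproduced here, but a minimal sketch of the idea, assuming the result struct from the Edge Impulse example, the alphabetical label order (no, noise, unknown, yes), and an arbitrary 0.8 threshold, could be:

// Print all scores and turn the LED on only when "yes" is confidently detected
for (size_t ix = 0; ix < EI_CLASSIFIER_LABEL_COUNT; ix++) {
    ei_printf("    %s: %.5f\n", result.classification[ix].label,
              result.classification[ix].value);
}
if (result.classification[3].value > 0.8) {
    digitalWrite(LED_BUILT_IN, LOW);    // inverted logic: LOW = LED on
} else {
    digitalWrite(LED_BUILT_IN, HIGH);   // LED off
}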
You can find the complete code on the project’s GitHub. Upload the sketch
to your board and test some real inferences:
The idea is that the LED will be ON whenever the keyword YES is detected.
In the same way, instead of turning on an LED, this could be a “trigger” for an
external device, as we saw in the introduction.
Conclusion
The Seeed XIAO ESP32S3 Sense is a tiny giant of a device! It is powerful,
reliable, inexpensive, and low power, and it has suitable sensors for the most
common embedded machine learning applications, such as vision and sound.
Even though Edge Impulse does not officially support the XIAO ESP32S3
Sense (yet!), we realized that using the Studio for training and deployment is
straightforward.
On my GitHub repository, you will find the latest version of all the
code used in this project and the previous ones in the XIAO ESP32S3
series.
Before we finish, consider that Sound Classification is more than just voice.
For example, you can develop TinyML projects around sound in several areas,
such as:
• Security (Broken Glass detection)
• Industry (Anomaly Detection)
• Medical (Snore, Toss, Pulmonary diseases)
• Nature (Beehive control, insect sound)
Resources
• XIAO ESP32S3 Codes
• Subset of Google Speech Commands Dataset
Overview
The XIAO ESP32S3 Sense, with its built-in camera and mic, is a versatile device.
But what if you need to add another type of sensor, such as an IMU? No problem!
Installing the IMU
One of the standout features of the XIAO ESP32S3 is its multiple pins that can
be used as an I2C bus (SDA/SCL pins), making it a suitable platform for sensor
integration.
Usually, the libraries available are for MPU6050, but they work for
both devices.
Connecting the HW
Connect the IMU to the XIAO according to the below diagram:
• MPU6050 SCL –> XIAO D5
• MPU6050 SDA –> XIAO D4
• MPU6050 VCC –> XIAO 3.3V
• MPU6050 GND –> XIAO GND
/*
 * Based on the I2C device class (I2Cdev) Arduino sketch for the MPU6050
 * by Jeff Rowberg <[email protected]>
 * and the Edge Impulse Data Forwarder Example (Arduino)
 * - https://docs.edgeimpulse.com/docs/cli-data-forwarder
 *
 * Developed by M.Rovai @11May23
 */
#include "I2Cdev.h"
#include "MPU6050.h"
#include "Wire.h"

#define FREQUENCY_HZ      50
#define INTERVAL_MS       (1000 / (FREQUENCY_HZ + 1))
#define ACC_RANGE         1   // 0: +/-2G; 1: +/-4G

// Convert raw readings to m/s^2 (the MPU6050 outputs 8192 LSB/g at +/-4G)
#define CONVERT_G_TO_MS2  (9.81 / (16384.0 / (1. + ACC_RANGE)))

static unsigned long last_interval_ms = 0;

MPU6050 imu;
int16_t ax, ay, az;

void setup() {
  Serial.begin(115200);

  // initialize device
  Serial.println("Initializing I2C devices...");
  Wire.begin();
  imu.initialize();
  delay(10);

  // // verify connection
  // if (imu.testConnection()) {
  //   Serial.println("IMU connected");
  // }
  // else {
  //   Serial.println("IMU Error");
  // }
  delay(300);

  // Calibration offsets (replace them with the values obtained from the
  // calibration sketch for your own IMU)
  imu.setZAccelOffset(8867);
  imu.setXGyroOffset(61);
  imu.setYGyroOffset(-73);
  imu.setZGyroOffset(35);

  // Set the accelerometer full-scale range (+/-4G)
  imu.setFullScaleAccelRange(ACC_RANGE);
}

void loop() {
  if (millis() > last_interval_ms + INTERVAL_MS) {
    last_interval_ms = millis();

    // read raw accelerometer measurements
    imu.getAcceleration(&ax, &ay, &az);

    // converting to m/s^2
    float ax_m_s2 = ax * CONVERT_G_TO_MS2;
    float ay_m_s2 = ay * CONVERT_G_TO_MS2;
    float az_m_s2 = az * CONVERT_G_TO_MS2;

    Serial.print(ax_m_s2);
    Serial.print("\t");
    Serial.print(ay_m_s2);
    Serial.print("\t");
    Serial.println(az_m_s2);
  }
}
When you run the code with the IMU resting on your table, the accelerom-
eter data shown on the Serial Monitor should be around 0.00, 0.00, and 9.81. If
the values differ significantly, you should calibrate the IMU.
The MPU6050 can be calibrated using the sketch: mcu6050-calibration.ino.
Run the code. The following will be displayed on the Serial Monitor:
Send any character (in the above example, “x”), and the calibration should
start.
In the end, you will receive the offset values to be used on all your sketches:
Move your device along the three axes. You should see the variation on the Serial Plotter:
The TinyML Motion Classification Project
For our lab, we will simulate mechanical stresses in transport. Our problem
will be to classify four classes of movement:
• Maritime (pallets in boats)
• Terrestrial (pallets in a truck or train)
• Lift (pallets being handled by a forklift)
• Idle (pallets in storage houses)
So, to start, we should collect data. Accelerometers mounted on the pallet
(or container) will provide the data.
From the above images, we can see that primarily horizontal movements
should be associated with the "Terrestrial" class, vertical movements with the
"Lift" class, no activity with the "Idle" class, and movement on all three axes
with the "Maritime" class.
For data collection, we should first connect our device to the Edge Impulse
Studio, which will also be used for data pre-processing, model training, testing,
and deployment.
Since the XIAO ESP32S3 is not a fully supported development board by Edge
Impulse, we should, for example, use the CLI Data Forwarder to capture data
from our sensor and send it to the Studio, as shown in this diagram:
Connect your device to the serial port and run the previous code to capture
IMU (Accelerometer) data, “printing them” on the serial. This will allow the
Edge Impulse Studio to “capture” them.
Go to the Edge Impulse page and create a project.
Start the CLI Data Forwarder on your terminal, entering (if it is the first time)
the following command:
edge-impulse-data-forwarder --clean
Next, enter your EI credentials and choose your project, variables, and device
names:
Go to your EI Project and verify if the device is connected (the dot should be
green):
Data Collection
As discussed before, we should capture data from all four Transportation
Classes. Imagine that you have a container with a built-in accelerometer:
You can capture, for example, 2 minutes (twelve samples of 10 seconds each)
for each of the four classes. Using the "3 dots" menu after each sample, select 2
of them, moving them to the Test set (or use the automatic Train/Test Split tool in the
Danger Zone of the Dashboard tab). Below, you can see the resulting datasets:
Data Pre-Processing
The raw data type captured by the accelerometer is a "time series" and should
be converted to "tabular data". We can do this conversion using a sliding
window over the sample data. For example, in the figure below, we can see
10 seconds of accelerometer data captured with a sample rate (SR) of 50 Hz.
A 2-second window will capture 300 data points (3 axes × 2 seconds
× 50 samples per second). We will slide this window every 200 ms, creating a larger
dataset where each instance has 300 raw features.
You should use the best SR for your case, considering Nyquist's
theorem, which states that a signal must be sampled at more than
twice its highest frequency component.
Data preprocessing is a challenging area for embedded machine learning.
Still, Edge Impulse helps overcome this with its digital signal processing (DSP)
preprocessing step and, more specifically, the Spectral Features.
On the Studio, this dataset will be the input of a Spectral Analysis block, which
is excellent for analyzing repetitive motion, such as data from accelerometers.
This block will perform a DSP (Digital Signal Processing), extracting features
such as “FFT” or “Wavelets”. In the most common case, FFT, the Time Domain
Statistical features per axis/channel are:
• RMS
• Skewness
• Kurtosis
And the Frequency Domain Spectral features per axis/channel are:
• Spectral Power
• Skewness
• Kurtosis
For example, for an FFT length of 32 points, the Spectral Analysis Block’s
resulting output will be 21 features per axis (a total of 63 features).
Those 63 features will be the Input Tensor of a Neural Network Classifier and
the Anomaly Detection model (K-Means).
You can learn more by digging into the DSP Spectral Features lab.
Model Design
Our classifier will be a Dense Neural Network (DNN) that will have 63 neurons
on its input layer, two hidden layers with 20 and 10 neurons, and an output
layer with four neurons (one per each class), as shown here:
Impulse Design
An impulse takes raw data, uses signal processing to extract features, and then
uses a learning block to classify new data.
We also take advantage of a second model, K-means, which can be used for
Anomaly Detection. If we imagine that our known classes form clusters, any
sample that does not fit into one of them could be an outlier, an anomaly
(for example, a container rolling out of a ship on the ocean).
Generating features
At this point in our project, we have defined the pre-processing method and
designed the model. Now, it is time to get the job done. First, let's take the
raw data (time-series type) and convert it to tabular data. Go to the Spectral
Features tab and select Save Parameters:
At the top menu, select the Generate Features option and the Generate Fea-
tures button. Each 2-second window data will be converted into one data point
of 63 features.
The Feature Explorer will show those data in 2D using UMAP. Uni-
form Manifold Approximation and Projection (UMAP) is a dimen-
sion reduction technique that can be used for visualization similarly
to t-SNE but also for general non-linear dimension reduction.
The visualization allows one to verify that the classes present an excellent
separation, which indicates that the classifier should work well.
Optionally, you can analyze the relative importance of each feature for one
class compared with other classes.
Training
Our model has four layers, as shown below:
For anomaly detection, we should choose the suggested features that are
precisely the most important in feature extraction. The number of clusters will
be 32, as suggested by the Studio:
Testing
Using the 20% of the data set aside during the data capture phase, we can verify
how our model behaves with unknown data. Although not 100% (which is expected),
the result was not that good (8%), mainly due to the terrestrial class. Since
we have four classes (whose outputs should add up to 1.0), we can set a lower
threshold for a class to be considered valid (for example, 0.4):
You should also use your device (which is still connected to the Studio) and
perform some Live Classification.
Be aware that here you will capture real data with your device and
upload it to the Studio, where the inference will be run using the
trained model (but the model is NOT on your device).
Deploy
Now it is time for magic! The Studio will package all the needed libraries,
preprocessing functions, and trained models, downloading them to your com-
puter. You should select the option Arduino Library, and at the bottom, choose
Quantized (Int8) and Build. A Zip file will be created and downloaded to your
computer.
On your Arduino IDE, go to the Sketch tab, select the option Add .ZIP Library,
and choose the .zip file downloaded by the Studio:
Inference
Now, it is time for a real test. We will make inferences that are wholly discon-
nected from the Studio. Let’s change one of the code examples created when
you deploy the Arduino Library.
In your Arduino IDE, go to the File/Examples tab and look for your project,
and on examples, select nano_ble_sense_accelerometer:
Of course, this is not your board, but we can have the code working with
only a few changes.
For example, at the beginning of the code, you have the library related to
Arduino Sense IMU:
/* Includes -------------------------------------------- */
#include <XIAO-ESP32S3-Motion-Classification_inferencing.h>
#include <Arduino_LSM9DS1.h>
Change the “includes” portion with the code related to the IMU:
#include <XIAO-ESP32S3-Motion-Classification_inferencing.h>
#include "I2Cdev.h"
#include "MPU6050.h"
#include "Wire.h"
In the setup function, initialize the IMU and set the offset values and range:
// initialize device
Serial.println("Initializing I2C devices...");
Wire.begin();
imu.initialize();
delay(10);
imu.setFullScaleAccelRange(ACC_RANGE);
In the loop function, the buffers buffer[ix], buffer[ix + 1], and buffer[ix + 2]
will receive the 3-axis data captured by the accelerometer. In the original code,
you have the line:
You should change the order of the following two blocks of code. First, you
make the conversion of the raw data to "meters per second squared (m/s²)", followed
by the test regarding the maximum acceptance range (which here is in m/s², but
in the Arduino example was in Gs):
buffer[ix + 0] *= CONVERT_G_TO_MS2;
buffer[ix + 1] *= CONVERT_G_TO_MS2;
buffer[ix + 2] *= CONVERT_G_TO_MS2;
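For reference, a minimal sketch of how the reordered section might look, combining
the MPU6050 read, the conversion, and the range check (the MAX_ACCEPTED_RANGE name
comes from the standard Edge Impulse accelerometer example; adapt it to the code
generated for your project):
// read the MPU6050 into the inference buffer
imu.getAcceleration(&ax, &ay, &az);
buffer[ix + 0] = ax;
buffer[ix + 1] = ay;
buffer[ix + 2] = az;

// 1) first, convert raw readings to m/s^2
buffer[ix + 0] *= CONVERT_G_TO_MS2;
buffer[ix + 1] *= CONVERT_G_TO_MS2;
buffer[ix + 2] *= CONVERT_G_TO_MS2;

// 2) then, test against the maximum accepted range (now in m/s^2)
for (int i = 0; i < 3; i++) {
    if (fabs(buffer[ix + i]) > MAX_ACCEPTED_RANGE) {
        buffer[ix + i] = (buffer[ix + i] > 0 ? 1.0f : -1.0f) * MAX_ACCEPTED_RANGE;
    }
}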
And that is it! You can now upload the code to your device and proceed with
the inferences. The complete code is available on the project’s GitHub.
Now you should try your movements, seeing the result of the inference of
each class on the images:
And, of course, some “anomaly”, for example, putting the XIAO upside-
down. The anomaly score will be over 1:
Conclusion
Regarding the IMU, this project used the low-cost MPU6050, but it could also use
other IMUs, for example, the LCM20600 (6-axis), which is part of the Seeed
Grove - IMU 9DOF (lcm20600+AK09918). You can take advantage of this sensor,
which has an integrated Grove connector; this can be helpful if you
use the XIAO with an extension board, as shown below:
You can follow the instructions here to connect the IMU with the MCU. Only
note that for using the Grove ICM20600 Accelerometer, it is essential to update
the files I2Cdev.cpp and I2Cdev.h that you will download from the library
provided by Seeed Studio. For that, replace both files from this link. You can
find a sketch for testing the IMU on the GitHub project: accelerometer_test.ino.
On the projet’s GitHub repository, you will find the last version of
all codeand other docs: XIAO-ESP32S3 - IMU.
Resources
• XIAO ESP32S3 Codes
• Edge Impulse Spectral Features Block Colab Notebook
• Edge Impulse Project
Grove Vision AI V2
Pre-requisites
• Grove Vision AI V2 Board: Ensure you have the Grove Vision AI V2
Board.
• Raspberry Pi OV5647 Camera Module: The camera should be connected
to the Grove Vision AI V2 Board for image capture.
• Master Controller: Can be a Seeed XIAO ESP32S3, a XIAO ESP32C6, or
other devices.
• USB-C Cable: This is for connecting the board to your computer.
• Network: With internet access for downloading the necessary software.
• XIAO Expansion Board Base: This helps connect the Master Device to
the Physical World (optional).
Exercises
In this Lab, we will explore computer vision (CV) applications using the Seeed
Studio Grove Vision AI Module V2, a powerful yet compact device specifically
designed for embedded machine learning applications. Based on the Himax
WiseEye2 chip, this module is designed to enable AI capabilities on edge
devices, making it an ideal tool for Edge Machine Learning (ML) applications.
Introduction
With interfaces like IIC, UART, SPI, and Type-C, the Grove Vision AI (V2)
can be easily connected to devices such as XIAO, Raspberry Pi, BeagleBoard,
and ESP-based products for further development. For instance, integrating
Grove Vision AI V2 with one of the devices from the XIAO family makes it easy
to access the data resulting from inference on the device through the Arduino
IDE or MicroPython, and conveniently connect to the cloud or dedicated servers,
such as Home Assistant.
Camera Installation
Having the Grove Vision AI (V2) and camera ready, you can connect, for exam-
ple, a Raspberry Pi OV5647 Camera Module via the CSI cable.
Models can also be deployed using the SenseCraft Web Toolkit, a simplified
version of the SenseCraft AI Studio.
But in our case, let’s follow the steps below to start the SenseCraft-Web-
Toolkit:
Exploring CV AI models
Object Detection
Object detection is a pivotal technology in computer vision that focuses on
identifying and locating objects within digital images or video frames. Unlike
image classification, which categorizes an entire image into a single label, object
detection recognizes multiple objects within the image and determines their
precise locations, typically represented by bounding boxes. This capability
is crucial for a wide range of applications, including autonomous vehicles,
security, surveillance systems, and augmented reality, where understanding
the context and content of the visual environment is essential.
Common architectures that have set the benchmark in object detection include
the YOLO (You Only Look Once), SSD (Single Shot MultiBox Detector), FOMO
(Faster Objects, More Objects), and Faster R-CNN (Region-based Convolutional
Neural Networks) models.
Let’s choose one of the ready-to-use AI models, such as Person Detection,
which was trained using the Swift-YOLO algorithm.
Once the model is uploaded successfully, you can see the live feed from
the Grove Vision AI (V2) camera in the Preview area on the right. Also, the
inference details can be shown on the Serial Monitor by clicking on the [Device
Log] button at the top.
If we point the camera at an image with several people, we will get one box
for each person (object):
Power Consumption
The peak power consumption running this Swift-YOLO model was 410 milli-
watts.
Preview Settings
We can see that in the Settings, two options can be adjusted to
optimize the model's recognition accuracy.
• Confidence: Refers to the level of certainty or probability assigned to its
predictions by a model. This value determines the minimum confidence
level required for the model to consider a detection as valid. A higher con-
fidence threshold will result in fewer detections but with higher certainty,
while a lower threshold will allow more detections but may include some
false positives.
• IoU (Intersection over Union): Used to assess the accuracy of predicted bounding
boxes compared to ground truth bounding boxes. IoU measures the overlap between
the predicted and the ground truth bounding box (IoU = area of overlap ÷ area of
union) and is used to determine the accuracy of the object detection. The IoU
threshold sets the minimum IoU value required for a detection to be considered a
true positive. Adjusting this threshold can help in fine-tuning the model's
precision and recall.
Pose/Keypoint Detection
Pose or keypoint detection is a sophisticated area within computer vision that
focuses on identifying specific points of interest within an image or video
frame, often related to human bodies, faces, or other objects of interest. This
technology can detect and map out the various keypoints of a subject, such as
the joints on a human body or the features of a face, enabling the analysis of
postures, movements, and gestures. This has profound implications for various
applications, including augmented reality, human-computer interaction, sports
analytics, and healthcare monitoring, where understanding human motion and
activity is crucial.
Unlike general object detection, which identifies and locates objects, pose
detection drills down to a finer level of detail, capturing the nuanced positions
and orientations of specific parts. Leading architectures in this field include
OpenPose, AlphaPose, and PoseNet, each designed to tackle the challenges of
pose estimation with varying degrees of complexity and precision. Through
advancements in deep learning and neural networks, pose detection has become
increasingly accurate and efficient, offering real-time insights into the intricate
dynamics of subjects captured in visual data.
So, let’s explore this popular CV application, Pose/Keypoint Detection.
Stop the current model inference by pressing [Stop] in the Preview area.
Select the model and press [Send]. Once the model is uploaded successfully,
you can view the live feed from the Grove Vision AI (V2) camera in the Preview
area on the right, along with the inference details displayed in the Serial Monitor
(accessible by clicking the [Device Log] button at the top).
The YOLOV8 Pose model was trained using the COCO-Pose Dataset, which
contains 200K images labeled with 17 keypoints for pose estimation tasks.
Let’s look at a single screenshot of the inference (to simplify, let’s analyse an
image with a single person in it). We can note that we have two lines, one with
the inference performance in milliseconds (121 ms) and a second line with the
keypoints as below:
• 1 box of info, the same as we got with the object detection example (box
coordinates (113, 119, 67, 208), inference result (90), label (0)).
• 17 groups of 4 numbers represent the 17 "joints" of the body, where '0' is
the nose, '1' and '2' are the eyes, '15' and '16' are the feet, and so on.
Image Classification
Image classification is a foundational task within computer vision aimed at
categorizing entire images into one of several predefined classes. This process
involves analyzing the visual content of an image and assigning it a label from
a fixed set of categories based on the predominant object or scene it contains.
Image classification is crucial in various applications, ranging from organiz-
ing and searching through large databases of images in digital libraries and
social media platforms to enabling autonomous systems to comprehend their
surroundings. Common architectures that have significantly advanced the field
of image classification include Convolutional Neural Networks (CNNs), such as
AlexNet, VGGNet, and ResNet. These models have demonstrated remarkable
accuracy on challenging datasets, such as ImageNet, by learning hierarchical
representations of visual data.
As the cornerstone of many computer vision systems, image classification
drives innovation, laying the groundwork for more complex tasks like object
detection and image segmentation, and facilitating a deeper understanding of
visual data across various industries. So, let’s also explore this computer vision
application.
After the model is uploaded successfully, we can view the live feed from the
Grove Vision AI (V2) camera in the Preview area on the right, along with the
inference details displayed in the Serial Monitor (by clicking the [Device Log]
button at the top).
For example, [99, 1] means class: 1 (Person) with a score of 0.99. Since this
model is a binary classification, class 0 will be "No Person" (or Background).
The inference latency is 15 ms, or around 67 fps.
Power Consumption
To run the Mobilenet V2 0.35, the Grove Vision AI V2 had a peak current of
80mA at 5.24V, resulting in a power consumption of 420mW.
Running the same model on XIAO ESP32S3 Sense, the power consumption
was 523mW with a latency of 291ms.
An Image Classification Project
The Goal
The first step is always to define a goal. Let’s classify, for example, two simple
objects—for instance, a toy box and a toy wheel. We should also include a 3rd
class of images, background, where no object is in the scene.
Data Collection
Let’s create the classes, following, for example, an alphabetical order:
• Class 1: background
• Class 2: box
• Class 3: wheel
Select one of the classes and keep pressing the green button under the preview
area. The collected images will appear on the Image Samples Screen.
After collecting the images, review them and delete any incorrect ones.
Training
Confirm if the correct device is selected (Grove Vision AI V2) and press [Start
Training]
Test
After training, the inference result can be previewed.
Note that the model is not running on the device. We are, in fact,
only capturing the images with the device and performing a live
preview using the trained model, which is running in the Studio.
Deployment
Select the trained model under [Deploy to device] and select the Grove Vision AI
V2:
The Studio will redirect us to the Vision Workplace tab. Confirm the de-
ployment, select the appropriate Port, and connect it:
The model will be flashed into the device. After an automatic reset, the model
will start running on the device. On the Device Logger, we can see that the
inference has a latency of approximately 8 ms, corresponding to a frame rate
of 125 frames per second (FPS).
Also, note that it is possible to adjust the model’s confidence.
Conclusion
In this lab, we explored several computer vision (CV) applications using the
Seeed Studio Grove Vision AI Module V2, demonstrating its exceptional capa-
bilities as a powerful yet compact device specifically designed for embedded
machine learning applications.
Performance Excellence: The Grove Vision AI V2 demonstrated remarkable
performance across multiple computer vision tasks. With its Himax WiseEye2
chip featuring a dual-core Arm Cortex-M55 and integrated ARM Ethos-U55
neural network unit, the device delivered:
• Image Classification: 15 ms inference time (67 FPS)
Resources
SenseCraft AI Studio Instructions.
SenseCraft-Web-Toolkit website.
SenseCraft AI Studio
Himax AI Web Toolkit
Himax examples
Image Classification
In this Lab, we will explore Image Classification using the Seeed Studio Grove
Vision AI Module V2, a powerful yet compact device specifically designed for
embedded machine learning applications. Based on the Himax WiseEye2 chip,
this module is designed to enable AI capabilities on edge devices, making it an
ideal tool for Edge Machine Learning (ML) applications.
Introduction
So far, we have explored several computer vision models previously uploaded
by Seeed Studio or used the SenseCraft AI Studio for Image Classification,
without choosing a specific model. Let’s now develop our Image Classification
project from scratch, where we will select our data and model.
Below, we can see the project’s main steps and where we will work with
them:
Project Goal
The first step in any machine learning (ML) project is defining the goal. In this
case, the goal is to detect and classify two specific objects present in a single
image. For this project, we will use two small toys: a robot and a small Brazilian
parrot (named Periquito). Also, we will collect images of a background where
those two objects are absent.
Data Collection
With the Machine Learning project goal defined, dataset collection is the next
and most crucial step. Suppose your project utilizes images that are publicly
available on datasets, for example, to be used on a Person Detection project. In
that case, you can download the Wake Vision dataset for use in the project.
But, in our case, we define a project where the images do not exist publicly,
so we need to generate them. We can use a phone, computer camera, or other
devices to capture the photos, offline or connected to the Edge Impulse Studio.
If you want to use the Grove Vision AI V2 to capture your dataset, you can
use the SenseCraft AI Studio as we did in the previous Lab, or the camera_web_server
sketch, as we will describe later in the Postprocessing / Getting the
Video Stream section of this Lab.
In this Lab, we will use the SenseCraft AI Studio to collect the dataset.
Image Collection
Let’s create the classes, following, for example, an alphabetical order:
• Class 1: background
• Class 2: periquito
• Class 3: robot
Select one of the classes (note that a green line will be around the window)
and keep pressing the green button under the preview area. The collected
images will appear on the Image Samples Screen.
After collecting the images, review them and, if necessary, delete any incorrect
ones.
Collect around 50 images from each class. After you collect the three classes,
open the menu on each of them and select Export Data.
In the Download area of the Computer, we will get three zip files, each one
with its corresponding class name. Each Zip file contains a folder with the
images.
So, starting from the raw images, we will resize them to (96 × 96) pixels and feed
them to our Transfer Learning block:
Also select the Target device (Himax WiseEye2 (M55 400 MHz + U55)) in the
upper-right corner.
For example, you will get the highest accuracy with V2, 160x160 images, and
α=1.0. Of course, there is a trade-off. The higher the accuracy, the more memory
(around 1.3M RAM and 2.6M ROM) will be needed to run the model, implying
more latency. The smaller footprint will be obtained at another extreme with
MobileNet V1 and α=0.10 (around 53.2K RAM and 101K ROM).
For comparison, we will use the MobileNet V2 0.1 as our base
model (but a model with a greater alpha can be used here). The
final layer of our model, preceding the output layer, will have 8
neurons with a 10% dropout rate for preventing overfitting.
Another necessary technique to use with deep learning is data augmentation.
Data augmentation is a method that can help improve the accuracy of machine
learning models by creating additional artificial data. A data augmentation
system makes small, random changes to your training data during the training
process (such as flipping, cropping, or rotating the images).
Set the Hyperparameters:
• Epochs: 20
• Batch Size: 32
• Learning Rate: 0.0005
• Validation size: 20%
Training result:
The model profile predicts 146 KB of RAM and 187 KB of Flash, indicating
no problem for the Grove Vision AI (V2), which has almost 2.5 MB of internal
SRAM. Additionally, the Studio indicates a latency of around 4 ms.
Despite achieving 100% accuracy on the Validation set, when using
the data set aside for testing we confirmed an accuracy of 81% with
the Quantized (Int8) trained model. However, this is sufficient for our
purposes in this lab.
Model Deployment
On the Deployment tab, we should select: Seeed Grove Vision AI Module
V2 (Himax WiseEye2) and press [Build]. A ZIP file will be downloaded to
our computer.
We can flash the model following the instructions in the README.txt or use
the SenseCraft AI Studio. We will use the latter.
You should see the last model that was uploaded to the device. Select the
green button [Upload Model]. A pop-up window will ask for the model name,
the model file, and to enter the class names (objects). We should use labels
following alphabetical order: 0: background, 1: periquito, and 2: robot,
and then press [Send].
After a few seconds, the model will be uploaded (“flashed”) to our device,
and the camera image will appear in real-time on the Preview Sector. The
Classification result will be displayed under the image preview. It is also
possible to select the Confidence Threshold of your inference using the cursor
on Settings.
On the Device Logger, we can view the Serial Monitor, where we can observe
the latency, which is approximately 1 to 2 ms for pre-processing and 4 to 5 ms
for inference, aligning with the estimates made in Edge Impulse Studio.
Postprocessing
Now that we have the model uploaded to the board and working correctly,
classifying our images, let's connect a Master Device to export the inference
result to it and see the result completely offline (disconnected from the PC and,
for example, powered by a battery).
The image processing and model inference are processed locally in the Grove Vision
AI (V2), and we want the result to be output to the XIAO (Master Controller)
via I2C. For that, we will use the Arduino SSCMA library. This library's primary
purpose is to process the Grove Vision AI's data stream; it does not perform
model inference itself.
Step 1: Download the Arduino SSCMA library as a zip file from its GitHub:
Step 2: Install it in the Arduino IDE (sketch > Include Library > Add
.Zip Library).
Step 3: Now, connect the XIAO and Grove Vision AI (V2) via the socket (a
row of pins) located at the back of the device.
Step 6: In the Arduino IDE, select the Xiao board and the corresponding USB
port.
Since we want to stream the video to a webpage, we will use the XIAO
ESP32S3, which has Wi-Fi and enough memory to handle images. Select XIAO_-
ESP32S3 and the appropriate USB Port:
By default, the PSRAM is disabled. Open the Tools menu and, under PSRAM,
select "OPI PSRAM".
Open the address using a web browser. A Video App will be available. To
see only the video stream from the Grove Vision AI V2, press [Sample Only]
and [Start Stream].
If you want to create an image dataset, you can use this app, saving frames
of the video generated by the device. Pressing [Save Frame], the image will
be saved in the download area of our desktop.
Opening the App without selecting [Sample Only], the inference result
should appear on the video screen, but this does not happen for Image Classifi-
cation. For Object Detection or Pose Estimation, the result is embedded with
the video stream.
For example, if the model is a Person Detection using YoloV8:
#include <Seeed_Arduino_SSCMA.h>
SSCMA AI;

void setup()
{
    AI.begin();

    Serial.begin(115200);
    while (!Serial);
    Serial.println("Inferencing - Grove AI V2 / XIAO ESP32S3");

    pinMode(LED_BUILTIN, OUTPUT);   // configure the user LED pin
    turn_off_led();                 // start with the LED off
}

void loop()
{
    if (!AI.invoke()){
        Serial.println("\nInvoke Success");
        Serial.print("Latency [ms]: prepocess=");
        Serial.print(AI.perf().prepocess);
        Serial.print(", inference=");
        Serial.print(AI.perf().inference);
        Serial.print(", postpocess=");
        Serial.println(AI.perf().postprocess);

        int pred_index = AI.classes()[0].target;
        Serial.print("Result= Label: ");
        Serial.print(pred_index);
        Serial.print(", score=");
        Serial.println(AI.classes()[0].score);

        turn_on_led(pred_index);
    }
}

/**
 * @brief turn_off_led function - turn-off the User LED
 */
void turn_off_led(){
    digitalWrite(LED_BUILTIN, HIGH);
}

/**
 * @brief turn_on_led function used to turn on the User LED
 * @param[in] pred_index
 *            label 0: [0] ==> ALL OFF
 *            label 1: [1] ==> LED ON
 *            label 2: [2] ==> ALL OFF
 */
void turn_on_led(int pred_index){
    switch (pred_index){
        case 0:                              // background ==> LED OFF
            turn_off_led();
            break;
        case 1:                              // periquito  ==> LED ON
            digitalWrite(LED_BUILTIN, LOW);  // the XIAO user LED is active-low
            break;
        case 2:                              // robot      ==> LED OFF
            turn_off_led();
            break;
    }
}
The loop() function repeatedly calls the invoke() method to perform infer-
ence using the built-in algorithms of the Grove Vision AI Module V2. Upon
a successful inference, the sketch prints out performance metrics to the serial
monitor, including preprocessing, inference, and postprocessing times.
The sketch processes and prints out detailed information about the results of
the inference:
If the Robot is detected (Label: 2), the LED is OFF (the same for Background,
Label: 0):
Therefore, we can now power the Grove Vision AI V2 + XIAO ESP32S3 with
an external battery, and the inference result will be displayed by the LED
completely offline. The consumption is approximately 165 mA, or 825 mW.
It is also possible to send the result using Wi-Fi, BLE, or other com-
munication protocols available on the Master Device being used.
The idea is to modify the previous sketch to handle the three external LEDs.
GOAL: Whenever the image of a Periquito is detected, the LED Green will
be ON; if it is a Robot, the LED Yellow will be ON; if it is a Background, the
LED Red will be ON.
The image processing and model inference are processed locally in the Grove
Vision AI (V2), and we want the result to be output to the XIAO via I2C. For
that, we will use the Arduino SSCMA library again.
Here is the sketch to be used:
#include <Seeed_Arduino_SSCMA.h>
SSCMA AI;

// Example pin definitions for the three external LEDs
// (assumed values -- adjust them to the GPIOs you actually wired)
#define LEDR D0   // Red    ==> background
#define LEDY D1   // Yellow ==> robot
#define LEDG D2   // Green  ==> periquito

void setup()
{
    AI.begin();

    Serial.begin(115200);
    while (!Serial);
    Serial.println("Inferencing - Grove AI V2 / XIAO ESP32S3");

    // configure the external LED pins and start with all of them off
    pinMode(LEDR, OUTPUT);
    pinMode(LEDY, OUTPUT);
    pinMode(LEDG, OUTPUT);
    turn_off_leds();
}

void loop()
{
    if (!AI.invoke()){
        Serial.println("\nInvoke Success");
        Serial.print("Latency [ms]: prepocess=");
        Serial.print(AI.perf().prepocess);
        Serial.print(", inference=");
        Serial.print(AI.perf().inference);
        Serial.print(", postpocess=");
        Serial.println(AI.perf().postprocess);

        int pred_index = AI.classes()[0].target;
        Serial.print("Result= Label: ");
        Serial.print(pred_index);
        Serial.print(", score=");
        Serial.println(AI.classes()[0].score);

        turn_on_leds(pred_index);
    }
}

/**
 * @brief turn_off_leds function - turn-off all LEDs
 */
void turn_off_leds(){
    digitalWrite(LEDR, LOW);
    digitalWrite(LEDY, LOW);
    digitalWrite(LEDG, LOW);
}

/**
 * @brief turn_on_leds function used to turn on a specific LED
 * @param[in] pred_index
 *            label 0: background ==> Red LED ON
 *            label 1: periquito  ==> Green LED ON
 *            label 2: robot      ==> Yellow LED ON
 */
void turn_on_leds(int pred_index){
    turn_off_leds();
    switch (pred_index){
        case 0: digitalWrite(LEDR, HIGH); break;  // background
        case 1: digitalWrite(LEDG, HIGH); break;  // periquito
        case 2: digitalWrite(LEDY, HIGH); break;  // robot
    }
}
We should connect the Grove Vision AI V2 to the XIAO using its I2C
Grove connector. For the XIAO, we will use an Expansion Board for convenience
(although it is possible to connect the I2C directly to the XIAO's pins). We will
power the boards using the USB-C connector, but a battery can also be used.
Conclusion
In this lab, we’ve explored the complete process of developing an image classi-
fication system using the Seeed Studio Grove Vision AI Module V2 powered by
the Himax WiseEye2 chip. We’ve walked through every stage of the machine
learning workflow, from defining our project goals to deploying a working
model with real-world interactions.
The Grove Vision AI V2 has demonstrated impressive performance, with in-
ference times of just 4-5 ms, dramatically outperforming other common tinyML
platforms. Our benchmark comparison showed it to be approximately 14 times
faster than Arm Cortex-M7 devices and over 100 times faster than an Xtensa LX6
(ESP-CAM). Even when compared to a Raspberry Pi Zero 2 W, the Ethos-U55-based
architecture delivered nearly twice the speed while consuming less power.
Through this project, we’ve seen how transfer learning enables us to achieve
good classification results with a relatively small dataset of custom images.
The MobileNetV2 model with an alpha of 0.1 provided an excellent balance of
accuracy and efficiency for our three-class problem, requiring only 146 KB of
RAM and 187 KB of Flash memory, well within the capabilities of the Grove
Vision AI Module V2’s 2.4 MB internal SRAM.
We also explored several deployment options, from viewing inference results
through the SenseCraft AI Studio to creating a standalone system with visual
feedback using LEDs. The ability to stream video to a web browser and process
inference results locally demonstrates the versatility of edge AI systems for
real-world applications.
The power consumption of our final system remained impressively low,
ranging from approximately 70mA (0.4W) for basic inference to 240mA (1.2W)
when driving external components. This efficiency makes the Grove Vision AI
Module V2 an excellent choice for battery-powered applications where power
consumption is critical.
This lab has demonstrated that sophisticated computer vision tasks can
now be performed entirely at the edge, without reliance on cloud services or
powerful computers. With tools like Edge Impulse Studio and SenseCraft AI
Studio, the development process has become accessible even to those without
extensive machine learning expertise.
As edge AI technology continues to evolve, we can expect even more powerful
capabilities from compact, energy-efficient devices like the Grove Vision AI
Module V2, opening up new possibilities for smart sensors, IoT applications,
and embedded intelligence in everyday objects.
Resources
Collecting Images with SenseCraft AI Studio.
Edge Impulse Studio Project
SenseCraft AI Studio - Vision Workplace (Deploy Models)
Other Himax examples
Arduino Sketches
Object Detection
Raspberry Pi
These labs offer invaluable hands-on experience with machine learning systems,
leveraging the versatility and accessibility of the Raspberry Pi platform. Unlike
working with large-scale models that demand extensive cloud resources, these
exercises allow you to directly interact with hardware and software in a com-
pact yet powerful edge computing environment. You’ll gain practical insights
into deploying AI at the edge by utilizing Raspberry Pi’s capabilities, from the
efficient Pi Zero to the more robust Pi 4 or Pi 5 models. This approach provides
a tangible understanding of the challenges and opportunities in implement-
ing machine learning solutions in resource-constrained settings. While we’re
working at a smaller scale, the principles and techniques you’ll learn are funda-
mentally similar to those used in larger systems. The Raspberry Pi’s ability to
run a whole operating system and its extensive GPIO capabilities allow for a
rich learning experience that bridges the gap between theoretical knowledge
and real-world application. Through these labs, you’ll grasp the intricacies
of EdgeML and develop skills applicable to a wide range of AI deployment
scenarios.
Pre-requisites
• Raspberry Pi: Ensure you have at least one of these boards: the Raspberry
Pi Zero 2 W, Raspberry Pi 4, or 5 for the Vision Labs, and the Raspberry Pi 5
for the GenAI labs.
Setup
• Setup Raspberry Pi
Exercises
This chapter will guide you through setting up Raspberry Pi Zero 2 W (Raspi-
Zero) and Raspberry Pi 5 (Raspi-5) models. We’ll cover hardware setup, operat-
ing system installation, initial configuration, and tests.
The general instructions for the Raspi-5 also apply to the older Rasp-
berry Pi versions, such as the Raspi-3 and Raspi-4.
Overview
The Raspberry Pi is a powerful and versatile single-board computer that has
become an essential tool for engineers across various disciplines. Developed by
the Raspberry Pi Foundation, these compact devices offer a unique combination
of affordability, computational power, and extensive GPIO (General Purpose
Input/Output) capabilities, making them ideal for prototyping, embedded
systems development, and advanced engineering projects.
Key Features
1. Computational Power: Despite their small size, Raspberry Pis offer
significant processing capabilities, with the latest models featuring multi-
core ARM processors and up to 8 GB of RAM.
2. GPIO Interface: The 40-pin GPIO header allows direct interaction with
sensors, actuators, and other electronic components, facilitating hardware-
software integration projects.
3. Extensive Connectivity: Built-in Wi-Fi, Bluetooth, Ethernet, and multiple
USB ports enable diverse communication and networking projects.
4. Low-Level Hardware Access: Raspberry Pis provides access to interfaces
like I2C, SPI, and UART, allowing for detailed control and communication
with external devices.
5. Real-Time Capabilities: With proper configuration, Raspberry Pis can
be used for soft real-time applications, making them suitable for control
systems and signal processing tasks.
6. Power Efficiency: Low power consumption enables battery-powered and
energy-efficient designs, especially in models like the Pi Zero.
Engineering Applications
1. Embedded Systems Design: Develop and prototype embedded systems
for real-world applications.
2. IoT and Networked Devices: Create interconnected devices and explore
protocols like MQTT, CoAP, and HTTP/HTTPS.
This tutorial will guide you through setting up the most common Raspberry
Pi models, enabling you to start on your machine learning project quickly. We’ll
cover hardware setup, operating system installation, and initial configuration,
focusing on preparing your Pi for Machine Learning applications.
Hardware Overview
Raspberry Pi Zero 2W
Raspberry Pi 5
• Processor:
– Pi 5: Quad-core 64-bit Arm Cortex-A76 CPU @ 2.4 GHz
– Pi 4: Quad-core Cortex-A72 (ARM v8) 64-bit SoC @ 1.5 GHz
• RAM: 2 GB, 4 GB, or 8 GB options (8 GB recommended for AI tasks)
• Wireless: Dual-band 802.11ac wireless, Bluetooth 5.0
• Ports: 2 × micro HDMI ports, 2 × USB 3.0 ports, 2 × USB 2.0 ports, CSI
camera port, DSI display port
• Power: 5 V DC via USB-C connector (3A)
Key features:
1. Lightweight: Tailored to run efficiently on the Pi's hardware.
2. Versatile: Supports a wide range of applications and programming lan-
guages.
3. Open-source: Allows for customization and community-driven improve-
ments.
4. GPIO support: Enables interaction with sensors and other hardware
through the Pi’s pins.
5. Regular updates: Continuously improved for performance and security.
Embedded Linux on the Raspberry Pi provides a full-featured operating
system in a compact package, making it ideal for projects ranging from simple
IoT devices to more complex edge machine-learning applications. Its compati-
bility with standard Linux tools and libraries makes it a powerful platform for
development and experimentation.
Installation
To use the Raspberry Pi, we will need an operating system. By default, Rasp-
berry Pi checks for an operating system on any SD card inserted in the slot, so
we should install an operating system using Raspberry Pi Imager.
Raspberry Pi Imager is a tool for downloading and writing images on macOS,
Windows, and Linux. It includes many popular operating system images for
Raspberry Pi. We will also use the Imager to preconfigure credentials and
remote access settings.
Follow the steps to install the OS in your Raspi.
1. Download and install the Raspberry Pi Imager on your computer.
2. Insert a microSD card into your computer (a 32 GB SD card is recommended).
Due to its reduced SDRAM (512 MB), the recommended OS for the
Raspi-Zero is the 32-bit version. However, to run some machine
learning models, such as YOLOv8 from Ultralytics, we should use
the 64-bit version. Although the Raspi-Zero can run a desktop, we will
choose the LITE version (no Desktop) to reduce the RAM needed
for regular operation.
• For Raspi-5: We can select the full 64-bit version, which includes a desktop:
Raspberry Pi OS (64-bit)
6. Click on Next and then the gear icon to access advanced options.
7. Set the hostname, the Raspi username and password, configure WiFi and
enable SSH (Very important!)
Initial Configuration
1. Insert the microSD card into your Raspberry Pi.
2. Connect power to boot up the Raspberry Pi.
3. Please wait for the initial boot process to complete (it may take a few
minutes).
You can find the most common Linux commands to be used with
the Raspi here or here.
Remote Access
SSH Access
The easiest way to interact with the Raspi-Zero is via SSH ("Headless"). You
can use a Terminal (macOS/Linux), PuTTY (Windows), or any other SSH client.
1. Find your Raspberry Pi’s IP address (for example, check your router).
2. On your computer, open a terminal and connect via SSH:
ssh username@[raspberry_pi_ip_address]
Alternatively, if you do not have the IP address, you can try connecting by hostname:
ssh <username>@<hostname>.local
for example, ssh [email protected], ssh [email protected], etc.
Once connected, you are interacting remotely with your Raspi. It is good
practice to update/upgrade the system regularly. For that, you should
run:
sudo apt-get update
sudo apt upgrade
You should confirm the Raspi IP address. On the terminal, you can use:
hostname -I
You can use any text editor. In the same terminal, an option is the
nano.
To copy the file named test.txt from your personal computer to a user’s
home folder on your Raspberry Pi, run the following command from the di-
rectory containing test.txt, replacing the <username> placeholder with the
username you use to log in to your Raspberry Pi and the <pi_ip_address>
placeholder with your Raspberry Pi’s IP address:
Note that ~/ means that we will copy the file to the home directory of our
user on the Raspi. You can choose any folder on your Raspi, but you should
create the folder before you run scp, since scp won't create folders
automatically.
For example, let’s transfer the file test.txt to the ROOT of my Raspi-zero,
which has an IP of 192.168.4.210:
I use a different profile to differentiate the terminals. The above action hap-
pens on your computer. Now, let’s go to our Raspi (using the SSH) and check
if the file is there:
Copy files from your Raspberry Pi. To copy a file named test.txt from a
user’s home directory on a Raspberry Pi to the current directory on another
computer, run the following command on your Host Computer:
$ scp <username>@<pi_ip_address>:myfile.txt .
For example:
On the Raspi, let’s create a copy of the file with another name:
cp test.txt test_2.txt
scp [email protected]:test_2.txt .
Transferring files using FTP, such as FileZilla FTP Client, is also possible. Follow
the instructions, install the program for your Desktop OS, and use the Raspi IP
address as the Host. For example:
sftp://192.168.4.210
and enter your Raspi username and password. Pressing Quickconnect will
open two windows, one for your host computer desktop (right) and another
for the Raspi (left).
Using htop, a cross-platform interactive process viewer, you can easily monitor
the resources running on your Raspi, such as the list of processes, the running
CPUs, and the memory used in real time. To launch htop, enter the command
on the terminal:
htop
Increasing SWAP Memory
Regarding memory, among the devices in the Raspberry Pi family, the Raspi-
Zero has the smallest amount of SDRAM (512 MB), compared to a selection of 2 GB
to 8 GB on the Raspi 4 or 5. For any Raspi, it is possible to increase the memory
available to the system with “Swap.” Swap memory, also known as swap space,
is a technique used in computer operating systems to temporarily store data
from RAM (Random Access Memory) on the SD card when the physical RAM
is fully utilized. This allows the operating system (OS) to continue running
even when RAM is full, which can prevent system crashes or slowdowns.
Swap memory benefits devices with limited RAM, such as the Raspi-Zero.
Increasing swap can help run more demanding applications or processes, but
it’s essential to balance this with the potential performance impact of frequent
disk access.
By default, the Rapi-Zero’s SWAP (Swp) memory is only 100 MB, which is
very small for running some more complex and demanding Machine Learning
applications (for example, YOLO). Let’s increase it to 2 MB:
First, turn off swap-file:
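sudo dphys-swapfile swapoff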
Next, you should open and change the file /etc/dphys-swapfile. For that,
we will use the nano:
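sudo nano /etc/dphys-swapfile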
CONF_SWAPSIZE=2000
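After saving the file (CTRL+O, ENTER, CTRL+X), re-initialize and re-enable the swap file and reboot (the usual dphys-swapfile sequence):
sudo dphys-swapfile setup
sudo dphys-swapfile swapon
sudo reboot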
When your device is rebooted (you will need to SSH in again), you will
notice that the maximum swap memory value shown at the top of htop is now
close to 2 GB (in my case, 1.95 GB).
To keep the htop running, you should open another terminal window
to interact continuously with your Raspi.
Installing a Camera
1. Shut down your Raspi:
sudo shutdown -h now
2. Connect the USB Webcam (USB Camera Module 30 fps, 1280 × 720) to
your Raspi (In this example, I am using the Raspi-Zero, but the instruc-
tions work for all Raspis).
lsusb
fswebcam test_image.jpg
6. Since we are using SSH to connect to our Rapsi, we must transfer the
image to our main computer so we can view it. We can use FileZilla or
SCP for this:
Open a terminal on your host computer and run:
scp [email protected]:~/test_image.jpg .
7. If the image quality isn’t satisfactory, you can adjust various settings; for
example, define a resolution that is suitable for YOLO (640 × 640):
Video Streaming
For stream video (which is more resource-intensive), we can install and use
mjpg-streamer:
First, install Git:
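sudo apt install git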
We can then access the stream by opening a web browser and navigating to:
http://<your_pi_ip_address>:8080. In my case: http://192.168.4.210:8080
We should see a webpage with options to view the stream. Click on the link
that says “Stream” or try accessing:
http://<raspberry_pi_ip_address>:8080/?action=stream
Any camera module will work on the Raspberry Pis, but for that, the config.txt
file must be updated:
At the bottom of the file, for example, to use the 5 MP Arducam OV5647
camera, add the line:
dtoverlay=ov5647,cam0
Or for the v2 module, which has the 8MP Sony IMX219 camera:
dtoverlay=imx219,cam0
Save the file (CTRL+O, [ENTER], CTRL+X) and reboot the Raspi:
sudo reboot
libcamera-hello --list-cameras
Let’s capture a jpeg image with a resolution of 640 × 480 for testing and save
it to a file named test_cli_camera.jpg
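A typical command for this (assuming the libcamera-jpeg tool from the Raspberry Pi camera stack) is:
libcamera-jpeg -o test_cli_camera.jpg --width 640 --height 480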
If we want to see the saved file, we can use ls -l, which lists the current
directory's content in long format. As before, we can use scp to view the image:
Running the Raspi Desktop remotely
3. Once installed, you should confirm the Raspi IP address. For example,
on the terminal, you can use:
hostname -I
Model-Specific Considerations
Raspberry Pi Zero (Raspi-Zero)
• Limited processing power, best for lightweight projects
• It is better to use a headless setup (SSH) to conserve resources.
• Consider increasing swap space for memory-intensive tasks.
• It can be used for Image Classification and Object Detection Labs but not
for the LLM (SLM).
Overview
Image classification is a fundamental task in computer vision that involves
categorizing an image into one of several predefined classes. It’s a cornerstone
of artificial intelligence, enabling machines to interpret and understand visual
information in a way that mimics human perception.
mkdir Documents
cd Documents/
mkdir TFLITE
cd TFLITE/
mkdir IMG_CLASS
cd IMG_CLASS
mkdir models
cd models
To run Jupyter Notebook, run the command (change the IP address for yours):
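A typical invocation (using the example IP address used earlier in this setup) is:
jupyter notebook --ip=192.168.4.210 --no-browser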
On the terminal, you can see the local URL address to open the notebook:
You can access it from another device by entering the Raspberry Pi’s IP
address and the provided token in a web browser (you can copy the token from
the terminal).
Define your working directory in the Raspi and create a new Python 3 note-
book.
print("NumPy:", np.__version__)
print("Pillow:", Image.__version__)
You can create the Python script using nano on the terminal, saving it with
CTRL+O, ENTER, CTRL+X.
Let’s start a new notebook to follow all the steps to classify one image:
import time
import numpy as np
import matplotlib.pyplot as plt
from PIL import Image
import tflite_runtime.interpreter as tflite
model_path = "./models/mobilenet_v2_1.0_224_quant.tflite"
interpreter = tflite.Interpreter(model_path=model_path)
interpreter.allocate_tensors()
input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()
Input details will give us information about how the model should be fed
with an image. The shape of (1, 224, 224, 3) informs us that an image with
dimensions (224 × 224 × 3) should be input one by one (Batch Dimension: 1).
The output details show that the inference will result in an array of 1,001
integer values. Those values result from the image classification, where each
value is the probability of that specific label being related to the image.
input_dtype = input_details[0]['dtype']
input_dtype
dtype('uint8')
This shows that the input image should be raw pixels (0 - 255).
Let’s get a test image. You can transfer it from your computer or download
one for testing. Let’s first create a folder under our working directory:
mkdir images
cd images
wget https://upload.wikimedia.org/wikipedia/commons/3/3a/Cat03.jpg
# Load the image
img_path = "./images/Cat03.jpg"
img = Image.open(img_path)
That shows us that the image is an RGB image with a width of 1600 and a
height of 1600 pixels. So, to use our model, we should reshape it to (224, 224, 3)
and add a batch dimension of 1, as defined in input details: (1, 224, 224, 3). The
inference result, as shown in output details, will be an array with a 1001 size,
as shown below:
So, let’s reshape the image, add the batch dimension, and see the result:
img = img.resize((input_details[0]['shape'][1],
input_details[0]['shape'][2]))
input_data = np.expand_dims(img, axis=0)
input_data.shape
input_data.dtype
dtype('uint8')
The input data dtype is ‘uint8’, which is compatible with the dtype expected
for the model.
Using the input_data, let’s run the interpreter and get the predictions (out-
put):
interpreter.set_tensor(input_details[0]['index'], input_data)
interpreter.invoke()
predictions = interpreter.get_tensor(output_details[0]
['index'])[0]
The prediction is an array with 1001 elements. Let’s get the Top-5 indices
where their elements have high values:
top_k_results = 5
top_k_indices = np.argsort(predictions)[::-1][:top_k_results]
top_k_indices
def load_labels(filename):
with open(filename, 'r') as f:
return [line.strip() for line in f.readlines()]
And get the list, printing the labels associated with the indexes:
labels_path = "./models/labels.txt"
labels = load_labels(labels_path)
print(labels[286])
print(labels[283])
print(labels[282])
print(labels[288])
print(labels[479])
As a result, we have:
Egyptian cat
tiger cat
tabby
lynx
carton
At least the four top indices are related to felines. The prediction content is the
probability associated with each one of the labels. As we saw in the output details,
those values are quantized and should be dequantized, with softmax then applied.
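The dequantization and softmax step is not shown above; a minimal sketch, using the scale and zero point stored in the output tensor's quantization parameters, could be:
# Dequantize the raw uint8 scores and apply softmax to get probabilities
# (variable names follow the notebook above).
scale, zero_point = output_details[0]['quantization']
dequantized = (predictions.astype(np.float32) - zero_point) * scale

exp_scores = np.exp(dequantized - np.max(dequantized))  # subtract max for numerical stability
probabilities = exp_scores / np.sum(exp_scores)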
print (probabilities[286])
print (probabilities[283])
print (probabilities[282])
print (probabilities[288])
print (probabilities[479])
0.27741462
0.3732285
0.16919471
0.10319158
0.023410844
For clarity, let’s create a function to relate the labels with the probabilities:
for i in range(top_k_results):
print("\t{:20}: {}%".format(
labels[top_k_indices[i]],
(int(probabilities[top_k_indices[i]]*100))))
# Preprocess
img = img.resize((input_details[0]['shape'][1],
input_details[0]['shape'][2]))
input_data = np.expand_dims(img, axis=0)
# Inference on Raspi-Zero
interpreter.set_tensor(input_details[0]['index'], input_data)
interpreter.invoke()
print("\n\t[PREDICTION] [Prob]\n")
for i in range(top_k_results):
print("\t{:20}: {}%".format(
labels[top_k_indices[i]],
(int(probabilities[top_k_indices[i]]*100))))
Let's get a TFLite model trained from scratch. For that, you can follow the
Notebook:

The CNN trained model (cifar10_model.keras) had a size of 2.0 MB. Using the
TFLite Converter, the resulting model cifar10.tflite ended up with around 674 KB
(around 1/3 of the original size).
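The conversion itself is done on the training side (for example, in the Colab
notebook); a minimal sketch with the TFLite Converter, assuming the file names
above, would be:

import tensorflow as tf

# Load the trained Keras model and convert it to TFLite
model = tf.keras.models.load_model('cifar10_model.keras')
converter = tf.lite.TFLiteConverter.from_keras_model(model)
tflite_model = converter.convert()

# Save the converted model
with open('cifar10.tflite', 'wb') as f:
    f.write(tflite_model)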
Installing Picamera2
Picamera2, a Python library for interacting with Raspberry Pi’s camera, is based
on the libcamera camera stack, and the Raspberry Pi foundation maintains it.
The Picamera2 library is supported on all Raspberry Pi models, from the Pi
Zero to the RPi 5. It is already installed system-wide on the Raspi, but we
should make it accessible within the virtual environment.
1. First, activate the virtual environment if it’s not already activated:
source ~/tflite/bin/activate
2. Now, let’s create a .pth file in your virtual environment to add the system
site-packages path:
echo "/usr/lib/python3/dist-packages" > \
  $VIRTUAL_ENV/lib/python3.11/site-packages/system_site_packages.pth
This makes the system-wide packages visible inside the virtual environment. We
can confirm that the picamera2 module is now reachable from the environment by
printing its file location:
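For example, a quick check from inside the activated environment:

python3 -c "import picamera2; print(picamera2.__file__)"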
/home/mjrovai/tflite/lib/python3.11/site-packages/\
picamera2/__init__.py
>>> print(Picamera2.global_camera_info())
# Capture an image
picam2.capture_file("usb_camera_image.jpg")
print("Image captured and saved as 'usb_camera_image.jpg'")
Use the Nano text editor, the Jupyter Notebook, or any other editor. Save this
as a Python script (e.g., capture_image.py) and run it. This should capture an
image from your camera and save it as “usb_camera_image.jpg” in the same
directory as your script.
Image Classification 1409
If the Jupyter is open, you can see the captured image on your computer.
Otherwise, transfer the file from the Raspi to your computer.
If you are working with a Raspi-5 with a full desktop environment, you can
open the file directly on the device.
The Goal
The first step in any ML project is to define its goal. In this case, it is to detect
and classify two specific objects present in one image. For this project, we will
use two small toys: a robot and a small Brazilian parrot (named Periquito). We
will also collect images of a background where those two objects are absent.
Data Collection
Once we have defined our Machine Learning project goal, the next and most
crucial step is collecting the dataset. We can use a phone for the image capture,
but we will use the Raspi here. Let’s set up a simple web server on our Raspberry
Pi to view the QVGA (320 x 240) captured images in a browser.
1. First, let’s install Flask, a lightweight web framework for Python:
pip3 install flask
2. Let’s create a new Python script combining image capture with a web
server. We’ll call it get_img_data.py:
app = Flask(__name__)
# Global variables
base_dir = "dataset"
picam2 = None
frame = None
frame_lock = threading.Lock()
capture_counts = {}
current_label = None
shutdown_event = threading.Event()
def initialize_camera():
global picam2
picam2 = Picamera2()
config = picam2.create_preview_configuration(
main={"size": (320, 240)}
)
picam2.configure(config)
picam2.start()
time.sleep(2) # Wait for camera to warm up
def get_frame():
global frame
while not shutdown_event.is_set():
stream = io.BytesIO()
picam2.capture_file(stream, format='jpeg')
with frame_lock:
frame = stream.getvalue()
time.sleep(0.1) # Adjust as needed for smooth preview
def generate_frames():
while not shutdown_event.is_set():
with frame_lock:
if frame is not None:
yield (b'--frame\r\n'
b'Content-Type: image/jpeg\r\n\r\n' +
frame + b'\r\n')
time.sleep(0.1) # Adjust as needed for smooth streaming
def shutdown_server():
shutdown_event.set()
if picam2:
picam2.stop()
# Give some time for other threads to finish
time.sleep(2)
# Send SIGINT to the main process
os.kill(os.getpid(), signal.SIGINT)
@app.route('/capture')
def capture_page():
return render_template_string('''
<!DOCTYPE html>
<html>
<head>
<title>Dataset Capture</title>
<script>
var shutdownInitiated = false;
function checkShutdown() {
if (!shutdownInitiated) {
fetch('/check_shutdown')
.then(response => response.json())
.then(data => {
if (data.shutdown) {
shutdownInitiated = true;
document.getElementById(
'video-feed').src = '';
document.getElementById(
'shutdown-message')
.style.display = 'block';
}
});
}
}
            setInterval(checkShutdown, 1000); // Check every second
</script>
</head>
<body>
<h1>Dataset Capture</h1>
<p>Current Label: {{ label }}</p>
<p>Images captured for this label: {{ capture_count
}}</p>
<img id="video-feed" src="{{ url_for('video_feed')
}}" width="640"
height="480" />
<div id="shutdown-message" style="display: none;
color: red;">
Capture process has been stopped.
You can close this window.
</div>
<form action="/capture_image" method="post">
<input type="submit" value="Capture Image">
</form>
<form action="/stop" method="post">
@app.route('/video_feed')
def video_feed():
    return Response(generate_frames(),
                    mimetype='multipart/x-mixed-replace; boundary=frame')
@app.route('/capture_image', methods=['POST'])
def capture_image():
global capture_counts
if current_label and not shutdown_event.is_set():
capture_counts[current_label] += 1
timestamp = time.strftime("%Y%m%d-%H%M%S")
filename = f"image_{timestamp}.jpg"
full_path = os.path.join(base_dir, current_label,
filename)
picam2.capture_file(full_path)
return redirect(url_for('capture_page'))
@app.route('/stop', methods=['POST'])
def stop():
summary = render_template_string('''
<!DOCTYPE html>
<html>
<head>
<title>Dataset Capture - Stopped</title>
</head>
<body>
<h1>Dataset Capture Stopped</h1>
<p>The capture process has been stopped.
You can close this window.</p>
<p>Summary of captures:</p>
<ul>
{% for label, count in capture_counts.items() %}
return summary
@app.route('/check_shutdown')
def check_shutdown():
return {'shutdown': shutdown_event.is_set()}
if __name__ == '__main__':
initialize_camera()
threading.Thread(target=get_frame, daemon=True).start()
app.run(host='0.0.0.0', port=5000, threaded=True)
python3 get_img_data.py
This Python script creates a web-based interface for capturing and organizing
image datasets using a Raspberry Pi and its camera. It’s handy for machine
learning projects that require labeled image data.
Key Features:
1. Web Interface: Accessible from any device on the same network as the
Raspberry Pi.
2. Live Camera Preview: This shows a real-time feed from the camera.
3. Labeling System: Allows users to input labels for different categories of
images.
4. Organized Storage: Automatically saves images in label-specific subdi-
rectories.
5. Per-Label Counters: Keeps track of how many images are captured for
each label.
6. Summary Statistics: Provides a summary of captured images when
stopping the capture process.
Main Components:
1. Flask Web Application: Handles routing and serves the web interface.
2. Picamera2 Integration: Controls the Raspberry Pi camera.
3. Threaded Frame Capture: Ensures smooth live preview.
4. File Management: Organizes captured images into labeled directories.
Key Functions:
• initialize_camera(): Sets up the Picamera2 instance.
• get_frame(): Continuously captures frames for the live preview.
• generate_frames(): Yields frames for the live video feed.
• shutdown_server(): Sets the shutdown event, stops the camera, and
shuts down the Flask server
• index(): Handles the label input page.
• capture_page(): Displays the main capture interface.
• video_feed(): Shows a live preview to position the camera
• capture_image(): Saves an image with the current label.
• stop(): Stops the capture process and displays a summary.
Usage Flow:
1. Start the script on your Raspberry Pi.
2. Access the web interface from a browser.
3. Enter a label for the images you want to capture and press Start Capture.
Technical Notes:
• The script uses threading to handle concurrent frame capture and web
serving.
• Images are saved with timestamps in their filenames for uniqueness.
• The web interface is responsive and can be accessed from mobile devices.
Customization Possibilities:
Dataset
We will walk through four main steps using the EI Studio (or Studio). These
steps are crucial in preparing our model for use on the Raspi: Dataset, Impulse,
Tests, and Deploy (on the Edge Device, in this case, the Raspi).
The Studio allows you to explore your data, showing a complete view of all
the data in your project. You can clear, inspect, or change labels by clicking on
individual data items. In our case, a straightforward project, the data seems
OK.
By leveraging these learned features, we can train a new model for our specific
task with less data and fewer computational resources, and still achieve competitive
accuracy.
Image Pre-Processing
All the input QVGA/RGB565 images will be converted to 76,800 features (160 ×
160 × 3).
Press Save parameters and select Generate features in the next tab.
Model Design
MobileNet is a family of efficient convolutional neural networks designed for
mobile and embedded vision applications. The key features of MobileNet are:
1. Lightweight: Optimized for mobile devices and embedded systems with
limited computational resources.
2. Speed: Fast inference times, suitable for real-time applications.
3. Accuracy: Maintains good accuracy despite its compact size.
MobileNetV2, introduced in 2018, improves the original MobileNet architec-
ture. Key features include:
1. Inverted Residuals: Inverted residual structures are used where shortcut
connections are made between thin bottleneck layers.
2. Linear Bottlenecks: Removes non-linearities in the narrow layers to pre-
vent the destruction of information.
Model Training
Exposure to these variations during training can help prevent your model
from taking shortcuts by “memorizing” superficial clues in your training data,
meaning it may better reflect the deep underlying patterns in your dataset.
The final dense layer of our model will have 0 neurons with a 10% dropout
for overfitting prevention. Here is the Training result:
Recommendation:
1. If our application doesn’t require detecting tiny details and can tolerate
some loss in accuracy, reducing the input size is often the most effective
way to speed up inference.
2. Reducing alpha might be preferable if maintaining the ability to detect
fine details is crucial or if you need a more balanced trade-off between
speed and accuracy.
3. For best results, you might want to experiment with both:
• Try MobileNet V2 with input sizes like 160 × 160 or 96 × 96
• Experiment with alpha values like 1.0, 0.75, 0.5 or 0.35.
4. Always benchmark the different configurations on your specific hardware
and with your particular dataset to find the optimal balance for your use
case.
Model Testing
Now, you should take the data set aside at the start of the project and run the
trained model using it as input. Again, the result is excellent (92.22%).
Transfer the model from your computer to the Raspi (./models), for example,
using FileZilla. Also, capture some images for inference (./images).
Import the needed libraries:
import time
import numpy as np
import matplotlib.pyplot as plt
from PIL import Image
import tflite_runtime.interpreter as tflite
img_path = "./images/robot.jpg"
model_path = "./models/ei-raspi-img-class-int8-quantized-\
model.tflite"
labels = ['background', 'periquito', 'robot']
Note that the models trained on the Edge Impulse Studio will output
values with index 0, 1, 2, etc., where the actual labels will follow an
alphabetic order.
Load the model, allocate the tensors, and get the input and output tensor
details:
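This follows the same pattern used earlier in this lab:

interpreter = tflite.Interpreter(model_path=model_path)
interpreter.allocate_tensors()

input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()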
One important difference to note is that the dtype of the input details of the
model is now int8, which means that the input values go from –128 to +127,
while each pixel of our image goes from 0 to 255. This means that we should
pre-process the image to match it. We can check here:
input_dtype = input_details[0]['dtype']
input_dtype
numpy.int8
So, let’s open the image and show it:
img = Image.open(img_path)
plt.figure(figsize=(4, 4))
plt.imshow(img)
plt.axis('off')
plt.show()
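Before that check, the image must be resized and quantized to int8 using the
input tensor's scale and zero point; a minimal sketch, mirroring the generic
function shown later in this section, is:

# Resize to the model's expected input size
img = img.resize((input_details[0]['shape'][1],
                  input_details[0]['shape'][2]))

# Quantize the 0-255 pixels to int8 using the model's scale and zero point
scale, zero_point = input_details[0]['quantization']
img_array = np.array(img, dtype=np.float32) / 255.0
img_array = (img_array / scale + zero_point).clip(-128, 127).astype(np.int8)
input_data = np.expand_dims(img_array, axis=0)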
Checking the input data, we can verify that the input tensor is compatible
with what is expected by the model:
input_data.shape, input_data.dtype
Now, it is time to perform the inference. Let’s also calculate the latency of
the model:
# Inference on Raspi-Zero
start_time = time.time()
interpreter.set_tensor(input_details[0]['index'], input_data)
interpreter.invoke()
end_time = time.time()
inference_time = (end_time - start_time) * 1000  # Convert to milliseconds
print ("Inference time: {:.1f}ms".format(inference_time))
The model will take around 125 ms to perform the inference on the Raspi-Zero,
which is 3 to 4 times longer than on a Raspi-5.
Now, we can get the output labels and probabilities. It is also important
to note that the model trained on the Edge Impulse Studio has a softmax in
its output (different from the original MobileNet V2), and we should use the
model's raw output as the "probabilities."
print("\n\t[PREDICTION] [Prob]\n")
for i in range(top_k_results):
print("\t{:20}: {:.2f}%".format(
labels[top_k_indices[i]],
probabilities[top_k_indices[i]] * 100))
Let's modify the function created before so that we can handle different types
of models:
# Preprocess
img = img.resize((input_details[0]['shape'][1],
input_details[0]['shape'][2]))
input_dtype = input_details[0]['dtype']
if input_dtype == np.uint8:
input_data = np.expand_dims(np.array(img), axis=0)
elif input_dtype == np.int8:
scale, zero_point = input_details[0]['quantization']
img_array = np.array(img, dtype=np.float32) / 255.0
img_array = (
img_array / scale
+ zero_point
).clip(-128, 127).astype(np.int8)
input_data = np.expand_dims(img_array, axis=0)
else: # float32
input_data = np.expand_dims(
np.array(img, dtype=np.float32),
axis=0
) / 255.0
# Inference on Raspi-Zero
start_time = time.time()
interpreter.set_tensor(input_details[0]['index'], input_data)
interpreter.invoke()
end_time = time.time()
inference_time = (end_time -
start_time
) * 1000 # Convert to milliseconds
# Obtain results
predictions = interpreter.get_tensor(output_details[0]
['index'])[0]
if apply_softmax:
# Apply softmax
exp_preds = np.exp(predictions - np.max(predictions))
probabilities = exp_preds / np.sum(exp_preds)
else:
probabilities = predictions
print("\n\t[PREDICTION] [Prob]\n")
for i in range(top_k_results):
print("\t{:20}: {:.1f}%".format(
labels[top_k_indices[i]],
probabilities[top_k_indices[i]] * 100))
print ("\n\tInference time: {:.1f}ms".format(inference_time))
And test it with different images and the int8 quantized model (160 × 160,
alpha = 1.0).
Let’s download a smaller model, such as the one trained for the Nicla Vision
Lab (int8 quantized model, 96x96, alpha = 0.1), as a test. We can use the same
function:
The model lost some accuracy, but it is still OK since our application does not
depend on detecting fine details. Regarding latency, we are around ten times faster
on the Raspi-Zero.
app = Flask(__name__)
# Global variables
picam2 = None
frame = None
frame_lock = threading.Lock()
is_classifying = False
confidence_threshold = 0.8
model_path = "./models/ei-raspi-img-class-int8-quantized-\
model.tflite"
labels = ['background', 'periquito', 'robot']
interpreter = None
classification_queue = Queue(maxsize=1)
def initialize_camera():
global picam2
picam2 = Picamera2()
config = picam2.create_preview_configuration(
main={"size": (320, 240)}
)
picam2.configure(config)
picam2.start()
time.sleep(2) # Wait for camera to warm up
def get_frame():
global frame
while True:
stream = io.BytesIO()
picam2.capture_file(stream, format='jpeg')
with frame_lock:
frame = stream.getvalue()
time.sleep(0.1) # Capture frames more frequently
def generate_frames():
while True:
with frame_lock:
if frame is not None:
yield (
b'--frame\r\n'
b'Content-Type: image/jpeg\r\n\r\n'
+ frame + b'\r\n'
)
time.sleep(0.1)
def load_model():
global interpreter
if interpreter is None:
interpreter = tflite.Interpreter(model_path=model_path)
interpreter.allocate_tensors()
return interpreter
img = img.resize((input_details[0]['shape'][1],
input_details[0]['shape'][2]))
input_data = np.expand_dims(np.array(img), axis=0)\
.astype(input_details[0]['dtype'])
interpreter.set_tensor(input_details[0]['index'], input_data)
interpreter.invoke()
predictions = interpreter.get_tensor(output_details[0]
['index'])[0]
# Handle output based on type
output_dtype = output_details[0]['dtype']
if output_dtype in [np.int8, np.uint8]:
# Dequantize the output
scale, zero_point = output_details[0]['quantization']
predictions = (predictions.astype(np.float32) -
zero_point) * scale
return predictions
def classification_worker():
interpreter = load_model()
while True:
if is_classifying:
with frame_lock:
if frame is not None:
img = Image.open(io.BytesIO(frame))
predictions = classify_image(img, interpreter)
max_prob = np.max(predictions)
if max_prob >= confidence_threshold:
label = labels[np.argmax(predictions)]
else:
label = 'Uncertain'
classification_queue.put({
'label': label,
'probability': float(max_prob)
})
time.sleep(0.1) # Adjust based on your needs
@app.route('/')
def index():
return render_template_string('''
<!DOCTYPE html>
<html>
<head>
<title>Image Classification</title>
<script
src="https://code.jquery.com/jquery-3.6.0.min.js">
</script>
<script>
function startClassification() {
$.post('/start');
$('#startBtn').prop('disabled', true);
$('#stopBtn').prop('disabled', false);
}
function stopClassification() {
$.post('/stop');
$('#startBtn').prop('disabled', false);
$('#stopBtn').prop('disabled', true);
}
function updateConfidence() {
var confidence = $('#confidence').val();
$.post('/update_confidence',
{confidence: confidence}
);
}
function updateClassification() {
$.get('/get_classification', function(data) {
$('#classification').text(data.label + ': '
+ data.probability.toFixed(2));
});
}
$(document).ready(function() {
setInterval(updateClassification, 100);
// Update every 100ms
});
</script>
</head>
<body>
<h1>Image Classification</h1>
<img src="{{ url_for('video_feed') }}"
width="640"
height="480" />
<br>
<button id="startBtn"
onclick="startClassification()">
Start Classification
</button>
<button id="stopBtn"
onclick="stopClassification()"
disabled>
Stop Classification
</button>
<br>
<label for="confidence">Confidence Threshold:</label>
<input type="number"
id="confidence"
name="confidence"
min="0" max="1"
step="0.1"
value="0.8"
onchange="updateConfidence()" />
<br>
<div id="classification">
Waiting for classification...
</div>
</body>
</html>
''')
@app.route('/video_feed')
def video_feed():
return Response(
generate_frames(),
mimetype='multipart/x-mixed-replace; boundary=frame'
)
@app.route('/start', methods=['POST'])
def start_classification():
global is_classifying
is_classifying = True
return '', 204
@app.route('/stop', methods=['POST'])
def stop_classification():
global is_classifying
is_classifying = False
return '', 204
@app.route('/update_confidence', methods=['POST'])
def update_confidence():
global confidence_threshold
confidence_threshold = float(request.form['confidence'])
return '', 204
@app.route('/get_classification')
def get_classification():
if not is_classifying:
return jsonify({'label': 'Not classifying',
'probability': 0})
    try:
        result = classification_queue.get_nowait()
    except Empty:  # needs "from queue import Queue, Empty" among the imports
        result = {'label': 'Processing', 'probability': 0}
return jsonify(result)
if __name__ == '__main__':
initialize_camera()
threading.Thread(target=get_frame, daemon=True).start()
threading.Thread(target=classification_worker,
daemon=True).start()
app.run(host='0.0.0.0', port=5000, threaded=True)
python3 img_class_live_infer.py
Key Components:
1. Flask Web Application: Serves the user interface and handles requests.
2. PiCamera2: Captures images from the Raspberry Pi camera module.
3. TensorFlow Lite: Runs the image classification model.
4. Threading: Manages concurrent operations for smooth performance.
Main Features:
• Live camera feed display
• Real-time image classification
• Adjustable confidence threshold
• Start/Stop classification on demand
Code Structure:
1. Imports and Setup:
• Flask for web application
• PiCamera2 for camera control
• TensorFlow Lite for inference
• Threading and Queue for concurrent operations
2. Global Variables:
• Camera and frame management
• Classification control
• Model and label information
3. Camera Functions:
• initialize_camera(): Sets up the PiCamera2
• get_frame(): Continuously captures frames
• generate_frames(): Yields frames for the web feed
4. Model Functions:
• load_model(): Loads the TFLite model
• classify_image(): Performs inference on a single image
5. Classification Worker:
• Runs in a separate thread
• Continuously classifies frames when active
Key Concepts:
1. Concurrent Operations: Using threads to handle camera capture and
classification separately from the web server.
2. Real-time Updates: Frequent updates to the classification results without
page reloads.
3. Model Reuse: Loading the TFLite model once and reusing it for efficiency.
4. Flexible Configuration: Allowing users to adjust the confidence threshold
on the fly.
Usage:
1. Ensure all dependencies are installed.
2. Run the script on a Raspberry Pi with a camera module.
3. Access the web interface from a browser using the Raspberry Pi’s IP
address.
4. Start classification and adjust settings as needed.
Conclusion:
Image classification has emerged as a powerful and versatile application of ma-
chine learning, with significant implications for various fields, from healthcare
to environmental monitoring. This chapter has demonstrated how to imple-
ment a robust image classification system on edge devices like the Raspi-Zero
and Raspi-5, showcasing the potential for real-time, on-device intelligence.
We’ve explored the entire pipeline of an image classification project, from
data collection and model training using Edge Impulse Studio to deploying
and running inferences on a Raspi. The process highlighted several key points
along the way.
Resources
• Dataset Example
• Setup Test Notebook on a Raspi
• Image Classification Notebook on a Raspi
• CNN to classify Cifar-10 dataset at CoLab
• Cifar 10 - Image Classification on a Raspi
• Python Scripts
• Edge Impulse Project
Object Detection
Overview
Building upon our exploration of image classification, we now turn our atten-
tion to a more advanced computer vision task: object detection. While image
classification assigns a single label to an entire image, object detection goes
further by identifying and locating multiple objects within a single image. This
makes it a more powerful, but also more computationally demanding, task.
Throughout this lab, we’ll cover the fundamentals of object detection and
how it differs from image classification. We’ll also learn how to train, fine-tune,
test, optimize, and deploy popular object detection architectures using a dataset
created from scratch.
Evaluation Metrics
Object detection uses different metrics compared to image classification:
• Intersection over Union (IoU): Measures the overlap between predicted
and ground truth bounding boxes.
• Mean Average Precision (mAP): Combines precision and recall across
all classes and IoU thresholds.
• Frames Per Second (FPS): Measures detection speed, crucial for real-time
applications on edge devices.
You can test some common models online by visiting Object Detection -
MediaPipe Studio.
On Kaggle, we can find the most common pre-trained tflite models to use
with the Raspi, ssd_mobilenet_v1 and EfficientDet. Those models were trained
on the COCO (Common Objects in Context) dataset, which has over 200,000 labeled
images in 91 categories. Download the models and upload them to the
./models folder on the Raspi.
Alternatively, you can find the models and the COCO labels on
GitHub.
For the first part of this lab, we will focus on a pre-trained 300 × 300 SSD-MobileNet
V1 model and compare it with the 320 × 320 EfficientDet-Lite0, also trained on the
COCO 2017 dataset. Both models were converted to TensorFlow Lite format
(4.2 MB for the SSD MobileNet and 4.6 MB for the EfficientDet).
source ~/tflite/bin/activate
cd Documents/TFLITE/
mkdir OBJ_DETECT
cd OBJ_DETECT
mkdir images
mkdir models
cd models
Let’s start a new notebook to follow all the steps to detect objects on an image:
Import the needed libraries:
import time
import numpy as np
import matplotlib.pyplot as plt
from PIL import Image
import tflite_runtime.interpreter as tflite
model_path = "./models/ssd-mobilenet-v1-tflite-default-v1.tflite"
interpreter = tflite.Interpreter(model_path=model_path)
interpreter.allocate_tensors()
input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()
Input details will inform us how the model should be fed with an image.
The shape of (1, 300, 300, 3) with a dtype of uint8 tells us that a non-
normalized (pixel value range from 0 to 255) image with dimensions (300 ×
300 × 3) should be input one by one (Batch Dimension: 1).
The output details include not only the labels ("classes") and probabilities
("scores") but also the relative positions of the bounding boxes ("boxes"),
indicating where each object is located in the image, and the number of detected
objects ("num_detections"). The output details also tell us that the model can
detect a maximum of 10 objects in the image.
So, for the above example, using the same cat image from the Image Classification
lab and inspecting the output, we have a 76% probability of having found an object
with a class ID of 16 in an area delimited by a bounding box of
[0.028011084, 0.020121813, 0.9886069, 0.802299]. Those four numbers correspond to
ymin, xmin, ymax, and xmax, the box coordinates.
Taking into consideration that y goes from the top (ymin) to the bottom (ymax)
and x goes from left (xmin) to the right (xmax), we have, in fact, the coordinates
of the top/left corner and the bottom/right one. With both edges and knowing
the shape of the picture, it is possible to draw a rectangle around the object, as
shown in the figure below:
Next, we should find out what a class ID of 16 means. Opening the file
coco_labels.txt as a list, each element has an associated index, and inspecting
index 16, we get, as expected, cat. The probability is the value returned as the
score.
Let's now upload some images with multiple objects in them for testing.
img_path = "./images/cat_dog.jpeg"
orig_img = Image.open(img_path)
Based on the input details, let’s pre-process the image, changing its shape
and expanding its dimension:
img = orig_img.resize((input_details[0]['shape'][1],
input_details[0]['shape'][2]))
input_data = np.expand_dims(img, axis=0)
input_data.shape, input_data.dtype
The new input_data shape is (1, 300, 300, 3) with a dtype of uint8, which
is compatible with what the model expects.
Using the input_data, let’s run the interpreter, measure the latency, and get
the output:
start_time = time.time()
interpreter.set_tensor(input_details[0]['index'], input_data)
interpreter.invoke()
end_time = time.time()
inference_time = (end_time -
start_time) * 1000 # Convert to milliseconds
print ("Inference time: {:.1f}ms".format(inference_time))
boxes = interpreter.get_tensor(output_details[0]['index'])[0]
classes = interpreter.get_tensor(output_details[1]['index'])[0]
scores = interpreter.get_tensor(output_details[2]['index'])[0]
num_detections = int(interpreter.get_tensor(output_details[3]
['index'])[0])
On a quick inspection, we can see that the model detected 2 objects with a
score over 0.5:
for i in range(num_detections):
if scores[i] > 0.5: # Confidence threshold
print(f"Object {i}:")
print(f" Bounding Box: {boxes[i]}")
print(f" Confidence: {scores[i]}")
print(f" Class: {classes[i]}")
plt.figure(figsize=(12, 8))
plt.imshow(orig_img)
for i in range(num_detections):
if scores[i] > 0.5: # Adjust threshold as needed
ymin, xmin, ymax, xmax = boxes[i]
(left, right, top, bottom) = (xmin * orig_img.width,
xmax * orig_img.width,
ymin * orig_img.height,
ymax * orig_img.height)
rect = plt.Rectangle((left, top), right-left, bottom-top,
fill=False, color='red', linewidth=2)
plt.gca().add_patch(rect)
class_id = int(classes[i])
class_name = labels[class_id]
plt.text(left, top-10, f'{class_name}: {scores[i]:.2f}',
color='red', fontsize=12, backgroundcolor='white')
EfficientDet
EfficientDet is not technically an SSD (Single Shot Detector) model, but it shares
some similarities and builds upon ideas from SSD and other object detection
architectures:
1. EfficientDet:
• Developed by Google researchers in 2019
• Uses EfficientNet as the backbone network
• Employs a novel bi-directional feature pyramid network (BiFPN)
• Uses compound scaling to scale the backbone network and the object detection
components efficiently.
2. Similarities to SSD:
• Both are single-stage detectors, meaning they perform object localization and
classification in a single forward pass.
• Both use multi-scale feature maps to detect objects at different scales.
3. Key differences:
• Backbone: SSD typically uses VGG or MobileNet, while EfficientDet uses
EfficientNet.
• Feature fusion: SSD uses a simple feature pyramid, while EfficientDet uses the
more advanced BiFPN.
• Scaling method: EfficientDet introduces compound scaling for all components of
the network.
4. Advantages of EfficientDet:
• Generally achieves better accuracy-efficiency trade-offs than SSD and many other
object detection models.
• More flexible scaling allows for a family of models with different
size-performance trade-offs.
The Goal
All Machine Learning projects need to start with a goal. Let’s assume we are in
an industrial facility and must sort and count wheels and special boxes.
python3 get_img_data.py
The Python script creates a web-based interface for capturing and organizing
image datasets using a Raspberry Pi and its camera. It's handy for machine
learning projects that need image data, whether it is labeled at capture time or,
as in our case here, labeled later.
Access the web interface from a browser, enter a generic label for the images
you want to capture, and press Start Capture.
Note that the images to be captured will have multiple labels that
should be defined later.
Use the live preview to position the camera and click Capture Image to save
images under the current label (in this case, box-wheel).
When we have enough images, we can press Stop Capture. The captured
images are saved in the folder dataset/box-wheel:
Labeling Data
The next step in an Object Detect project is to create a labeled dataset. We
should label the raw dataset images, creating bounding boxes around each
picture’s objects (box and wheel). We can use labeling tools like LabelImg,
CVAT, Roboflow, or even the Edge Impulse Studio. Since we have already explored
the Edge Impulse tool in other labs, let's use Roboflow here.
We are using Roboflow (free version) here for two main reasons: 1) we can
use its auto-labeling tool, and 2) the annotated dataset is available in several
formats and can be used both on Edge Impulse Studio (we will use it for the
MobileNet V2 and FOMO training) and on CoLab (YOLOv8 training), for example.
With a free Edge Impulse account, an annotated dataset stored there cannot be
used for training on other platforms.
We should upload the raw dataset to Roboflow. Create a free account there
and start a new project, for example, "box-versus-wheel".
We will not go into deep detail about the Roboflow process, since many
tutorials are available.
Annotate
Once the project is created and the dataset is uploaded, you should make the
annotations using the “Auto-Label” Tool. Note that you can also upload images
with only a background, which should be saved w/o any annotations.
Once all images are annotated, you should split them into training, validation,
and testing.
Data Pre-Processing
The last step with the dataset is preprocessing to generate a final version for
training. Let’s resize all images to 320 × 320 and generate augmented versions
of each image (augmentation) to create new training examples from which our
model can learn.
For augmentation, we will rotate the images (±15°), crop, and vary the
brightness and exposure.
Now, you should export the annotated dataset in a format that Edge Impulse,
Ultralytics, and other frameworks/tools understand, for example, YOLOv8. Let's
download a zipped version of the dataset to our desktop.
There are 3 separate folders, one for each split (train/test/valid). Each of them
has 2 subfolders, images and labels. The pictures are stored as image_id.jpg and
image_id.txt, where "image_id" is unique for every picture.
The label file format is class_id followed by the bounding box coordinates, where,
in our case, class_id will be 0 for box and 1 for wheel. The numerical IDs (0, 1,
2, …) follow the alphabetical order of the class names.
The data.yaml file has info about the dataset as the classes’ names (names:
['box', 'wheel']) following the YOLO format.
And that’s it! We are ready to start training using the Edge Impulse Studio
(as we will do in the following step), Ultralytics (as we will when discussing
YOLO), or even training from scratch on CoLab (as we did with the Cifar-10
dataset on the Image Classification lab).
Here, you can clone the project developed for this hands-on lab:
Raspi - Object Detection.
On the Project Dashboard tab, go down to Project info, and for Labeling
method, select Bounding boxes (object detection).
Repeat the process for the test data (upload both the test and validation folders).
At the end of the upload process, you should end up with the annotated dataset of
153 images split into train/test (84%/16%).
Note that the labels will be stored in the label files as 0 and 1, which
correspond to box and wheel.
This choice will not interfere with the training; it will only give us
an idea about the latency of the model on that specific target.
The feature explorer shows that all samples evidence a good separation after
the feature generation.
For training, we should select a pre-trained model. Let's use the MobileNetV2
SSD FPN-Lite (320 × 320 only). It is a pre-trained object detection model designed
to locate up to 10 objects within an image, outputting a bounding box for each
object detected. The model is around 3.7 MB in size. It supports an RGB input at
320 × 320 px.
Regarding the training hyper-parameters, the model will be trained with:
• Epochs: 25
• Batch size: 32
• Learning Rate: 0.15.
As a result, the model ends with an overall precision score (based on COCO
mAP) of 88.8%, higher than the result when using the test data (83.3%).
Transfer the model from your computer to the Raspi folder ./models and
capture or get some images for inference, saving them in the folder ./images.
import time
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.patches as patches
from PIL import Image
import tflite_runtime.interpreter as tflite
model_path = "./models/ei-raspi-object-detection-SSD-\
MobileNetv2-320x320-int8.lite"
labels = ['box', 'wheel']
Remember that the model will output the class ID as values (0 and
1), following an alphabetic order regarding the class names.
Load the model, allocate the tensors, and get the input and output tensor
details:
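As before, this follows the same TFLite interpreter pattern used throughout these
labs (interpreter creation, allocate_tensors(), and reading the input/output
details).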
One crucial difference to note is that the dtype of the input details of the
model is now int8, which means that the input values go from –128 to +127,
while each pixel of our raw image goes from 0 to 255. This means that we
should pre-process the image to match it. We can check here:
input_dtype = input_details[0]['dtype']
input_dtype
numpy.int8
So, let’s open the image and show it:
Checking the input data, we can verify that the input tensor is compatible
with what is expected by the model:
input_data.shape, input_data.dtype
Now, it is time to perform the inference. Let’s also calculate the latency of
the model:
# Inference on Raspi-Zero
start_time = time.time()
interpreter.set_tensor(input_details[0]['index'], input_data)
interpreter.invoke()
end_time = time.time()
inference_time = (
(end_time - start_time)
* 1000 # Convert to milliseconds
)
print ("Inference time: {:.1f}ms".format(inference_time))
The model will take around 600 ms to perform the inference on the Raspi-Zero,
which is around 5 times longer than on a Raspi-5.
Now, we can get the output classes of objects detected, its bounding boxes
coordinates, and probabilities.
boxes = interpreter.get_tensor(output_details[1]['index'])[0]
classes = interpreter.get_tensor(output_details[3]['index'])[0]
scores = interpreter.get_tensor(output_details[0]['index'])[0]
num_detections = int(
interpreter.get_tensor(
output_details[2]['index']
)[0]
)
for i in range(num_detections):
if scores[i] > 0.5: # Confidence threshold
print(f"Object {i}:")
print(f" Bounding Box: {boxes[i]}")
print(f" Confidence: {scores[i]}")
print(f" Class: {classes[i]}")
From the results, we can see that 4 objects were detected: two with class ID 0
(box) and two with class ID 1 (wheel), which is correct!
threshold = 0.5
plt.figure(figsize=(6,6))
plt.imshow(orig_img)
for i in range(num_detections):
if scores[i] > threshold:
ymin, xmin, ymax, xmax = boxes[i]
(left, right, top, bottom) = (xmin * orig_img.width,
xmax * orig_img.width,
ymin * orig_img.height,
ymax * orig_img.height)
rect = plt.Rectangle((left, top), right-left, bottom-top,
fill=False, color='red', linewidth=2)
plt.gca().add_patch(rect)
class_id = int(classes[i])
class_name = labels[class_id]
plt.text(left, top-10, f'{class_name}: {scores[i]:.2f}',
color='red', fontsize=12, backgroundcolor='white')
We start to see false positives and multiple detections, where the model
detects the same object multiple times with different confidence levels and
slightly different bounding boxes.

Sometimes we need to lower the threshold to capture all objects and avoid false
negatives, but doing so tends to produce these multiple detections.
To improve the detection results, we should implement Non-Maximum
Suppression (NMS), which helps eliminate overlapping bounding boxes and
keeps only the most confident detection.
For that, let’s create a general function named non_max_suppression(), with
the role of refining object detection results by eliminating redundant and over-
lapping bounding boxes. It achieves this by iteratively selecting the detection
with the highest confidence score and removing other significantly overlapping
detections based on an Intersection over Union (IoU) threshold.
def non_max_suppression(boxes, scores, iou_threshold=0.5):
    # Boxes are [ymin, xmin, ymax, xmax]; scores are the confidences
    y1, x1, y2, x2 = boxes[:, 0], boxes[:, 1], boxes[:, 2], boxes[:, 3]
    areas = (x2 - x1) * (y2 - y1)
    order = scores.argsort()[::-1]  # Indices sorted by score (descending)
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])
        inter = np.maximum(0.0, xx2 - xx1) * np.maximum(0.0, yy2 - yy1)
        iou = inter / (areas[i] + areas[order[1:]] - inter)
        order = order[1:][iou <= iou_threshold]  # Drop overlapping boxes
    return keep
How it works:
1. Sorting: It starts by sorting all detections by their confidence scores, high-
est to lowest.
2. Selection: It selects the highest-scoring box and adds it to the final list of
detections.
3. Comparison: This selected box is compared with all remaining lower-
scoring boxes.
4. Elimination: Any box that overlaps significantly (above the IoU threshold)
with the selected box is eliminated.
5. Iteration: This process repeats with the next highest-scoring box until all
boxes are processed.
Now, we can define a more precise visualization function that will take into
consideration an IoU threshold, detecting only the objects that were selected
by the non_max_suppression function:
ax.add_patch(rect)
class_name = labels[int(classes[i])]
ax.text(xmin * width, ymin * height - 10,
f'{class_name}: {scores[i]:.2f}', color='red',
fontsize=12, backgroundcolor='white')
plt.show()
Now we can create a function that will call the others, performing inference
on any image:
# Inference on Raspi-Zero
start_time = time.time()
interpreter.set_tensor(input_details[0]['index'], input_data)
interpreter.invoke()
end_time = time.time()
inference_time = (
end_time - start_time
) * 1000 # Convert to milliseconds
Now, running the code, having the same image again with a confidence
threshold of 0.3, but with a small IoU:
img_path = "./images/box_2_wheel_2.jpg"
detect_objects(img_path, conf=0.3,iou=0.05)
The inference with the SSD MobileNet model worked well, but the latency was
significantly high. The inference varied from 0.5 to 1.3 seconds on a Raspi-Zero,
which means around 1 FPS (1 frame per second) or less. One alternative
to speed up the process is to use FOMO (Faster Objects, More Objects).

This novel machine learning algorithm lets us count multiple objects and find
their location in an image in real-time using up to 30× less processing power
and memory than MobileNet SSD or YOLO. The main reason this is possible is
that, while other models calculate the object's size by drawing a square around
it (a bounding box), FOMO ignores the size of the object, providing only the
information about where the object is located in the image, through its centroid
coordinates.
In a typical object detection pipeline, the first stage is extracting features from
the input image. FOMO leverages MobileNetV2 to perform this task. Mo-
bileNetV2 processes the input image to produce a feature map that captures
essential characteristics, such as textures, shapes, and object edges, in a
computationally efficient way.
FOMO divides the image into blocks of pixels using a factor of 8. For the
input of 96 × 96, the grid would be 12 × 12 (96/8 = 12). For a 160 × 160, the
grid will be 20 × 20, and so on. Next, FOMO will run a classifier through each
pixel block to calculate the probability that there is a box or a wheel in each of
them and, subsequently, determine the regions that have the highest probability
of containing the object (If a pixel block has no objects, it will be classified as
background). From the overlap of the final region, the FOMO provides the
coordinates (related to the image dimensions) of the centroid of this region.
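As a small illustrative example (not part of the lab code), the mapping from a
grid cell back to a centroid in input-image pixel coordinates, assuming the
factor-of-8 reduction described above, would be:

input_size = 96              # Model input is 96 x 96
grid = input_size // 8       # 12 x 12 grid of pixel blocks

# Centroid (x, y) of the block at grid position (row=4, col=7),
# expressed in input-image pixel coordinates
row, col = 4, 7
cx, cy = (col + 0.5) * 8, (row + 0.5) * 8
print(grid, (cx, cy))        # 12 (60.0, 36.0)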
On the Image tab, generate the features and go to the Object detection tab.
We should select a pre-trained model for training. Let’s use the FOMO
(Faster Objects, More Objects) MobileNetV2 0.35.
As we did in the previous section, we can deploy the trained model as TFLite
or Linux (AARCH64). Let’s do it now as Linux (AARCH64), a binary that
implements the Edge Impulse Linux protocol.
Edge Impulse for Linux models is delivered in .eim format. This executable
contains our “full impulse” created in Edge Impulse Studio. The impulse
consists of the signal processing block(s) and any learning and anomaly block(s)
we added and trained. It is compiled with optimizations for our processor or
GPU (e.g., NEON instructions on ARM cores), plus a straightforward IPC layer
(over a Unix socket).
At the Deploy tab, select the option Linux (AARCH64) and the int8 model, and
press Build.
cd ~
cd Documents
mkdir EI_Linux
cd EI_Linux
mkdir models
mkdir images
Let's set up a Virtual Environment for working with the Linux Python SDK:
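A plausible version of this setup, assuming a virtual environment named eilinux
and the edge_impulse_linux pip package (both names are illustrative choices here),
would be:

python3 -m venv ~/eilinux --system-site-packages
source ~/eilinux/bin/activate
pip3 install edge_impulse_linux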
chmod +x raspi-object-detection-linux-aarch64-FOMO-int8.eim
jupyter notebook
Let's start a new notebook and follow all the steps to detect boxes and
wheels in an image using the FOMO model and the Edge Impulse Linux
Python SDK.
Import the needed libraries:
model_file = "raspi-object-detection-linux-aarch64-int8.eim"
model_path = "models/"+ model_file # Trained ML model from
# Edge Impulse
labels = ['box', 'wheel']
Remember that the model will output the class ID as values (0 and
1), following an alphabetic order regarding the class names.
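A minimal sketch of creating the runner, assuming the SDK's ImageImpulseRunner
class (imported together with the other libraries), would be:

from edge_impulse_linux.image import ImageImpulseRunner

# Create the runner for the .eim model
runner = ImageImpulseRunner(model_path)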
# Initialize model
model_info = runner.init()
The model_info will contain critical information about our model. Unlike with
the TFLite interpreter, the EI Linux Python SDK library will now prepare the
model for inference.

So, let's open the image and show it (this time, for compatibility, we will use
OpenCV, the CV library used internally by EI; OpenCV reads the image as BGR,
so we will need to convert it to RGB):
Now we will get the features and the preprocessed image (cropped) using
the runner:
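A minimal sketch of this step, assuming the SDK's get_features_from_image()
helper and the RGB image array obtained above (here called img_rgb), would be:

# Returns the feature vector and the cropped/resized image used by the model
features, cropped = runner.get_features_from_image(img_rgb)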
And perform the inference. Let’s also calculate the latency of the model:
res = runner.classify(features)
Let’s get the output classes of objects detected, their bounding boxes centroids,
and probabilities.
The results show that two objects were detected: one with class ID 0 (box)
and one with class ID 1 (wheel), which is correct!
Let’s visualize the result (The threshold is 0.5, the default value set during
the model testing on the Edge Impulse Studio).
Key Features:
1. Single Network Architecture:
• YOLO employs a single neural network to process the entire image.
This network divides the image into a grid and, for each grid cell,
directly predicts bounding boxes and associated class probabilities.
This end-to-end training improves speed and simplifies the model
architecture.
2. Real-Time Processing:
• One of YOLO’s standout features is its ability to perform object
detection in real-time. Depending on the version and hardware,
YOLO can process images at high frames per second (FPS). This
makes it ideal for applications requiring quick and accurate object
detection, such as video surveillance, autonomous driving, and live
sports analysis.
3. Evolution of Versions:
• Over the years, YOLO has undergone significant improvements, from
YOLOv1 to the latest YOLOv10. Each iteration has introduced en-
hancements in accuracy, speed, and efficiency. YOLOv8, for instance,
incorporates advancements in network architecture, improved train-
ing methodologies, and better support for various hardware, ensur-
ing a more robust performance.
• Although YOLOv10 is the family’s newest member with an encour-
aging performance based on its paper, it was just released (May 2024)
and is not fully integrated with the Ultralytics library. Conversely,
the precision-recall curve analysis suggests that YOLOv8 generally
outperforms YOLOv9, capturing a higher proportion of true pos-
itives while minimizing false positives more effectively (for more
details, see this article). So, this lab is based on the YOLOv8n.
• While early versions of YOLO traded off some accuracy for speed,
recent versions have made substantial strides in balancing both. The
newer models are faster and more accurate, detecting small objects
(such as bees) and performing well on complex datasets.
Installation
On our Raspi, let’s deactivate the current environment to create a new working
area:
deactivate
cd ~
cd Documents/
mkdir YOLO
cd YOLO
mkdir models
mkdir images
Let’s set up a Virtual Environment for working with the Ultralytics YOLOv8
And install the Ultralytics packages for local inference on the Raspi
1. Update the packages list, install pip, and upgrade to the latest:
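A plausible version of these installation commands, assuming the standard
Ultralytics pip package and the ~/yolo virtual environment activated below, would
be:

sudo apt update && sudo apt full-upgrade -y
python3 -m venv ~/yolo --system-site-packages
source ~/yolo/bin/activate
pip3 install -U pip
pip3 install ultralytics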
sudo reboot
source ~/yolo/bin/activate
cd ~/Documents/YOLO
and run inference on an image that will be downloaded from the Ultralytics
website, using the YOLOv8n model (the smallest in the family) at the Terminal
(CLI):
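With the Ultralytics CLI, that single-command test could look like this (the
sample image URL is the one used in the Ultralytics docs):

yolo predict model='yolov8n.pt' source='https://ultralytics.com/images/bus.jpg'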
The inference result will appear in the terminal. In the image (bus.jpg), 4
persons, 1 bus, and 1 stop sign were detected:
So, the Ultralytics YOLO is correctly installed on our Raspi. But, on the
Raspi-Zero, an issue is the high latency of this inference, around 18 seconds,
even with the smallest model of the family (YOLOv8n).
NCNN delivers the best inference performance when working with Rasp-
berry Pi devices. NCNN is highly optimized for mobile embedded platforms
(such as ARM architecture).
So, let’s convert our model and rerun the inference:
1. Export a YOLOv8n PyTorch model to NCNN format, creating ./yolov8n_ncnn_model
(see the sketch after this list).
2. Run inference with the exported model (now the source could be the
bus.jpg image that was downloaded from the website to the current
directory on the last inference):
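A sketch of these two steps with the Ultralytics CLI would be:

# 1. Export the PyTorch model to NCNN (creates ./yolov8n_ncnn_model)
yolo export model=yolov8n.pt format=ncnn

# 2. Run inference with the exported model
yolo predict model='./yolov8n_ncnn_model' source='bus.jpg'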
The first inference, when the model is loaded, usually has a high
latency (around 17s), but from the 2nd, it is possible to note that the
inference goes down to around 2s.
python3
Now, we should call the YOLO library from Ultralytics and load the model:
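For example (assuming the NCNN model exported in the previous step):

from ultralytics import YOLO

model = YOLO('./yolov8n_ncnn_model')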
img = 'bus.jpg'
result = model.predict(img, save=True, imgsz=640, conf=0.5,
iou=0.3)
We can verify that the result is almost identical to the one we got when running
the inference at the terminal level (CLI), except that the stop sign was not detected
with the reduced NCNN model. Note that the latency was reduced.
Let's analyze the "result" content.
For example, we can see result[0].boxes.data, showing us the main inference
result, which is a tensor of shape (4, 6). Each line is one of the detected objects:
the first 4 columns are the bounding box coordinates, the 5th is the confidence,
and the 6th is the class (in this case, 0: person and 5: bus):
We can access several inference results separately, as the inference time, and
have it printed in a better format:
inference_time = int(result[0].speed['inference'])
print(f"Inference Time: {inference_time} ms")
With Python, we can create a detailed output that meets our needs (See Model
Prediction with Ultralytics YOLO for more details). Let’s run a Python script
instead of manually entering it line by line in the interpreter, as shown below.
Let’s use nano as our text editor. First, we should create an empty Python script
named, for example, yolov8_tests.py:
nano yolov8_tests.py
# Run inference
img = 'bus.jpg'
result = model.predict(img, save=False, imgsz=640,
conf=0.5, iou=0.3)
Then enter the commands [CTRL+O] + [ENTER] + [CTRL+X] to save the
Python script.
Run the script:
python yolov8_tests.py
The result is the same as running the inference at the terminal level (CLI)
and with the built-in Python interpreter.
Calling the YOLO library and loading the model for inference for
the first time takes a long time, but the inferences after that will be
much faster. For example, the first single inference can take several
seconds, but after that, the inference time should be reduced to less
than 1 second.
For training, let's adapt one of the public examples available from Ultralytics
and run it on Google Colab. Below, you can find mine, to be adapted for your
project:
3. Now, you can import the YOLO and upload your dataset to the CoLab,
pasting the Download code that we get from Roboflow. Note that our
dataset will be mounted under /content/datasets/:
4. It is essential to verify and change the file data.yaml with the correct
path for the images (copy the path on each images folder).
names:
- box
- wheel
nc: 2
roboflow:
license: CC BY 4.0
project: box-versus-wheel-auto-dataset
url: https://universe.roboflow.com/marcelo-rovai-riila/ \
box-versus-wheel-auto-dataset/dataset/5
version: 5
workspace: marcelo-rovai-riila
test: /content/datasets/Box-versus-Wheel-auto-dataset-5/ \
test/images
train: /content/datasets/Box-versus-Wheel-auto-dataset-5/ \
train/images
val: /content/datasets/Box-versus-Wheel-auto-dataset-5/ \
valid/images
5. Define the main hyperparameters that you want to change from default,
for example:
MODEL = 'yolov8n.pt'
IMG_SIZE = 640
EPOCHS = 25 # For a final project, you should consider
# at least 100 epochs
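Training then comes down to a call like the following sketch with the Ultralytics
Python API (the data.yaml path is illustrative and should match your mounted
dataset):

from ultralytics import YOLO

model = YOLO(MODEL)
results = model.train(
    data='/content/datasets/Box-versus-Wheel-auto-dataset-5/data.yaml',
    epochs=EPOCHS,
    imgsz=IMG_SIZE,
)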
The model took a few minutes to be trained and has an excellent result
(mAP50 of 0.995). At the end of the training, all results are saved in the folder
listed, for example: /runs/detect/train/. There, you can find, for example,
the confusion matrix.
7. Note that the trained model (best.pt) is saved in the folder /runs/detect/train/weights/.
Now, you should validate the trained model with the valid/images.
The inference results are saved in the folder runs/detect/predict. Let’s see
some of them:
9. It is advised to export the train, validation, and test results for a Drive at
Google. To do so, we should mount the drive.
from google.colab import drive
drive.mount('/content/gdrive')
and copy the content of /runs folder to a folder that you should create in
your Drive, for example:
!scp -r /content/runs '/content/gdrive/MyDrive/\
10_UNIFEI/Box_vs_Wheel_Project'
cd ..
python
As before, we will import the YOLO library and define our converted model
to detect boxes and wheels:
Now, let’s define an image and call the inference (we will save the image
result this time to external verification):
img = './images/1_box_1_wheel.jpg'
result = model.predict(img, save=True, imgsz=320,
conf=0.5, iou=0.3)
Let's repeat this for several images. The inference result is saved in the variable
result, and the processed image in runs/detect/predict8.
Using FileZilla FTP, we can send the inference result to our Desktop for
verification:
We can see that the inference result is excellent! The model was trained based
on the smallest model of the YOLOv8 family (YOLOv8n). The issue is the
latency, around 1 second (about 1 FPS on the Raspi-Zero). Of course, we can reduce
this latency by converting the model to TFLite or NCNN.
model_path = "./models/ssd-mobilenet-v1-tflite-default-v1.tflite"
python3 object_detection_app.py
Let’s see a technical description of the key modules used in the object detec-
tion application:
1. TensorFlow Lite (tflite_runtime):
• Purpose: Efficient inference of machine learning models on edge
devices.
• Why: TFLite offers reduced model size and optimized performance
compared to full TensorFlow, which is crucial for resource-constrained
devices like Raspberry Pi. It supports hardware acceleration and
quantization, further improving efficiency.
• Key functions: Interpreter for loading and running the model,
get_input_details(), and get_output_details() for interfacing
with the model.
2. Flask:
• Purpose: Lightweight web framework for creating the backend
server.
• Why: Flask’s simplicity and flexibility make it ideal for rapidly devel-
oping and deploying web applications. It’s less resource-intensive
than larger frameworks suitable for edge devices.
• Key components: route decorators for defining API endpoints, Response
objects for streaming video, render_template_string for serving
dynamic HTML.
3. Picamera2:
• Purpose: Interface with the Raspberry Pi camera module.
• Why: Picamera2 is the latest library for controlling Raspberry Pi cam-
eras, offering improved performance and features over the original
Picamera library.
• Key functions: create_preview_configuration() for setting up
the camera, capture_file() for capturing frames.
4. PIL (Python Imaging Library):
• Purpose: Image processing and manipulation.
• Why: PIL provides a wide range of image processing capabilities.
It’s used here to resize images, draw bounding boxes, and convert
between image formats.
• Key classes: Image for loading and manipulating images, ImageDraw
for drawing shapes and text on images.
5. NumPy:
• Purpose: Efficient array operations and numerical computing.
• Why: NumPy’s array operations are much faster than pure Python
lists, which is crucial for efficiently processing image data and model
inputs/outputs.
• Key functions: array() for creating arrays, expand_dims() for adding
dimensions to arrays.
6. Threading:
• Purpose: Concurrent execution of tasks.
• Why: Threading allows simultaneous frame capture, object detec-
tion, and web server operation, crucial for maintaining real-time
performance.
• Key components: Thread class creates separate execution threads,
and Lock is used for thread synchronization.
7. io.BytesIO:
• Purpose: In-memory binary streams.
• Why: Allows efficient handling of image data in memory without
needing temporary files, improving speed and reducing I/O opera-
tions.
8. time:
• Purpose: Time-related functions.
• Why: Used for adding delays (time.sleep()) to control frame rate
and for performance measurements.
9. jQuery (client-side):
• Purpose: Simplified DOM manipulation and AJAX requests.
• Why: It makes it easy to update the web interface dynamically and
communicate with the server without page reloads.
• Key functions: .get() and .post() for AJAX requests, DOM ma-
nipulation methods for updating the UI.
Together, these modules enable the system to process video frames in real time,
while Flask and jQuery provide a user-friendly way to interact with it. A rough
sketch of the streaming pattern follows.
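The full application script is available in the lab resources; as a simplified illustration (not the original code), the streaming part of such an app typically combines a background thread that updates a shared JPEG buffer with a Flask route serving it as an MJPEG stream:

from flask import Flask, Response
import threading, time

app = Flask(__name__)
latest_jpeg = None             # shared buffer holding the latest encoded frame
frame_lock = threading.Lock()  # protects access to the shared buffer

def mjpeg_stream():
    # Multipart MJPEG generator: repeatedly yields the most recent frame
    while True:
        with frame_lock:
            frame = latest_jpeg
        if frame is not None:
            yield (b'--frame\r\n'
                   b'Content-Type: image/jpeg\r\n\r\n' + frame + b'\r\n')
        time.sleep(0.05)       # limit the streaming rate

@app.route('/video_feed')
def video_feed():
    return Response(mjpeg_stream(),
                    mimetype='multipart/x-mixed-replace; boundary=frame')

if __name__ == '__main__':
    # A capture/inference thread (not shown here) updates `latest_jpeg`
    app.run(host='0.0.0.0', port=5000, threaded=True)

In the real app, the capture thread grabs a frame with Picamera2, runs the TFLite interpreter, draws the boxes with PIL, and stores the encoded JPEG in the shared buffer.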
You can test the app with another pre-trained model, such as EfficientDet, by
changing the app line:
model_path = "./models/lite-model_efficientdet_lite0_\
detection_metadata_1.tflite"
Conclusion
This lab has explored the implementation of object detection on edge devices like
the Raspberry Pi, demonstrating the power and potential of running advanced
computer vision tasks on resource-constrained hardware. We’ve covered several
vital aspects:
1. Model Comparison: We examined different object detection models,
including SSD-MobileNet, EfficientDet, FOMO, and YOLO, comparing
their performance and trade-offs on edge devices.
2. Training and Deployment: Using a custom dataset of boxes and wheels
(labeled on Roboflow), we walked through the process of training mod-
els using Edge Impulse Studio and Ultralytics and deploying them on
Raspberry Pi.
3. Optimization Techniques: To improve inference speed on edge devices,
we explored various optimization methods, such as model quantization
(TFLite int8) and format conversion (e.g., to NCNN).
4. Real-time Applications: The lab exemplified a real-time object detection
web application, demonstrating how these models can be integrated into
practical, interactive systems.
5. Performance Considerations: Throughout the lab, we discussed the bal-
ance between model accuracy and inference speed, a critical consideration
for edge AI applications.
The ability to perform object detection on edge devices opens up numerous
possibilities across various domains, from precision agriculture, industrial
automation, and quality control to smart home applications and environmental
monitoring. By processing data locally, these systems can offer reduced latency,
improved privacy, and operation in environments with limited connectivity.
Looking ahead, potential areas for further exploration include:
• Implementing multi-model pipelines for more complex tasks
• Exploring hardware acceleration options for Raspberry Pi
• Integrating object detection with other sensors for more comprehensive
edge AI systems
Resources
• Dataset (“Box versus Wheel”)
• SSD-MobileNet Notebook on a Raspi
• EfficientDet Notebook on a Raspi
• FOMO - EI Linux Notebook on a Raspi
• YOLOv8 Box versus Wheel Dataset Training on CoLab
• Edge Impulse Project - SSD MobileNet and FOMO
• Python Scripts
• Models
Small Language Models (SLM)
Overview
In the fast-growing area of artificial intelligence, edge computing presents an
opportunity to decentralize capabilities traditionally reserved for powerful,
centralized servers. This lab explores the practical integration of small versions
of these models, known as Small Language Models (SLMs), on edge devices such
as the Raspberry Pi 5.
Setup
We could use any Raspi model in the previous labs, but here, the choice must be
the Raspberry Pi 5 (Raspi-5). It is a robust platform that substantially upgrades
the last version 4, equipped with the Broadcom BCM2712, a 2.4 GHz quad-core
64-bit Arm Cortex-A76 CPU featuring Cryptographic Extension and enhanced
caching capabilities. It boasts a VideoCore VII GPU, dual 4Kp60 HDMI®
outputs with HDR, and a 4Kp60 HEVC decoder. Memory options include 4
GB and 8 GB of high-speed LPDDR4X SDRAM, with 8GB being our choice to
run SLMs. It also features expandable storage via a microSD card slot and a
PCIe 2.0 interface for fast peripherals such as M.2 SSDs (Solid State Drives).
For real SLM applications, SSDs are a better option than SD cards.
By the way, as Alasdair Allan discussed, inferencing directly on the Raspberry
Pi 5 CPU—with no GPU acceleration—is now on par with the performance of
the Coral TPU.
For more info, please see the complete article: Benchmarking TensorFlow
and TensorFlow Lite on Raspberry Pi 5.
The Active Cooler has pre-applied thermal pads for heat transfer and is
mounted directly to the Raspberry Pi 5 board using spring-loaded push pins.
The Raspberry Pi firmware actively manages it: at 60°C, the blower’s fan will
be turned on; at 67.5°C, the fan speed will be increased; and finally, at 75°C,
the fan increases to full speed. The blower’s fan will spin down automatically
when the temperature drops below these limits.
Generative AI (GenAI)
Generative AI is an artificial intelligence system capable of creating new, original
content across various mediums such as text, images, audio, and video. These
systems learn patterns from existing data and use that knowledge to generate
novel outputs that didn’t previously exist. Large Language Models (LLMs),
Small Language Models (SLMs), and multimodal models can all be considered
types of GenAI when used for generative tasks.
GenAI provides the conceptual framework for AI-driven content creation,
with LLMs serving as powerful general-purpose text generators. SLMs adapt
this technology for edge computing, while multimodal models extend GenAI
capabilities across different data types. Together, they represent a spectrum of
generative AI technologies, each with its strengths and applications, collectively
driving AI-powered content creation and understanding.
SLMs achieve their compact size through various techniques such as knowl-
edge distillation, model pruning, and quantization. While they may not match
the broad capabilities of larger models, SLMs excel in specific tasks and do-
mains, making them ideal for targeted applications on edge devices.
Ollama
Installing Ollama
Let’s set up and activate a Virtual Environment for working with Ollama:
ollama -v
On the Ollama Library page, we can find the models Ollama supports. For
example, by filtering by Most popular, we can see Meta Llama, Google Gemma,
Microsoft Phi, LLaVa, etc.
Let’s install and run our first small language model, Llama 3.2 1B (and 3B).
The Meta Llama 3.2 series comprises a set of multilingual generative language
models available in 1 billion and 3 billion parameter sizes. These models are
designed to process text input and generate text output. The instruction-tuned
variants within this collection are specifically optimized for multilingual con-
versational applications, including tasks involving information retrieval and
summarization with an agentic approach. When compared to many existing
open-source and proprietary chat models, the Llama 3.2 instruction-tuned mod-
els demonstrate superior performance on widely-used industry benchmarks.
The 1B and 3B models were pruned from the Llama 3.1 8B model, and logits from
the 8B and 70B models were used as token-level targets (token-level distillation).
Knowledge distillation was then used to recover performance (the models were
trained with 9 trillion tokens). The 1B model has 1.24 B parameters, quantized to
8-bit integers (Q8_0), and the 3B model has 3.12 B parameters with Q4_0 quantization,
resulting in sizes of about 1.3 GB and 2 GB, respectively. Both have a context
window of 131,072 tokens.
Running the model with the previous command, we get the Ollama prompt, where
we can enter a question and start chatting with the model; for example,
>>> What is the capital of France?
Almost immediately, we get the correct answer:
The capital of France is Paris.
Using the option --verbose when calling the model will generate several
statistics about its performance (the model is pulled only the first time we run
the command).
Each metric gives insights into how the model processes inputs and generates
outputs. Here’s a breakdown of what each metric means:
• Total Duration (2.620170326 s): This is the complete time taken from the
start of the command to the completion of the response. It encompasses
loading the model, processing the input prompt, and generating the
response.
• Load Duration (39.947908 ms): This duration indicates the time to load
the model or necessary components into memory. If this value is minimal,
it can suggest that the model was preloaded or that only a minimal setup
was required.
• Prompt Eval Count (32 tokens): The number of tokens in the input
prompt. In NLP, tokens are typically words or subwords, so this count in-
cludes all the tokens that the model evaluated to understand and respond
to the query.
• Prompt Eval Duration (1.644773 s): This measures the model’s time to
evaluate or process the input prompt. It accounts for the bulk of the
total duration, implying that understanding the query and preparing a
response is the most time-consuming part of the process.
• Prompt Eval Rate (19.46 tokens/s): This rate indicates how quickly the
model processes tokens from the input prompt. It reflects the model’s
speed in terms of natural language comprehension.
• Eval Count (8 token(s)): This is the number of tokens in the model’s
response, which in this case was, “The capital of France is Paris.”
• Eval Duration (889.941 ms): This is the time taken to generate the out-
put based on the evaluated input. It’s much shorter than the prompt
evaluation, suggesting that generating the response is less complex or
computationally intensive than understanding the prompt.
• Eval Rate (8.99 tokens/s): Similar to the prompt eval rate, this indicates
the speed at which the model generates output tokens. It’s a crucial metric
for understanding the model’s efficiency in output generation.
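As a quick sanity check (not part of the lab output), the rates above follow directly from the counts and durations reported by Ollama:

# Values taken from the statistics listed above
prompt_eval_count = 32           # tokens in the prompt
prompt_eval_duration = 1.644773  # seconds
eval_count = 8                   # tokens in the answer
eval_duration = 0.889941         # seconds

print(f"Prompt eval rate: {prompt_eval_count / prompt_eval_duration:.2f} tokens/s")
print(f"Eval rate: {eval_count / eval_duration:.2f} tokens/s")
# Prompt eval rate: 19.46 tokens/s
# Eval rate: 8.99 tokens/s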
With the larger 3B model, the eval rate is lower: about 5.3 tokens/s versus roughly
9 tokens/s with the smaller 1B model. When asked,
>>> What is the distance between Paris and Santiago, Chile?
The 1B model answered 9,841 kilometers (6,093 miles), which is inac-
curate, and the 3B model answered 7,300 miles (11,700 km), which is close
to the correct value (11,642 km).
Let’s ask for Paris’s coordinates:
>>> what is the latitude and longitude of Paris?
Google Gemma 2 2B
Running the model with the previous command, we again get the Ollama prompt,
where we can enter a question; for example,
>>> What is the capital of France?
Almost immediately, we get the correct answer:
The capital of France is **Paris**.
And its statistics:
We can see that Gemma 2:2B has around the same performance as Llama
3.2:3B, despite having fewer parameters.
Other examples:
You got it! Here are the latitudes and longitudes of Paris,
France:
A good and accurate answer (a little more verbose than the Llama answers).
The model size, in terms of bytes, will depend on the specific quantization
format used. The size can go from 2-bit quantization (q2_k) of 1.4 GB (higher
performance/lower quality) to 16-bit quantization (fp-16) of 7.6 GB (lower
performance/higher quality).
Let’s run the 4-bit quantization (Q4_0), which will need 2.2 GB of RAM, with
an intermediary trade-off regarding output quality and performance.
You can use run or pull to download the model. Ollama keeps track of
the pulled models; if the PHI3 model is not present yet, Ollama pulls it
before running it.
...
In this case, the answer was still longer than we expected, with an eval rate
of 2.25 tokens/s, well below that of Gemma and Llama.
The best model to use is the one that fits your specific needs. Also,
keep in mind that this field evolves quickly, with new models appearing
every day.
Multimodal Models
Multimodal models are artificial intelligence (AI) systems that can process and
understand information from multiple sources, such as images, text, audio, and
video. In our context, multimodal LLMs can process various inputs, including
text, images, and audio, as prompts and convert those prompts into various
outputs, not just the source type.
We will work here with LLaVA-Phi-3, a fine-tuned LLaVA model from Phi
3 Mini 4k. It has strong performance benchmarks that are on par with the
original LLaVA (Large Language and Vision Assistant) model.
The LLaVA-Phi-3 is an end-to-end trained large multimodal model designed
to understand and generate content based on visual inputs (images) and textual
instructions. It combines the capabilities of a visual encoder and a language
model to process and respond to multimodal inputs.
Let’s install and run the model. Trying a text-only prompt first, the response took
around 30 s, with an eval rate of 3.93 tokens/s. Not bad!
But now let’s use an image as input. For that, let’s create a working directory:
cd Documents/
mkdir OLLAMA
cd OLLAMA
Let’s download a 640 × 320 image from the internet, for example (Wikipedia:
Paris, France):
Using FileZilla, for example, let’s upload the image to the OLLAMA folder at
the Raspi-5 and name it image_test_1.jpg. We should have the whole image
path (we can use pwd to get it).
/home/mjrovai/Documents/OLLAMA/image_test_1.jpg
If you use a desktop, you can copy the image path by clicking the image with
the mouse’s right button.
The result was great, but the overall latency was significant; almost 4 minutes
to perform the inference.
While the model is running, we can inspect resource usage with:
htop
All four CPUs run at almost 100% of their capacity, and the memory used
with the model loaded is 3.24 GB. Exiting Ollama, the memory goes down to
around 377 MB (with no desktop).
It is also essential to monitor the temperature. When running the Raspberry
with a desktop, you can have the temperature shown on the taskbar:
If you are “headless”, the temperature can be monitored with the command:
vcgencmd measure_temp
If you are doing nothing, the temperature is around 50°C for CPUs running
at 1%. During inference, with the CPUs at 100%, the temperature can rise to
almost 70°C. This is OK and means the active cooler is working, keeping the
temperature below 80°C / 85°C (its limit).
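If you prefer to log the temperature from Python while experimenting, a small monitoring sketch (assuming vcgencmd is available, as on Raspberry Pi OS; not part of the lab scripts) could be:

import subprocess
import time

def cpu_temperature():
    # `vcgencmd measure_temp` returns a string such as "temp=49.9'C"
    out = subprocess.run(
        ["vcgencmd", "measure_temp"],
        capture_output=True, text=True
    ).stdout.strip()
    return float(out.split("=")[1].split("'")[0])

if __name__ == "__main__":
    # Print the CPU temperature every 5 seconds while a model is running
    for _ in range(10):
        print(f"CPU temperature: {cpu_temperature():.1f} °C")
        time.sleep(5)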
Ollama Python Library
Installation
In the terminal, run the command to install the Ollama Python library:
We will need a text editor or an IDE to create a Python script. If you run the
Raspberry OS on a desktop, several options, such as Thonny and Geany, have
already been installed by default (accessed by [Menu][Programming]). You can
download other IDEs, such as Visual Studio Code, from [Menu][Recommended
Software]. When the window pops up, go to [Programming], select the option
of your choice, and press [Apply].
To run Jupyter Notebook, run the command (change the IP address for yours):
On the terminal, you can see the local URL address to open the notebook:
import ollama
ollama.list()
{'name': 'gemma2:2b',
'model': 'gemma2:2b',
'modified_at': '2024-09-24T19:30:40.053898094+01:00',
'size': 1629518495,
'digest': (
'8ccf136fdd5298f3ffe2d69862750ea7fb56555fa4d5b18c0'
'4e3fa4d82ee09d7'
),
Let’s repeat one of the questions that we did before, but now using ollama.generate()
from Ollama python library. This API will generate a response for the given
prompt with the provided model. This is a streaming endpoint, so there will
be a series of responses. The final response object will include statistics and
additional data from the request.
MODEL = 'gemma2:2b'
PROMPT = 'What is the capital of France?'
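The call itself can be a single line; a minimal sketch, assuming the ollama.generate() signature from the Ollama Python library:

res = ollama.generate(model=MODEL, prompt=PROMPT)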
In case you are running the code as a Python script, you should save it, for
example, test_ollama.py. You can use the IDE to run it or do it directly on the
terminal. Also, remember that you should always call the model and define it
when running a stand-alone script.
python test_ollama.py
{
'model': 'gemma2:2b',
'created_at': '2024-09-25T14:43:31.869633807Z',
'response': 'The capital of France is **Paris**.\n',
'done': True,
'done_reason': 'stop',
'context': [
106, 1645, 108, 1841, 603, 573, 6037, 576, 6081, 235336,
107, 108, 106, 2516, 108, 651, 6037, 576, 6081, 603, 5231,
29437, 168428, 235248, 244304, 241035, 235248, 108
],
'total_duration': 24259469458,
'load_duration': 19830013859,
'prompt_eval_count': 16,
'prompt_eval_duration': 1908757000,
'eval_count': 14,
'eval_duration': 2475410000
}
print(f"\n{res['response']}")
print(
f"\n [INFO] Total Duration: "
f"{res['total_duration']/1e9:.2f} seconds"
)
Now, we get:
Using Ollama.chat()
Another way to get our response is to use ollama.chat(), which generates
the next message in a chat with a provided model. This is a streaming endpoint,
so a series of responses will occur. Streaming can be disabled using "stream":
false. The final response object will also include statistics and additional data
from the request.
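The notebook code is not reproduced here; a minimal sketch of such a two-turn exchange, using the ollama.chat() messages format, could look like this:

import ollama

MODEL = 'gemma2:2b'

# First query
first = ollama.chat(
    model=MODEL,
    messages=[{'role': 'user',
               'content': 'What is the capital of France?'}],
)
answer_1 = first['message']['content']

# Second query: keep the conversation history so the model can build on it
second = ollama.chat(
    model=MODEL,
    messages=[
        {'role': 'user', 'content': 'What is the capital of France?'},
        {'role': 'assistant', 'content': answer_1},
        {'role': 'user',
         'content': 'And what is the latitude and longitude of that city?'},
    ],
)
print(second['message']['content'])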
In a flow like the one sketched above, we run two queries, and the second prompt
takes into account the result of the first one.
Here is how the model responded:
MODEL = 'llava-phi3:3.8b'
PROMPT = "Describe this picture"
response = ollama.generate(
model=MODEL,
prompt=PROMPT,
images= [img]
)
print(f"\n{response['response']}")
print(f"\n [INFO] Total Duration: "
f"{(res['total_duration']/1e9):.2f} seconds")
The model took about 4 minutes (256.45 s) to return with a detailed image
description.
Function Calling
So far, we have seen that by storing the model’s response in a variable, we
can effectively incorporate it into real-world projects. However, a major issue
arises when the model provides varying responses to the same input. For
instance, let’s assume that we only need the name of a country’s capital and
its coordinates as the model’s response in the previous examples, without any
additional information, even when utilizing verbose models like Microsoft Phi.
To ensure consistent responses, we can employ the ‘Ollama function call,’ which
is fully compatible with the OpenAI API.
Once the user enters a country name, the model will return the name of its
capital city (as a string) and the latitude and longitude of that city (as floats).
Using those coordinates, we can use a simple Python library (haversine) to
calculate the distance between the two points.
The idea of this project is to demonstrate a combination of language model
interaction, structured data handling with Pydantic, and geospatial calculations
using the Haversine formula (traditional computing).
First, let us install some libraries. Besides Haversine, the main one is the Ope-
nAI Python library, which provides convenient access to the OpenAI REST API
from any Python 3.7+ application. The other one is Pydantic (together with
instructor), a robust data validation and settings management library for Python
that enhances the robustness and reliability of our codebase. In short, Pydantic
will help ensure that our model’s response is always consistent.
Now, we should create a Python script designed to interact with our model
(LLM) to determine the coordinates of a country’s capital city and calculate the
distance from Santiago de Chile to that capital.
Let’s go over the code:
1. Importing Libraries
import sys
from haversine import haversine
from openai import OpenAI
from pydantic import BaseModel, Field
import instructor
class CityCoord(BaseModel):
city: str = Field(
...,
description="Name of the city"
)
lat: float = Field(
...,
description="Decimal Latitude of the city"
)
lon: float = Field(
...,
description="Decimal Longitude of the city"
)
client = instructor.patch(
    OpenAI(
        base_url="http://localhost:11434/v1",  # local API base URL (Ollama)
        api_key="ollama",                     # API key (not used)
    ),
    mode=instructor.Mode.JSON,  # mode for structured JSON output
)
• OpenAI: This setup initializes an OpenAI client with a local base URL
and an API key (ollama). It uses a local server.
• instructor.patch: Patches the OpenAI client to work in JSON mode, en-
abling structured output that matches the Pydantic model.
resp = client.chat.completions.create(
model=MODEL,
messages=[
{
"role": "user",
"content": f"return the decimal latitude and \
decimal longitude of the capital of the {country}."
}
],
response_model=CityCoord,
max_retries=10
)
distance = haversine(
(mylat, mylon),
(resp.lat, resp.lon),
unit='km'
)
print(
f"Santiago de Chile is about {int(round(distance, -1))} "
f"kilometers away from {resp.city}."
)
If we enter different countries, for example, France, Colombia, and the United
States, we can see that we always receive the same structured information:
If you run the code as a script, the result will be printed on the terminal:
Adding images
Now it is time to wrap up everything so far! Let’s modify the script so that
instead of entering the country name (as a text), the user enters an image, and
the application (based on SLM) returns the city in the image and its geographic
location. With those data, we can calculate the distance as before.
For simplicity, we will implement this new code in two steps. First, the LLM
will analyze the image and create a description (text). This text will be passed
on to another instance, where the model will extract the information needed to
pass along.
We will start by importing the libraries:
import sys
import time
from haversine import haversine
import ollama
from openai import OpenAI
from pydantic import BaseModel, Field
import instructor
We can display the image if we run the code in the Jupyter Notebook; for that,
we also need to import the image and plotting libraries (a sketch follows):
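A minimal sketch of that cell (assuming PIL and matplotlib for display; the file name follows the Machu Picchu example used later in this section):

from PIL import Image
import matplotlib.pyplot as plt

img_path = '/home/mjrovai/Documents/OLLAMA/image_test_3.jpg'  # example path

img = Image.open(img_path)
plt.figure(figsize=(8, 8))
plt.imshow(img)
plt.axis('off')
plt.show()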
MODEL = 'llava-phi3:3.8b'
mylat = -33.33
mylon = -70.51
We can download a new image, for example, Machu Picchu from Wikipedia.
On the Notebook we can see it:
Now, let’s define a function that will receive the image and return the
decimal latitude and decimal longitude of the city in the image, its
name, and the country where it is located:
def image_description(img_path):
with open(img_path, 'rb') as file:
response = ollama.chat(
model=MODEL,
messages=[
{
'role': 'user',
'content': '''return the decimal latitude and \
decimal longitude of the city in the image, \
its name, and what country it is located''',
'images': [file.read()],
},
],
options = {
'temperature': 0,
}
)
#print(response['message']['content'])
return response['message']['content']
The image description generated by the function will then be passed as a prompt
to the model again.
class CityCoord(BaseModel):
city: str = Field(
...,
description="Name of the city in the image"
)
country: str = Field(
...,
description=(
"Name of the country where "
"the city in the image is located"
)
)
lat: float = Field(
...,
description=(
"Decimal latitude of the city in "
"the image"
)
)
lon: float = Field(
...,
description=(
"Decimal longitude of the city in "
"the image"
)
)
img_description = image_description(img_path)

# Send this description to the model
resp = client.chat.completions.create(
    model=MODEL,
    messages=[
        {
            "role": "user",
            "content": img_description,
        }
    ],
    response_model=CityCoord,
    max_retries=10,
    temperature=0,
)
And the second response from the model (resp) will be:
distance = haversine(
(mylat, mylon),
(resp.lat, resp.lon),
unit='km'
)
print((
f"\nThe image shows {resp.city}, with lat: "
f"{round(resp.lat, 2)} and long: "
f"{round(resp.lon, 2)}, located in "
f"{resp.country} and about "
f"{int(round(distance, -1)):,} kilometers "
f"away from Santiago, Chile.\n"
))
print(
f"[INFO] ==> The code (running {MODEL}), "
f"took {elapsed_time:.1f} seconds to execute.\n"
)
python calc_distance_image.py \
/home/mjrovai/Documents/OLLAMA/image_test_3.jpg
Enter the full path of the Machu Picchu image as an argument. We will get the
same result as before.
Of course, there are many ways to optimize the code used here. Still, the
idea is to explore the considerable potential of function calling with SLMs at
the edge, allowing those models to integrate with external functions or APIs.
Going beyond text generation, SLMs can access real-time data, automate tasks,
and interact with various systems.
RAG Implementation
In a basic interaction between a user and a language model, the user asks
a question, which is sent as a prompt to the model. The model generates a
response based solely on its pre-trained knowledge. In a RAG process, there’s
an additional step between the user’s question and the model’s response. The
user’s question triggers a retrieval process from a knowledge base.
cd Documents/OLLAMA/
mkdir RAG-simple-bee
cd RAG-simple-bee/
import ollama
import chromadb
import time
EMB_MODEL = "nomic-embed-text"
MODEL = 'llama3.2:3B'
Initially, a knowledge base about bee facts should be created. This involves
collecting relevant documents and converting them into vector embeddings.
These embeddings are then stored in a vector database, allowing for efficient
similarity searches later. Enter the “document,” a base of “bee facts,” as a
list:
documents = [
"Bee-keeping, also known as apiculture, involves the \
maintenance of bee colonies, typically in hives, by humans.",
"The most commonly kept species of bees is the European \
honey bee (Apis mellifera).",
...
Now, we will create our vector embedding database bee_facts and store the
document in it:
client = chromadb.Client()
collection = client.create_collection(name="bee_facts")
Now that we have our “Knowledge Base” created, we can start making
queries, retrieving data from it:
User Query: The process begins when a user asks a question, such as “How
many bees are in a colony? Who lays eggs, and how much? How about common
pests and diseases?”
prompt = "How many bees are in a colony? Who lays eggs and \
how much? How about common pests and diseases?"
response = ollama.embeddings(
prompt=prompt,
model=EMB_MODEL
)
results = collection.query(
query_embeddings=[response["embedding"]],
n_results=5
)
data = results['documents']
prompt = (
f"Using this data: {data}. "
f"Respond to this prompt: {prompt}"
)
output = ollama.generate(
    model=MODEL,
    prompt=prompt,  # the augmented prompt built above
    options={
        "temperature": 0.0,
        "top_k": 10,
        "top_p": 0.5
    }
)
Response Delivery: Finally, the system returns the generated answer to the
user.
print(output['response'])
These steps can be wrapped into a reusable helper (sketched here with a hypothetical
name and default parameters):
def rag_query(prompt, n_results=5, temp=0.0, top_k=10, top_p=0.5):
    # Embed the user prompt and retrieve the most relevant documents
    response = ollama.embeddings(prompt=prompt, model=EMB_MODEL)
    results = collection.query(
        query_embeddings=[response["embedding"]],
        n_results=n_results
    )
    data = results['documents']
    # Generate an answer grounded in the retrieved data
    output = ollama.generate(
        model=MODEL,
        prompt=f"Using this data: {data}. Respond to this prompt: {prompt}",
        options={
            "temperature": temp,
            "top_k": top_k,
            "top_p": top_p
        }
    )
    print(output['response'])
print(
f"\n[INFO] ==> The code for model: {MODEL}, "
f"took {elapsed_time}s to generate the answer.\n"
)
By the way, if the model supports multiple languages, we can query it in another
language (for example, Portuguese), even though the dataset was created in English:
Going Further
The small language models tested worked well at the edge, both with text and with
images, although latency was high for image inputs. A combination of specific and
dedicated models can lead to better results; for example, in real
cases, an Object Detection model (such as YOLO) can get a general description
and count of objects on an image that, once passed to an LLM, can help extract
essential insights and actions.
Conclusion
This lab has demonstrated how a Raspberry Pi 5 can be transformed into a
potent AI hub capable of running large language models (LLMs) for real-time,
on-site data analysis and insights using Ollama and Python. The Raspberry
Pi’s versatility and power, coupled with the capabilities of lightweight LLMs
like Llama 3.2 and LLaVa-Phi-3-mini, make it an excellent platform for edge
computing applications.
The potential of running LLMs on the edge extends far beyond simple data
processing, as in this lab’s examples. Here are some innovative suggestions for
using this project:
1. Smart Home Automation:
• Integrate SLMs to interpret voice commands or analyze sensor data for
intelligent home automation. This could include real-time monitoring
and control of home devices, security systems, and energy management,
all processed locally without relying on cloud services.
2. Field Data Collection and Analysis:
• Deploy SLMs on Raspberry Pi in remote or mobile setups for real-time
data collection and analysis. This can be used in agriculture to monitor
crop health, in environmental studies for wildlife tracking, or in disaster
response for situational awareness and resource management.
3. Educational Tools:
• Create interactive educational tools that leverage SLMs to provide instant
feedback, language translation, and tutoring. This can be particularly
useful in developing regions with limited access to advanced technology
and internet connectivity.
4. Healthcare Applications:
• Use SLMs for medical diagnostics and patient monitoring. They can
provide real-time analysis of symptoms and suggest potential treatments.
Resources
• 10-Ollama_Python_Library notebook
• 20-Ollama_Function_Calling notebook
• 30-Function_Calling_with_images notebook
• 40-RAG-simple-bee notebook
• calc_distance_image python script
Vision-Language Models (VLM)
Introduction
In this hands-on lab, we will continue exploring AI applications at the edge, from
the basic setup of Florence-2, Microsoft’s state-of-the-art vision foundation model,
to advanced implementations on devices like the Raspberry Pi. We will learn to use
Vision-Language Models (VLMs) for tasks such as captioning, object detection,
grounding, segmentation, and OCR on a Raspberry Pi.
In this tutorial, we will explore how to use Florence-2 for real-time computer
vision applications, such as:
• Image captioning
• Object detection
• Segmentation
• Visual grounding
(Figure: Florence-2 architecture — Image Encoder → Multimodality Encoder
(Transformer) → Multimodality Decoder (Transformer) → Output (Text/Coordinates).)
• Image Encoder: The image encoder is based on the DaViT (Dual Atten-
tion Vision Transformers) architecture. It converts input images into a
series of visual token embeddings. These embeddings serve as the foun-
dational representations of the visual content, capturing both spatial and
contextual information about the image.
Technical Overview
Florence-2 introduces several innovative features that set it apart:
Architecture
Key Capabilities
Florence-2 excels in multiple vision tasks:
Zero-shot Performance
• Image Captioning: Achieves 135.6 CIDEr score on COCO
• Visual Grounding: 84.4% recall@1 on Flickr30k
• Object Detection: 37.5 mAP on COCO val2017
• Referring Expression: 67.0% accuracy on RefCOCO
Fine-tuned Performance
• Competitive with specialist models despite the smaller size
• Outperforms larger models in specific benchmarks
• EfÏcient adaptation to new tasks
Practical Applications
Florence-2 can be applied across various domains:
1. Content Understanding
• Automated image captioning for accessibility
• Visual content moderation
• Media asset management
2. E-commerce
• Product image analysis
• Visual search
• Automated product tagging
3. Healthcare
• Medical image analysis
• Diagnostic assistance
• Research data processing
4. Security & Surveillance
• Object detection and tracking
• Anomaly detection
• Scene understanding
Environment configuration
To run Microsoft Florence-2 on the Raspberry Pi 5, we’ll need a few libraries:
1. Transformers:
• Florence-2 uses the transformers library from Hugging Face for
model loading and inference. This library provides the architecture
for working with pre-trained vision-language models, making it
easy to perform tasks like image captioning, object detection, and
more. Essentially, transformers helps in interacting with the model,
processing input prompts, and obtaining outputs.
2. PyTorch:
• PyTorch is a deep learning framework that provides the infrastruc-
ture needed to run the Florence-2 model, which includes tensor
operations, GPU acceleration (if a GPU is available), and model
training/inference functionalities. The Florence-2 model is trained
in PyTorch, and we need it to leverage its functions, layers, and
computation capabilities to perform inferences on the Raspberry Pi.
3. Timm (PyTorch Image Models):
• Florence-2 uses timm to access efficient implementations of vision
models and pre-trained weights. Specifically, the timm library is
utilized for the image encoder part of Florence-2, in particular for
the DaViT-based vision backbone.
hostname -I
192.168.4.209
Install Dependencies
Let’s set up and activate a Virtual Environment for working with Florence-2:
Install PyTorch
Running the above command on the SSH terminal, we can see the local URL
address to open the notebook:
The notebook with the code used on this initial test can be found on the Lab
GitHub:
• 10-florence2_test.ipynb
We can access it on the remote computer by entering the Raspberry Pi’s IP
address and the provided token in a web browser (copy the entire URL from
the terminal).
From the Home page, create a new notebook [Python 3 (ipykernel) ] and
copy and paste the example code from Hugging Face Hub.
The code is designed to run Florence-2 on a given image to perform object
detection. It loads the model, processes an image and a prompt, and then
generates a response to identify and describe the objects in the image.
• The processor helps prepare text and image inputs.
• The model takes the processed inputs to generate a meaningful response.
• The post-processing step refines the generated output into a more inter-
pretable form, like bounding boxes for detected objects.
import requests
from PIL import Image
import torch
from transformers import AutoProcessor, AutoModelForCausalLM

# Select the device and data type (on the Raspi-5 this falls back to CPU/float32)
device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Florence-2-base",
    torch_dtype=torch_dtype,
    trust_remote_code=True
).to(device)
processor = AutoProcessor.from_pretrained(
"microsoft/Florence-2-base",
trust_remote_code=True
)
prompt = "<OD>"
url = (
"https://huggingface.co/datasets/huggingface/"
"documentation-images/resolve/main/transformers/"
"tasks/car.jpg?download=true"
)
image = Image.open(requests.get(url, stream=True).raw)
inputs = processor(
text=prompt,
images=image,
return_tensors="pt"
).to(device, torch_dtype)
generated_ids = model.generate(
input_ids=inputs["input_ids"],
pixel_values=inputs["pixel_values"],
max_new_tokens=1024,
do_sample=False,
num_beams=3,
)
generated_text = processor.batch_decode(
generated_ids, skip_special_tokens=False)[0]
parsed_answer = processor.post_process_generation(
generated_text,
task="<OD>",
image_size=(image.width, image.height)
)
print(parsed_answer)
import requests
from PIL import Image
import torch
from transformers import AutoProcessor, AutoModelForCausalLM
device = (
"cuda:0"
if torch.cuda.is_available()
else "cpu"
)
torch_dtype = (
torch.float16
if torch.cuda.is_available()
else torch.float32
)
model = AutoModelForCausalLM.from_pretrained(
"microsoft/Florence-2-base",
torch_dtype=torch_dtype,
trust_remote_code=True
).to(device)
processor = AutoProcessor.from_pretrained(
"microsoft/Florence-2-base",
trust_remote_code=True
)
• Model Initialization:
– AutoModelForCausalLM.from_pretrained() loads the pre-trained
Florence-2 model from Microsoft’s repository on Hugging Face. The
torch_dtype is set according to the available hardware (GPU/CPU),
and trust_remote_code=True allows the use of any custom code
that might be provided with the model.
– .to(device) moves the model to the appropriate device (either CPU
or GPU). In our case, it will be set to CPU.
• Processor Initialization:
– AutoProcessor.from_pretrained() loads the processor for Florence-
2. The processor is responsible for transforming text and image
inputs into a format the model can work with (e.g., encoding text,
normalizing images, etc.).
prompt = "<OD>"
url = "https://huggingface.co/datasets/huggingface/"
"documentation-images/resolve/main/transformers/"
"tasks/car.jpg?download=true"
image = Image.open(requests.get(url, stream=True).raw)
Processing Inputs
inputs = processor(
text=prompt,
images=image,
return_tensors="pt"
).to(
device,
torch_dtype
)
generated_ids = model.generate(
input_ids=inputs["input_ids"],
pixel_values=inputs["pixel_values"],
max_new_tokens=1024,
do_sample=False,
num_beams=3,
)
generated_text = processor.batch_decode(
generated_ids,
skip_special_tokens=False
)[0]
parsed_answer = processor.post_process_generation(
generated_text,
task="<OD>",
image_size=(image.width, image.height)
)
print(parsed_answer)
Result
Running the code, we get as the Parsed Answer:
[{'<OD>': {
'bboxes': [
[34.23999786376953, 160.0800018310547, 597.4400024414062],
[371.7599792480469, 272.32000732421875, 241.67999267578125],
[303.67999267578125, 247.4399871826172, 454.0799865722656],
[276.7200012207031, 553.9199829101562, 370.79998779296875],
[96.31999969482422, 280.55999755859375, 198.0800018310547],
[371.2799987792969]
],
'labels': ['car', 'door handle', 'wheel', 'wheel']
}}]
It seems that at least a few objects were detected. We can also implement code
to draw the bounding boxes around the found objects (a sketch of such a helper
follows):
Box (x0, y0, x1, y1): Location tokens correspond to the top-left and
bottom-right corners of a box.
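The drawing helper itself is not listed here; a minimal sketch using matplotlib, assuming the bboxes/labels structure shown above:

import matplotlib.pyplot as plt
import matplotlib.patches as patches

def plot_bbox(image, data):
    # Show the image and overlay one rectangle per detected object
    fig, ax = plt.subplots()
    ax.imshow(image)
    for bbox, label in zip(data['bboxes'], data['labels']):
        x0, y0, x1, y1 = bbox
        rect = patches.Rectangle(
            (x0, y0), x1 - x0, y1 - y0,
            linewidth=2, edgecolor='red', facecolor='none'
        )
        ax.add_patch(rect)
        ax.text(x0, y0, label, color='white',
                bbox=dict(facecolor='red', alpha=0.5))
    ax.axis('off')
    plt.show()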
And running
plot_bbox(image, parsed_answer['<OD>'])
We get:
Florence-2 Tasks
Florence-2 is designed to perform a variety of computer vision and vision-
language tasks through prompts. These tasks can be activated by providing a
specific textual prompt to the model, as we saw with <OD> (Object Detection).
Florence-2’s versatility comes from combining these prompts, allowing us
to guide the model’s behavior to perform specific vision tasks. Changing the
prompt allows us to adapt Florence-2 to different tasks without needing task-
specific modifications in the architecture. This capability directly results from
Florence-2’s unified model architecture and large-scale multi-task training on
the FLD-5B dataset.
Here are some of the key tasks that Florence-2 can perform, along with
example prompts:
Image Captioning
• Prompt: "<CAPTION>"
• Description: Generates a textual description for an input image. This
task helps the model describe what is happening in the image, providing
a human-readable caption for content understanding.
Detailed Captioning
• Prompt: "<DETAILED_CAPTION>"
• Description: Generates a more detailed caption with more nuanced infor-
mation about the scene, such as the objects present and their relationships.
Visual Grounding
• Prompt: "<CAPTION_TO_PHRASE_GROUNDING>"
• Description: Links a textual description to specific regions in an image.
For example, given a prompt like “a green car,” the model highlights
where the green car is in the image. This is useful for human-computer
interaction, where you must find specific objects based on text.
Segmentation
• Prompt: "<REFERRING_EXPRESSION_SEGMENTATION>"
• Description: Performs segmentation based on a referring expression,
such as “the blue cup.” The model identifies and segments the specific
region containing the object mentioned in the prompt (all related pixels).
dogs_cats = Image.open('./images/dogs-cats.jpg')
table = Image.open('./images/table.jpg')
Let’s create a function to facilitate our exploration and to keep track of the
latency of the model for different tasks:
import time  # for simple latency measurement

def run_example(task_prompt, text_input=None, image=None):
    # Reconstructed sketch; the original notebook's timing code may differ
    start = time.perf_counter()
    prompt = task_prompt if text_input is None else task_prompt + text_input
    inputs = processor(
        text=prompt,
        images=image,
        return_tensors="pt"
    ).to(device, torch_dtype)
    generated_ids = model.generate(
        input_ids=inputs["input_ids"],
        pixel_values=inputs["pixel_values"],
        max_new_tokens=1024,
        early_stopping=False,
        do_sample=False,
        num_beams=3,
    )
    generated_text = processor.batch_decode(
        generated_ids,
        skip_special_tokens=False
    )[0]
    parsed_answer = processor.post_process_generation(
        generated_text,
        task=task_prompt,
        image_size=(image.width, image.height)
    )
    print(f"\n[INFO] ==> Latency: {time.perf_counter() - start:.1f} s")
    return parsed_answer
Caption
1. Dogs and Cats
run_example(task_prompt='<CAPTION>',image=dogs_cats)
2. Table
run_example(task_prompt='<CAPTION>',image=table)
DETAILED_CAPTION
1. Dogs and Cats
run_example(task_prompt='<DETAILED_CAPTION>',image=dogs_cats)
2. Table
run_example(task_prompt='<DETAILED_CAPTION>',image=table)
MORE_DETAILED_CAPTION
1. Dogs and Cats
run_example(task_prompt='<MORE_DETAILED_CAPTION>',image=dogs_cats)
2. Table
run_example(task_prompt='< MORE_DETAILED_CAPTION>',image=table)
We can note that the more detailed the caption task, the longer the
latency and the possibility of mistakes (like “The image shows a
group of four cats and a dog in a garden”, instead of two dogs and
three cats).
OD - Object Detection
We can run the same previous function for object detection using the prompt
<OD>.
task_prompt = '<OD>'
results = run_example(task_prompt,image=dogs_cats)
print(results)
{'<OD>': {'bboxes': [
[737.79, 571.90, 1022.46, 980.48],
[0.51, 593.40, 211.45, 991.74],
[445.95, 721.40, 680.44, 850.43],
[39.42, 91.64, 491.00, 933.37],
[570.88, 184.83, 974.33, 782.84]
],
'labels': ['cat', 'cat', 'cat', 'dog', 'dog']
}}
plot_bbox(dogs_cats, results['<OD>'])
task_prompt = '<OD>'
results = run_example(task_prompt,image=table)
plot_bbox(table, results['<OD>'])
DENSE_REGION_CAPTION
It is possible to mix the classic Object Detection with the Caption task in specific
sub-regions of the image:
task_prompt = '<DENSE_REGION_CAPTION>'
results = run_example(task_prompt,image=dogs_cats)
plot_bbox(dogs_cats, results['<DENSE_REGION_CAPTION>'])
results = run_example(task_prompt,image=table)
plot_bbox(table, results['<DENSE_REGION_CAPTION>'])
CAPTION_TO_PHRASE_GROUNDING
With this task, we can enter with a caption, such as “a wine glass”, “a wine
bottle,” or “a half orange,” and Florence-2 will localize the object in the image:
task_prompt = '<CAPTION_TO_PHRASE_GROUNDING>'
results = run_example(
task_prompt,
text_input="a wine bottle",
image=table
)
plot_bbox(table, results['<CAPTION_TO_PHRASE_GROUNDING>'])
results = run_example(
task_prompt,
text_input="a wine glass",
image=table
)
plot_bbox(table, results['<CAPTION_TO_PHRASE_GROUNDING>'])
results = run_example(
task_prompt,
text_input="a half orange",
image=table
)
plot_bbox(table, results['<CAPTION_TO_PHRASE_GROUNDING>'])
Cascade Tasks
We can also enter the image caption as the input text to push Florence-2 to find
more objects:
task_prompt = '<CAPTION>'
results = run_example(task_prompt,image=dogs_cats)
text_input = results[task_prompt]
task_prompt = '<CAPTION_TO_PHRASE_GROUNDING>'
results = run_example(task_prompt, text_input,image=dogs_cats)
plot_bbox(dogs_cats, results['<CAPTION_TO_PHRASE_GROUNDING>'])
OPEN_VOCABULARY_DETECTION
task_prompt = '<OPEN_VOCABULARY_DETECTION>'
text = [
"a house",
"a tree",
"a standing cat at the left",
"a sleeping cat on the ground",
"a standing cat at the right",
"a yellow cat"
]
bbox_results = convert_to_od_format(
results['<OPEN_VOCABULARY_DETECTION>']
)
plot_bbox(dogs_cats, bbox_results)
Note: Trying to use Florence-2 to find objects that are not in the image
can lead to mistakes (see examples in the Notebook).
Polygon (x1, y1, …, xn, yn): Location tokens represent the vertices
of a polygon in clockwise order.
To visualize segmentation results, we also need a helper that draws the returned
polygons on the image (shown here as a simplified, reconstructed sketch):
from PIL import ImageDraw
import numpy as np

def draw_polygons(image, prediction, fill_mask=False):
    """
    Draws segmentation masks with polygons on an image.

    Parameters:
    - image: PIL Image to draw on.
    - prediction: Dictionary containing 'polygons' and 'labels' keys.
      'polygons' is a list of lists, each containing vertices of a polygon.
      'labels' is a list of labels corresponding to each polygon.
    - fill_mask: Boolean indicating whether to fill the polygons with color.
    """
    draw = ImageDraw.Draw(image)
    for polygons, label in zip(prediction['polygons'], prediction['labels']):
        color = 'red'
        for _polygon in polygons:
            _polygon = np.array(_polygon).reshape(-1, 2)
            if len(_polygon) < 3:
                continue  # skip invalid polygons
            _polygon = _polygon.reshape(-1).tolist()
            if fill_mask:
                draw.polygon(_polygon, outline=color, fill='lightblue')
            else:
                draw.polygon(_polygon, outline=color)
            # Label the polygon near its first vertex
            draw.text((_polygon[0] + 8, _polygon[1] + 2), label, fill=color)
    display(image)  # display() is available inside Jupyter notebooks
task_prompt = '<REFERRING_EXPRESSION_SEGMENTATION>'
results = run_example(
task_prompt,
text_input="a wine bottle",
image=table
)
output_image = copy.deepcopy(table)
draw_polygons(output_image,
results['<REFERRING_EXPRESSION_SEGMENTATION>'],
fill_mask=True)
results = run_example(
task_prompt,
text_input="a german sheppard",
image=dogs_cats
)
output_image = copy.deepcopy(dogs_cats)
draw_polygons(output_image,
results['<REFERRING_EXPRESSION_SEGMENTATION>'],
fill_mask=True)
Region to Segmentation
With this task, it is also possible to give the object coordinates in the image
to segment it. The input format is '<loc_x1><loc_y1><loc_x2><loc_y2>',
[x1, y1, x2, y2] , which is the quantized coordinates in [0, 999].
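As an illustration (hypothetical helper with illustrative values; the exact rounding Florence-2 applies internally may differ slightly), converting a pixel-space box into these tokens is a simple scaling:

def box_to_loc_tokens(box, width, height):
    # Scale pixel coordinates to the quantized [0, 999] range
    x1, y1, x2, y2 = box
    qx1 = int(round(x1 / width * 999))
    qy1 = int(round(y1 / height * 999))
    qx2 = int(round(x2 / width * 999))
    qy2 = int(round(y2 / height * 999))
    return f"<loc_{qx1}><loc_{qy1}><loc_{qx2}><loc_{qy2}>"

# Illustrative values only (not taken from the table image)
print(box_to_loc_tokens([100, 200, 300, 400], width=640, height=480))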
For example, when running the code:
task_prompt = '<CAPTION_TO_PHRASE_GROUNDING>'
results = run_example(
task_prompt,
text_input="a half orange",
image=table
)
results
task_prompt = '<REGION_TO_SEGMENTATION>'
results = run_example(
task_prompt,
text_input=(
"<loc_343><loc_690>"
"<loc_531><loc_874>"
),
image=table
)
output_image = copy.deepcopy(table)
draw_polygons(
output_image,
results['<REGION_TO_SEGMENTATION>'],
fill_mask=True
)
Region to Texts
We can also give the region coordinates and ask for a category or a description:
task_prompt = '<REGION_TO_CATEGORY>'
results = run_example(
task_prompt,
text_input=(
"<loc_343><loc_690>"
"<loc_531><loc_874>"
),
image=table
)
results
{
    '<REGION_TO_CATEGORY>':
        'orange<loc_343><loc_690>'
        '<loc_531><loc_874>'
}
The model identified an orange in that region. Let’s ask for a description:
task_prompt = '<REGION_TO_DESCRIPTION>'
results = run_example(
task_prompt,
text_input=(
"<loc_343><loc_690>"
"<loc_531><loc_874>"
),
image=table
)
results
{
    '<REGION_TO_DESCRIPTION>':
        'orange<loc_343><loc_690>'
        '<loc_531><loc_874>'
}
In this case, the description did not provide more details, but it could. Try
another example.
OCR
With Florence-2, we can perform Optical Character Recognition (OCR) on an
image, extracting what is written on it (task_prompt = '<OCR>'), and also get
the bounding boxes (locations) of the detected text (task_prompt =
'<OCR_WITH_REGION>'). These tasks can help extract and locate textual information
in images, such as reading signs, labels, or other forms of text.
Let’s upload a flyer from a talk in Brazil to the Raspi and test how the model
works in another language (here, Portuguese):
flayer = Image.open('./images/embarcados.jpg')
# Display the image
plt.figure(figsize=(8, 8))
plt.imshow(flayer)
plt.axis('off')
#plt.title("Image")
plt.show()
The description is very accurate. Now let’s extract the most important words
with the OCR task:
task_prompt = '<OCR>'
run_example(task_prompt,image=flayer)
{'<OCR>':
'Machine Learning Café com Embarcado Embarcados '
'Democratizando a Inteligência Artificial para Paises em '
'25 de Setembro às 17h Desenvolvimento Toda quarta-feira '
'Marcelo Roval Professor na UNIFIEI e Transmissão via in '
'Co-Director do TinyML4D'}
task_prompt = '<OCR_WITH_REGION>'
results = run_example(task_prompt,image=flayer)
Let’s also create a function to draw bounding boxes around the detected
words:
output_image = copy.deepcopy(flayer)
draw_ocr_bboxes(output_image, results['<OCR_WITH_REGION>'])
results['<OCR_WITH_REGION>']['labels']
'</s>Machine Learning',
'Café',
'com',
'Embarcado',
'Embarcados',
'Democratizando a Inteligência',
'Artificial para Paises em',
'25 de Setembro ás 17h',
'Desenvolvimento',
'Toda quarta-feira',
'Marcelo Roval',
'Professor na UNIFIEI e',
'Transmissão via',
'in',
'Co-Director do TinyML4D']
Latency Summary
The latency observed for different tasks using Florence-2 on the Raspberry Pi
(Raspi-5) varied depending on the complexity of the task:
• Image Captioning: It took approximately 16-17 seconds to generate a
caption for an image.
• Detailed Captioning: Increased latency to around 25-27 seconds, requir-
ing generating more nuanced scene descriptions.
• More Detailed Captioning: It took about 32-50 seconds, and the latency
increased as the description grew more complex.
• Object Detection: It took approximately 20-41 seconds, depending on
the image’s complexity and the number of detected objects.
Running complex tasks can use all 8 GB of the Raspi-5’s memory. For
example, the above screenshot during the Florence OD task shows
4 CPUs at full speed and over 5 GB of memory in use. Consider
increasing the SWAP memory to 2 GB.
Fine-Tuning
As explored in this lab, Florence supports many tasks out of the box, including
captioning, object detection, OCR, and more. However, like other pre-trained
foundation models, Florence-2 may lack domain-specific knowledge. For
example, it may underperform on medical or satellite imagery. In such
cases, fine-tuning with a custom dataset is necessary. The Roboflow tutorial,
How to Fine-tune Florence-2 for Object Detection Tasks, shows how to fine-tune
Florence-2 on object detection datasets to improve model performance for our
specific use case.
Based on the above tutorial, it is possible to fine-tune the Florence-2 model
to detect boxes and wheels used in previous labs:
It is important to note that after fine-tuning, the model can still detect classes
that don’t belong to our custom dataset, like cats, dogs, grapes, etc., as seen
before.
The complete fine-tuning project using a previously annotated dataset in
Roboflow and executed on CoLab can be found in the notebook:
• 30-Finetune_florence_2_on_detection_dataset_box_vs_wheel.ipynb
In another example, in the post, Fine-tuning Florence-2 - Microsoft’s Cutting-
edge Vision Language Models, the authors show an example of fine-tuning
Florence on DocVQA. The authors report that Florence 2 can perform visual ques-
tion answering (VQA), but the released models don’t include VQA capability.
Conclusion
Florence-2 offers a versatile and powerful approach to vision-language tasks at
the edge, providing performance that rivals larger, task-specific models, such
as YOLO for object detection, BERT/RoBERTa for text analysis, and specialized
OCR models.
Thanks to its multi-modal transformer architecture, Florence-2 is more flexi-
ble than YOLO in terms of the tasks it can handle. These include object detection,
image captioning, and visual grounding.
Unlike BERT, which focuses purely on language, Florence-2 integrates vision
and language, allowing it to excel in applications that require both modalities,
such as image captioning and visual grounding.
Moreover, while traditional OCR models such as Tesseract and EasyOCR are
designed solely for recognizing and extracting text from images, Florence-2’s
OCR capabilities are part of a broader framework that includes contextual
understanding and visual-text alignment. This makes it particularly useful
for scenarios that require both reading text and interpreting its context within
images.
Overall, Florence-2 stands out for its ability to seamlessly integrate various
vision-language tasks into a unified model that is efficient enough to run on
edge devices like the Raspberry Pi. This makes it a compelling choice for
developers and researchers exploring AI applications at the edge.
Trade-offs
1. Performance vs. Specialized Models
• YOLO series may offer faster inference for pure object detection
• Specialized OCR models might handle complex document layouts
better
• BERT/RoBERTa provide deeper language understanding for text-
only tasks
2. Resource Requirements
• Higher latency on edge devices (15-200s depending on task)
• Requires careful memory management on Raspberry Pi
• It may need optimization for real-time applications
3. Deployment Considerations
• Initial setup is more complex than single-purpose models
• Requires understanding of multiple task types and prompts
• The learning curve for optimal prompt engineering
2. Multi-modal Applications
• Content moderation systems
• Accessibility tools
• Document analysis workflows
3. Rapid Prototyping
• Quick deployment of vision capabilities
• Testing multiple vision tasks without separate models
• Proof-of-concept development
Future Implications
Florence-2 represents a shift toward unified vision models that could eventually
replace task-specific architectures in many applications. While specialized mod-
els maintain advantages in specific scenarios, the convenience and efficiency of
unified models like Florence-2 make them increasingly attractive for real-world
deployments.
The lab demonstrates Florence-2’s viability on edge devices, suggesting future
IoT, mobile computing, and embedded systems applications where deploying
multiple specialized models would be impractical.
Resources
• 10-florence2_test.ipynb
• 20-florence_2.ipynb
• 30-Finetune_florence_2_on_detection_dataset_box_vs_wheel.ipynb
Shared Labs
The labs in this section cover topics and techniques that are applicable across
different hardware platforms. These labs are designed to be independent
of specific boards, allowing you to focus on the fundamental concepts and
algorithms used in (tiny) ML applications.
By exploring these shared labs, you’ll gain a deeper understanding of the com-
mon challenges and solutions in embedded machine learning. The knowledge
and skills acquired here will be valuable regardless of the specific hardware
you work with in the future.
KWS Feature Engineering
Overview
In this hands-on tutorial, the emphasis is on the critical role that feature en-
gineering plays in optimizing the performance of machine learning models
applied to audio classification tasks, such as speech recognition. It is essential to
be aware that the performance of any machine learning model relies heavily on
the quality of the features used. We will deal with the “under-the-hood” mechanics
of feature extraction, mainly focusing on Mel-frequency Cepstral Coefficients
(MFCCs), a cornerstone in the field of audio signal processing.
Machine learning models, especially traditional algorithms, don’t understand
audio waves. They understand numbers arranged in some meaningful way,
i.e., features. These features encapsulate the characteristics of the audio signal,
making it easier for models to distinguish between different sounds.
This tutorial will deal with generating features specifically for au-
dio classification. This can be particularly interesting for applying
machine learning to a variety of audio data, whether for speech
recognition, music categorization, insect classification based on
wingbeat sounds, or other sound analysis tasks.
The KWS
The most common TinyML application is Keyword Spotting (KWS), a subset
of the broader field of speech recognition. While general speech recognition
transcribes all spoken words into text, Keyword Spotting focuses on detecting
specific “keywords” or “wake words” in a continuous audio stream. The system
is trained to recognize these keywords as predefined phrases or words, such
as yes or no. In short, KWS is a specialized form of speech recognition with its
own set of challenges and requirements.
Here is a typical KWS process using an MFCC feature converter:
Applications of KWS
• Voice Assistants: In devices like Amazon’s Alexa or Google Home, KWS
is used to detect the wake word (“Alexa” or “Hey Google”) to activate the
device.
• Voice-Activated Controls: In automotive or industrial settings, KWS can
be used to initiate specific commands like “Start engine” or “Turn off
lights.”
• Security Systems: Voice-activated security systems may use KWS to
authenticate users based on a spoken passphrase.
Overview to MFCCs
What are MFCCs?
Mel-frequency Cepstral Coefficients (MFCCs) are a set of features derived from
the spectral content of an audio signal. They are based on human auditory
perceptions and are commonly used to capture the phonetic characteristics of
an audio signal. The MFCCs are computed through a multi-step process that
includes pre-emphasis, framing, windowing, applying the Fast Fourier Trans-
form (FFT) to convert the signal to the frequency domain, and finally, applying
the Discrete Cosine Transform (DCT). The result is a compact representation of
the original audio signal’s spectral characteristics.
The image below shows the words YES and NO in their MFCC representation:
Computing MFCCs
The computation of Mel-frequency Cepstral Coefficients (MFCCs) involves
several key steps. Let’s walk through these, which are particularly important
for Keyword Spotting (KWS) tasks on TinyML devices (a compact code sketch
follows the list).
• Pre-emphasis: The first step is pre-emphasis, which is applied to accen-
tuate the high-frequency components of the audio signal and balance the
frequency spectrum. This is achieved by applying a filter that amplifies the
difference between consecutive samples. The formula for pre-emphasis
is: 𝑦(𝑡) = 𝑥(𝑡) − 𝛼𝑥(𝑡 − 1), where 𝛼 is the pre-emphasis factor, typically
around 0.97.
• Framing: Audio signals are divided into short frames (the frame length), usually 20 to 40 milliseconds long. This is based on the assumption that frequencies in a signal are stationary over such a short period, so framing lets us analyze the signal in small time slots. The frame stride (or step) is the displacement between one frame and the next; consecutive frames can be sequential or overlapping.
• Windowing: Each frame is then windowed to minimize the disconti-
nuities at the frame boundaries. A commonly used window function
is the Hamming window. Windowing prepares the signal for a Fourier
transform by minimizing the edge effects. The image below shows three
frames (10, 20, and 30) and the time samples after windowing (note that
the frame length and frame stride are 20 ms):
• Fast Fourier Transform (FFT): The FFT is applied
to each windowed frame to convert it from the time domain to the fre-
quency domain. The FFT gives us a complex-valued representation that
includes both magnitude and phase information. However, for MFCCs,
only the magnitude is used to calculate the Power Spectrum. The power
spectrum is the square of the magnitude spectrum and measures the
energy present at each frequency component.
• Mel Filter Banks: The frequency domain is then mapped to the Mel scale, which approximates the human ear’s response to different frequencies. The idea is to extract more features (more filter banks) in the lower frequencies and fewer in the higher frequencies, so the representation emphasizes the sounds the human ear distinguishes best. Typically, 20 to 40 triangular filters extract the Mel-frequency energies. These energies are then log-transformed to convert multiplicative factors into additive ones, making them more suitable for further processing.
• Discrete Cosine Transform (DCT): The last step is to apply the Discrete Cosine Transform (DCT) to the log Mel energies. The DCT helps to decorrelate the energies, effectively compressing the data and retaining only the most discriminative features. Usually, the first 12 to 13 DCT coefficients are retained, forming the final MFCC feature vector.
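Putting these steps together, here is a minimal, illustrative Python sketch of the MFCC pipeline. This is not Edge Impulse’s exact implementation: the sampling rate, frame and stride lengths, FFT size, number of filters and coefficients, and the filter-bank construction are all assumptions chosen for clarity, and the input is assumed to contain at least one full frame.

import numpy as np
from scipy.fft import rfft, dct

def mfcc_sketch(x, fs=16000, alpha=0.97, frame_ms=30, stride_ms=20,
                n_fft=512, n_filters=32, n_coeffs=13):
    x = np.asarray(x, dtype=float)
    # 1. Pre-emphasis: y(t) = x(t) - alpha * x(t-1)
    y = np.append(x[0], x[1:] - alpha * x[:-1])
    # 2. Framing and 3. Hamming windowing
    flen, fstep = int(fs * frame_ms / 1000), int(fs * stride_ms / 1000)
    n_frames = 1 + (len(y) - flen) // fstep
    win = np.hamming(flen)
    frames = np.stack([y[i * fstep:i * fstep + flen] * win
                       for i in range(n_frames)])
    # 4. FFT of each frame and its power spectrum
    power = (np.abs(rfft(frames, n_fft)) ** 2) / n_fft
    # 5. Triangular Mel filter bank, then log of the filter-bank energies
    mel = lambda hz: 2595 * np.log10(1 + hz / 700)
    inv_mel = lambda mels: 700 * (10 ** (mels / 2595) - 1)
    pts = np.floor((n_fft + 1) * inv_mel(np.linspace(0, mel(fs / 2),
                                                     n_filters + 2)) / fs).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for m in range(1, n_filters + 1):
        fbank[m - 1, pts[m - 1]:pts[m]] = np.linspace(
            0, 1, pts[m] - pts[m - 1], endpoint=False)
        fbank[m - 1, pts[m]:pts[m + 1]] = np.linspace(
            1, 0, pts[m + 1] - pts[m], endpoint=False)
    log_mel = np.log(power @ fbank.T + 1e-10)
    # 6. DCT of the log energies; keep the first n_coeffs per frame
    return dct(log_mel, type=2, axis=1, norm='ortho')[:, :n_coeffs]

For example, a one-second clip sampled at 16 kHz yields an array of shape (49, 13): 49 frames, each described by 13 coefficients, which is the kind of compact two-dimensional representation visualized for the words YES and NO earlier.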
Conclusion
What Feature Extraction technique should we use?
Resources
• Audio_Data_Analysis Colab Notebook
DSP Spectral Features
Overview
TinyML projects related to motion (or vibration) involve data from IMUs (usually accelerometers and gyroscopes). These time-series datasets should be preprocessed before being fed into machine learning model training, which is a challenging area for embedded machine learning. Still, Edge Impulse helps overcome this complexity with its digital signal processing (DSP) preprocessing step and, more specifically, the Spectral Features Block for inertial sensors.
But how does it work under the hood? Let’s dig into it.
The results are similar when this analysis is applied to another dataset built on the same principle but with a different sampling frequency (62.5 Hz instead of 50 Hz).
Data Pre-Processing
The raw data captured by the accelerometer (a time series) should be converted to “tabular data” using one of the typical feature extraction methods described in the previous section.
We should segment the data using a sliding window over the sample data for feature extraction. The project captured accelerometer data every 10 seconds with a sample rate of 62.5 Hz. A 2-second window captures 375 data points (3 axes × 2 seconds × 62.5 samples). The window is slid every 80 ms, creating a larger dataset where each instance has 375 “raw features.”
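As a concrete illustration, the segmentation described above could be sketched as follows. This is an assumption about how the windows are formed, not the Studio’s actual code, and raw is a hypothetical interleaved accX/accY/accZ stream from one 10-second capture:

fs = 62.5                        # sampling frequency (Hz)
win = int(2 * fs) * 3            # 2-second window x 3 axes = 375 raw features
step = int(0.080 * fs) * 3       # 80 ms stride = 5 samples x 3 axes
windows = [raw[i:i + win]
           for i in range(0, len(raw) - win + 1, step)]
# Each element of `windows` is one instance with 375 raw features.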
In the Studio, the previous version (V1) of the Spectral Analysis Block extracted only the RMS as a time-domain feature and, for the frequency domain, the spectral peaks and their frequencies (using the FFT) and the power characteristics (PSD) of the signal over time, resulting in a fixed tabular dataset of 33 features (11 per axis).
Clone the public project. You can also follow the explanation, playing with the code using my Google Colab notebook: Edge Impulse Spectral Analysis Block Notebook.
import math

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pywt
from scipy import signal
from scipy.signal import welch
from scipy.stats import skew, kurtosis, entropy
from sklearn import preprocessing

plt.rcParams['figure.figsize'] = (12, 6)
plt.rcParams['lines.linewidth'] = 3
From the studied project, let’s choose a data sample from the accelerometers with the following parameters:
• Window size: 2 seconds ([2,000] ms)
• Sample frequency: [62.5] Hz
• Filter: [None] (chosen for simplicity)
• FFT length: [16]
f = 62.5 # Hertz
wind_sec = 2 # seconds
FFT_Lenght = 16
axis = ['accX', 'accY', 'accZ']
n_sensors = len(axis)
Selecting the Raw Features on the Studio Spectral Analysis tab, we can copy
all 375 data points of a particular 2-second window to the clipboard.
data = [
-5.6330, 0.2376, 9.8701,
-5.9442, 0.4830, 9.8701,
-5.4217, ...
]
No_raw_features = len(data)
N = int(No_raw_features/n_sensors)
The total number of raw features is 375, but we will work with each axis individually, where 𝑁 = 125 (the number of samples per axis).
We aim to understand how Edge Impulse computes the processed features, so you should also paste the processed features into a variable (to compare the features calculated in Python with the ones provided by the Studio):
features = [
2.7322, -0.0978, -0.3813,
2.3980, 3.8924, 24.6841,
9.6303, ...
]
N_feat = len(features)
N_feat_axis = int(N_feat/n_sensors)
accX = data[0::3]
accY = data[1::3]
accZ = data[2::3]
sensors = [accX, accY, accZ]
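# plot_data() is defined earlier in the notebook (not shown in this excerpt).
# A minimal stand-in, assuming it simply plots each axis of the window
# against the sample index:
def plot_data(sensors, axis, title):
    for sig, name in zip(sensors, axis):
        plt.plot(sig, label=name)
    plt.legend(loc='lower right')
    plt.title(title)
    plt.xlabel('Sample')
    plt.ylabel('Value')
    plt.grid()
    plt.box(False)
    plt.show()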
plot_data(sensors, axis, 'Raw Features')
Subtracting the mean from the raw data (removing its DC component) is a common first step, for several reasons:
• It removes bias: If the data is biased, subtracting the mean can remove it and allow for a more accurate analysis.
• It can reveal patterns: Centering the data can help uncover patterns that might otherwise be hidden. For example, centering can help you identify trends over time when analyzing a time-series dataset.
• It can improve performance: In some machine learning algorithms, centering the data can improve performance by reducing the influence of outliers and making the data more easily comparable. Overall, subtracting the mean is a simple but powerful technique that can improve the analysis and interpretation of data.
dtmean = [
    (sum(x) / len(x))
    for x in sensors
]
[
    print('mean_' + x + ' =', round(y, 4))
    for x, y in zip(axis, dtmean)
][0]
$$x_{\mathrm{RMS}} = \sqrt{\frac{1}{n}\left(x_1^2 + x_2^2 + \cdots + x_n^2\right)}$$
Note that the RMS value differs between the original raw data and the data after the mean has been subtracted.
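The RMS values listed below can be reproduced with a short snippet like the following (a minimal sketch, assuming the sensors list and dtmean values computed above, and assuming the RMS is taken on the mean-subtracted signal, which appears to match the Studio’s values):

# RMS per axis, computed after subtracting the per-axis mean.
for name, sig, m in zip(axis, sensors, dtmean):
    rms = np.sqrt(np.mean((np.array(sig) - m) ** 2))
    print(f'rms_{name}= {round(rms, 4)}')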
We can compare the calculated RMS values here with the ones presented by
Edge Impulse:
rms_accX= 2.7322
rms_accY= 0.7833
rms_accZ= 0.1383
Compared with Edge Impulse result features:
[2.7322, 0.7833, 0.1383]
Skewness and kurtosis calculation
In statistics, skewness and kurtosis are two ways to measure the shape of a
distribution.
Here, we can see the sensor values distribution:
• A negative skew indicates that the tail is on the left side of the distribution,
which extends towards more negative values.
• A positive skew indicates that the tail is on the right side of the distribution,
which extends towards more positive values.
• A zero value indicates no skewness in the distribution at all, meaning the
distribution is perfectly symmetrical.
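The skewness values below can be computed directly with the scipy.stats functions imported earlier; a minimal sketch (scipy’s default settings differ slightly from the Studio’s implementation, which explains the small numerical differences in the comparisons):

for name, sig in zip(axis, sensors):
    print(f'skew_{name}= {round(skew(sig), 4)}')
# Kurtosis is computed analogously with kurtosis(sig), which uses
# Fisher's definition (excess kurtosis) by default.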
skew_accX= -0.099
skew_accY= 0.1756
skew_accZ= 6.9463
Compared with Edge Impulse result features:
[-0.0978, 0.1735, 6.8629]
Kurtosis is a measure of whether or not a distribution is heavy-tailed or
light-tailed relative to a normal distribution.
kurt_accX= -0.3475
kurt_accY= 1.2673
kurt_accZ= 68.1123
Compared with Edge Impulse result features:
[-0.3813, 1.1696, 65.3726]
Spectral features
The filtered signal is passed to the Spectral power section, which computes the
FFT to generate the spectral features.
Since the sampled window is usually larger than the FFT size, the window
will be broken into frames (or “sub-windows”), and the FFT is calculated over
each frame.
FFT length - The FFT size. This determines the number of FFT bins and the
resolution of frequency peaks that can be separated. A low number means
more signals will average together in the same FFT bin, but it also reduces the
number of features and model size. A high number will separate more signals
into separate bins, generating a larger model.
• The total number of Spectral Power features will vary depending on how
you set the filter and FFT parameters. With No filtering, the number of
features is 1/2 of the FFT Length.
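The plotting code below assumes per-axis PSD estimates fax/Pax, fay/Pay, and faz/Paz. A minimal sketch of how they could be obtained with Welch's method (the exact window and overlap parameters used by Edge Impulse may differ):

# Welch PSD estimate per axis; each 2-second window is split into
# sub-windows of FFT_Lenght (16) samples.
fax, Pax = welch(accX, fs=f, nperseg=FFT_Lenght)
fay, Pay = welch(accY, fs=f, nperseg=FFT_Lenght)
faz, Paz = welch(accZ, fs=f, nperseg=FFT_Lenght)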
plt.plot(fax, Pax, label='accX')
plt.plot(fay, Pay, label='accY')
plt.plot(faz, Paz, label='accZ')
plt.legend(loc='upper right')
plt.xlabel('Frequency (Hz)')
# plt.ylabel('PSD [V**2/Hz]')
plt.ylabel('Power')
plt.title("Power spectrum P(f) using Welch's method")
plt.grid()
plt.box(False)
plt.show()
Besides the power spectrum, we can also include the skewness and kurtosis of the features in the frequency domain (this should become available in a newer version of the block).
Let’s now list all the spectral features per axis and compare them with the ones computed by Edge Impulse:
Time-frequency domain
Wavelets
• Type: Wavelet
• Wavelet Decomposition Level: 1
• Wavelet: bior1.3
wavelet_name='bior1.3'
num_layer = 1
wavelet = pywt.Wavelet(wavelet_name)
[phi_d,psi_d,phi_r,psi_r,x] = wavelet.wavefun(level=5)
plt.plot(x, psi_d, color='red')
plt.title('Wavelet Function')
plt.ylabel('Value')
plt.xlabel('Time')
plt.grid()
plt.box(False)
plt.show()
features = [
3.6251, 0.0615, 0.0615,
-7.3517, -2.7641, 2.8462,
5.0924, ...
]
N_feat = len(features)
N_feat_axis = int(N_feat/n_sensors)
Edge Impulse computes the Discrete Wavelet Transform (DWT) for each of the wavelet decomposition levels selected. After that, the features are extracted.
In the case of wavelets, the extracted features are basic statistical values, crossing values, and entropy. There are, in total, 14 features per layer, as listed below:
• [11] Statistical features: n5, n25, n75, n95, mean, median, standard deviation (std), variance (var), root mean square (rms), kurtosis, and skewness (skew).
• [2] Crossing features: the zero crossing rate (zcross) and the mean crossing rate (mcross), i.e., the number of times the signal passes through the baseline (𝑦 = 0) and through the average level (𝑦 = 𝜇) per unit of time, respectively.
• [1] Complexity feature: entropy, a measure of the complexity of the signal.
All of the above 14 values are calculated for each layer (including L0, the original signal).
• The total number of features varies depending on how you set the filter
and the number of layers. For example, with [None] filtering and Level[1],
the number of features per axis will be 14 × 2 (L0 and L1) = 28. For the
three axes, we will have a total of 84 features.
Wavelet Analysis
Wavelet analysis decomposes the signal (accX, accY, and accZ) into different frequency components using a set of filters. These filters separate each signal into a low-frequency component (the slowly varying part containing long-term patterns), such as accX_l1, accY_l1, and accZ_l1, and a high-frequency component (the rapidly varying part containing short-term patterns), such as accX_d1, accY_d1, and accZ_d1, permitting the extraction of features for further analysis or classification.
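A minimal sketch of this level-1 decomposition with PyWavelets is shown below; the sensors_l1 list, which the crossing calculations later rely on, is assumed here to hold the approximation (low-frequency) components:

sensors_l1 = []   # approximation components: accX_l1, accY_l1, accZ_l1
details_l1 = []   # detail components: accX_d1, accY_d1, accZ_d1
for sig in sensors:
    cA, cD = pywt.dwt(sig, wavelet_name)  # single-level DWT with bior1.3
    sensors_l1.append(cA)
    details_l1.append(cD)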
Feature Extraction
Let’s start with the basic statistical features. Note that we apply the function
for both the original signals and the resultant cAs from the DWT:
def calculate_statistics(signal):
    n5 = np.percentile(signal, 5)
    n25 = np.percentile(signal, 25)
    n75 = np.percentile(signal, 75)
    n95 = np.percentile(signal, 95)
    median = np.percentile(signal, 50)
    mean = np.mean(signal)
    std = np.std(signal)
    var = np.var(signal)
    rms = np.sqrt(np.mean(np.square(signal)))
    return [n5, n25, n75, n95, median, mean, std, var, rms]
Zero crossing (zcross) is the number of times the wavelet coefficient crosses the zero axis. It can be used to measure the signal’s frequency content since high-frequency signals tend to have more zero crossings than low-frequency signals.
Mean crossing (mcross), on the other hand, is the number of times the wavelet coefficient crosses the mean of the signal. It can be used to measure the amplitude since high-amplitude signals tend to have more mean crossings than low-amplitude signals.
def getZeroCrossingRate(arr):
    my_array = np.array(arr)
    zcross = float(
        "{:.2f}".format(
            (((my_array[:-1] * my_array[1:]) < 0).sum()) / len(arr)
        )
    )
    return zcross


def getMeanCrossingRate(arr):
    mcross = getZeroCrossingRate(np.array(arr) - np.mean(arr))
    return mcross


def calculate_crossings(list):
    zcross = []
    mcross = []
    for i in range(len(list)):
        zcross_i = getZeroCrossingRate(list[i])
        zcross.append(zcross_i)
        mcross_i = getMeanCrossingRate(list[i])
        mcross.append(mcross_i)
    return zcross, mcross
cross_l0 = calculate_crossings(sensors)
cross_l1 = calculate_crossings(sensors_l1)
Let’s now list all the wavelet features and create a list by layers.
L1_features_names = [
"L1-n5", "L1-n25", "L1-n75", "L1-n95", "L1-median",
"L1-mean", "L1-std", "L1-var", "L1-rms", "L1-skew",
"L1-Kurtosis", "L1-zcross", "L1-mcross", "L1-entropy"
]
L0_features_names = [
"L0-n5", "L0-n25", "L0-n75", "L0-n95", "L0-median",
"L0-mean", "L0-std", "L0-var", "L0-rms", "L0-skew",
"L0-Kurtosis", "L0-zcross", "L0-mcross", "L0-entropy"
]
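The assembly loops below also reference stat_feat_l0, skew_l0, kurtosis_l0, entropy_l0 and their L1 counterparts, which are computed earlier in the notebook. A minimal sketch of how they could be obtained with the helpers defined above (the histogram-based entropy is an assumption; Edge Impulse's exact entropy definition may differ):

def calculate_entropy(sig, bins=16):
    # Entropy of the normalized histogram of the signal values (assumed).
    counts, _ = np.histogram(sig, bins=bins)
    return entropy(counts / counts.sum())

stat_feat_l0 = [calculate_statistics(s) for s in sensors]
stat_feat_l1 = [calculate_statistics(s) for s in sensors_l1]
skew_l0 = [skew(s) for s in sensors]
skew_l1 = [skew(s) for s in sensors_l1]
kurtosis_l0 = [kurtosis(s) for s in sensors]
kurtosis_l1 = [kurtosis(s) for s in sensors_l1]
entropy_l0 = [calculate_entropy(s) for s in sensors]
entropy_l1 = [calculate_entropy(s) for s in sensors_l1]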
all_feat_l0 = []
for i in range(len(axis)):
    feat_l0 = (
        stat_feat_l0[i]
        + [skew_l0[i]]
        + [kurtosis_l0[i]]
        + [cross_l0[0][i]]
        + [cross_l0[1][i]]
        + [entropy_l0[i]]
    )
    [print(axis[i] + ' ' + x + '= ', round(y, 4))
     for x, y in zip(L0_features_names, feat_l0)][0]
    all_feat_l0.append(feat_l0)

all_feat_l0 = [
    item
    for sublist in all_feat_l0
    for item in sublist
]
print(f"\nAll L0 Features = {len(all_feat_l0)}")
all_feat_l1 = []
for i in range(len(axis)):
    feat_l1 = (
        stat_feat_l1[i]
        + [skew_l1[i]]
        + [kurtosis_l1[i]]
        + [cross_l1[0][i]]
        + [cross_l1[1][i]]
        + [entropy_l1[i]]
    )
    [print(axis[i] + ' ' + x + '= ', round(y, 4))
     for x, y in zip(L1_features_names, feat_l1)][0]
    all_feat_l1.append(feat_l1)
all_feat_l1 = [
    item
    for sublist in all_feat_l1
    for item in sublist
]

print(f"\nAll L1 Features = {len(all_feat_l1)}")
Conclusion
Edge Impulse Studio is a powerful online platform that can handle the pre-
processing task for us. Still, given our engineering perspective, we want to
understand what is happening under the hood. This knowledge will help us
find the best options and hyper-parameters for tuning our projects.
Daniel Situnayake wrote in his blog: “Raw sensor data is highly dimensional
and noisy. Digital signal processing algorithms help us sift the signal from
the noise. DSP is an essential part of embedded engineering, and many edge
processors have on-board acceleration for DSP. As an ML engineer, learning
basic DSP gives you superpowers for handling high-frequency time series data
in your models.” I recommend you read Dan’s excellent post in its totality: nn
to cpp: What you need to know about porting deep learning models to the
edge.
APPENDIX
PhD Survival Guide
Career Advice
On Oral Presentation Advice
Video Resources
1. You and Your Research by Richard Hamming: a video lecture of Richard Hamming’s talk on achieving significant research contributions.
2. How to Write a Great Research Paper: Simon Peyton Jones shares tips on writing research papers and presenting ideas effectively.
REFERENCES
1625
References
0001, Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin Zheng, Eddie Q.
Yan, Haichen Shen, Meghan Cowan, et al. 2018a. “TVM: An Automated
End-to-End Optimizing Compiler for Deep Learning.” In 13th USENIX
Symposium on Operating Systems Design and Implementation (OSDI 18), 578–94.
https://www.usenix.org/conference/osdi18/presentation/chen.
———, et al. 2018b. “TVM: An Automated End-to-End Optimizing Compiler
for Deep Learning.” In OSDI, 578–94. https://www.usenix.org/conferenc
e/osdi18/presentation/chen.
0003, Mu Li, David G. Andersen, Alexander J. Smola, and Kai Yu. 2014.
“Communication EfÏcient Distributed Machine Learning with the Param-
eter Server.” In Advances in Neural Information Processing Systems 27: An-
nual Conference on Neural Information Processing Systems 2014, December 8-
13 2014, Montreal, Quebec, Canada, edited by Zoubin Ghahramani, Max
Welling, Corinna Cortes, Neil D. Lawrence, and Kilian Q. Weinberger, 19–27.
https://proceedings.neurips.cc/paper/2014/hash/1ff1de774005f8da13f42
943881c655f-Abstract.html.
Abadi, Martin, Andy Chu, Ian Goodfellow, H. Brendan McMahan, Ilya Mironov,
Kunal Talwar, and Li Zhang. 2016. “Deep Learning with Differential Pri-
vacy.” In Proceedings of the 2016 ACM SIGSAC Conference on Computer and
Communications Security, 308–18. CCS ’16. New York, NY, USA: ACM.
https://doi.org/10.1145/2976749.2978318.
Abadi, Martı́n, Ashish Agarwal, Paul Barham, et al. 2015. “TensorFlow: Large-
Scale Machine Learning on Heterogeneous Systems.” Google Brain.
Abadi, Martı́n, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen,
Craig Citro, Greg S. Corrado, et al. 2016. “TensorFlow: Large-Scale Ma-
chine Learning on Heterogeneous Distributed Systems.” arXiv Preprint
arXiv:1603.04467, March. http://arxiv.org/abs/1603.04467v2.
Abadi, Martı́n, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey
Dean, Matthieu Devin, et al. 2016. “TensorFlow: A System for Large-Scale
Machine Learning.” In 12th USENIX Symposium on Operating Systems Design
and Implementation (OSDI 16), 265–83. USENIX Association. https://www.
usenix.org/conference/osdi16/technical-sessions/presentation/abadi.
Abdelkader, Ahmed, Michael J. Curry, Liam Fowl, Tom Goldstein, Avi Schwarzschild,
Manli Shu, Christoph Studer, and Chen Zhu. 2020. “Headless Horse-
man: Adversarial Attacks on Transfer Learning Models.” In ICASSP 2020 -
2020 IEEE International Conference on Acoustics, Speech and Signal Processing
1627
References 1628
Amiel, Frederic, Christophe Clavier, and Michael Tunstall. 2006. “Fault Analysis
of DPA-Resistant Algorithms.” In Fault Diagnosis and Tolerance in Cryptogra-
phy, 223–36. Springer; Springer Berlin Heidelberg. https://doi.org/10.100
7/11889700/_20.
Amodei, Dario, Danny Hernandez, et al. 2018. “AI and Compute.” OpenAI
Blog. https://openai.com/research/ai-and-compute.
Amodei, Dario, Chris Olah, Jacob Steinhardt, Paul Christiano, John Schulman,
and Dan Mané. 2016. “Concrete Problems in AI Safety.” arXiv Preprint
arXiv:1606.06565, June. http://arxiv.org/abs/1606.06565v2.
Andrae, Anders, and Tomas Edler. 2015. “On Global Electricity Usage of
Communication Technology: Trends to 2030.” Challenges 6 (1): 117–57.
https://doi.org/10.3390/challe6010117.
Antonakakis, Manos, Tim April, Michael Bailey, Matt Bernhard, Elie Bursztein,
Jaime Cochran, Zakir Durumeric, et al. 2017. “Understanding the Mi-
rai Botnet.” In 26th USENIX Security Symposium (USENIX Security 17),
16:1093–1110.
Ardila, Rosana, Megan Branson, Kelly Davis, Michael Kohler, Josh Meyer,
Michael Henretty, Reuben Morais, Lindsay Saunders, Francis Tyers, and
Gregor Weber. 2020. “Common Voice: A Massively-Multilingual Speech
Corpus.” In Proceedings of the Twelfth Language Resources and Evaluation Confer-
ence, 4218–22. Marseille, France: European Language Resources Association.
https://aclanthology.org/2020.lrec-1.520.
Arifeen, Tooba, Abdus Sami Hassan, and Jeong-A Lee. 2020. “Approximate
Triple Modular Redundancy: A Survey.” IEEE Access 8: 139851–67. https:
//doi.org/10.1109/access.2020.3012673.
Arivazhagan, Manoj Ghuhan, Vinay Aggarwal, Aaditya Kumar Singh, and
Sunav Choudhary. 2019. “Federated Learning with Personalization Layers.”
CoRR abs/1912.00818 (December). http://arxiv.org/abs/1912.00818v1.
Asonov, D., and R. Agrawal. n.d. “Keyboard Acoustic Emanations.” In IEEE
Symposium on Security and Privacy, 2004. Proceedings. 2004, 3–11. IEEE; IEEE.
https://doi.org/10.1109/secpri.2004.1301311.
Ateniese, Giuseppe, Luigi V. Mancini, Angelo Spognardi, Antonio Villani,
Domenico Vitali, and Giovanni Felici. 2015. “Hacking Smart Machines
with Smarter Ones: How to Extract Meaningful Data from Machine Learn-
ing Classifiers.” International Journal of Security and Networks 10 (3): 137.
https://doi.org/10.1504/ijsn.2015.071829.
Attia, Zachi I., Alan Sugrue, Samuel J. Asirvatham, Michael J. Ackerman, Suraj
Kapa, Paul A. Friedman, and Peter A. Noseworthy. 2018. “Noninvasive
Assessment of Dofetilide Plasma Concentration Using a Deep Learning
(Neural Network) Analysis of the Surface Electrocardiogram: A Proof of
Concept Study.” PLOS ONE 13 (8): e0201059. https://doi.org/10.1371/jo
urnal.pone.0201059.
Aygun, Sercan, Ece Olcay Gunes, and Christophe De Vleeschouwer. 2021.
“EfÏcient and Robust Bitstream Processing in Binarised Neural Networks.”
Electronics Letters 57 (5): 219–22. https://doi.org/10.1049/ell2.12045.
Ba, Jimmy Lei, Jamie Ryan Kiros, and Geoffrey E. Hinton. 2016. “Layer Nor-
malization.” arXiv Preprint arXiv:1607.06450, July. http://arxiv.org/abs/16
07.06450v1.
References 1630
Barroso, Luiz André, Urs Hölzle, and Parthasarathy Ranganathan. 2019. The
Datacenter as a Computer: Designing Warehouse-Scale Machines. Springer
International Publishing. https://doi.org/10.1007/978-3-031-01761-2.
Baydin, Atilim Gunes, Barak A. Pearlmutter, Alexey Andreyevich Radul, and
Jeffrey Mark Siskind. 2017a. “Automatic Differentiation in Machine Learn-
ing: A Survey.” J. Mach. Learn. Res. 18: 153:1–43. https://jmlr.org/papers/
v18/17-468.html.
———. 2017b. “Automatic Differentiation in Machine Learning: A Survey.”
J. Mach. Learn. Res. 18 (153): 153:1–43. https://jmlr.org/papers/v18/17-
468.html.
Beaton, Albert E., and John W. Tukey. 1974. “The Fitting of Power Series, Mean-
ing Polynomials, Illustrated on Band-Spectroscopic Data.” Technometrics 16
(2): 147. https://doi.org/10.2307/1267936.
Bedford Taylor, Michael. 2017. “The Evolution of Bitcoin Hardware.” Computer
50 (9): 58–66. https://doi.org/10.1109/mc.2017.3571056.
Bender, Emily M., Timnit Gebru, Angelina McMillan-Major, and Shmargaret
Shmitchell. 2021. “On the Dangers of Stochastic Parrots: Can Language
Models Be Too Big? �.” In Proceedings of the 2021 ACM Conference on Fairness,
Accountability, and Transparency, 610–23. ACM. https://doi.org/10.1145/34
42188.3445922.
Bengio, Emmanuel, Pierre-Luc Bacon, Joelle Pineau, and Doina Precup. 2015.
“Conditional Computation in Neural Networks for Faster Models.” arXiv
Preprint arXiv:1511.06297, November. http://arxiv.org/abs/1511.06297v2.
Bengio, Yoshua, Nicholas Léonard, and Aaron Courville. 2013b. “Estimat-
ing or Propagating Gradients Through Stochastic Neurons for Conditional
Computation.” arXiv Preprint, August. http://arxiv.org/abs/1308.3432v1.
———. 2013a. “Estimating or Propagating Gradients Through Stochastic
Neurons for Conditional Computation.” arXiv Preprint arXiv:1308.3432,
August. http://arxiv.org/abs/1308.3432v1.
Ben-Nun, Tal, and Torsten Hoefler. 2019. “Demystifying Parallel and Dis-
tributed Deep Learning: An in-Depth Concurrency Analysis.” ACM Com-
puting Surveys 52 (4): 1–43. https://doi.org/10.1145/3320060.
Berger, Vance W., and YanYan Zhou. 2014. “Wiley StatsRef: Statistics Reference
Online.” Wiley Statsref: Statistics Reference Online. Wiley. https://doi.org/10
.1002/9781118445112.stat06558.
Bergstra, James, Olivier Breuleux, Frédéric Bastien, Pascal Lamblin, Razvan
Pascanu, Guillaume Desjardins, Joseph Turian, David Warde-Farley, and
Yoshua Bengio. 2010. “Theano: A CPU and GPU Math Compiler in Python.”
In Proceedings of the 9th Python in Science Conference, 4:18–24. 1. SciPy. https:
//doi.org/10.25080/majora-92bf1922-003.
Beyer, Lucas, Olivier J. Hénaff, Alexander Kolesnikov, Xiaohua Zhai, and Aäron
van den Oord. 2020. “Are We Done with ImageNet?” arXiv Preprint
arXiv:2006.07159, June. http://arxiv.org/abs/2006.07159v1.
Bhagoji, Arjun Nitin, Warren He, Bo Li, and Dawn Song. 2018. “Practical Black-
Box Attacks on Deep Neural Networks Using EfÏcient Query Mechanisms.”
In Computer Vision – ECCV 2018, 158–74. Springer International Publishing.
https://doi.org/10.1007/978-3-030-01258-8/_10.
References 1632
Bhamra, Ran, Adrian Small, Christian Hicks, and Olimpia Pilch. 2024. “Impact
Pathways: Geopolitics, Risk and Ethics in Critical Minerals Supply Chains.”
International Journal of Operations &Amp; Production Management, September.
https://doi.org/10.1108/ijopm-03-2024-0228.
Biggio, Battista, Blaine Nelson, and Pavel Laskov. 2012. “Poisoning Attacks
Against Support Vector Machines.” In Proceedings of the 29th International
Conference on Machine Learning, ICML 2012, Edinburgh, Scotland, UK, June 26 -
July 1, 2012. icml.cc / Omnipress. http://icml.cc/2012/papers/880.pdf.
Binkert, Nathan, Bradford Beckmann, Gabriel Black, Steven K. Reinhardt, Ali
Saidi, Arkaprava Basu, Joel Hestness, et al. 2011. “The Gem5 Simulator.”
ACM SIGARCH Computer Architecture News 39 (2): 1–7. https://doi.org/10
.1145/2024716.2024718.
Bishop, Christopher M. 2006. Pattern Recognition and Machine Learning. Springer.
Blackwood, Jayden, Frances C. Wright, Nicole J. Look Hong, and Anna R.
Gagliardi. 2019. “Quality of DCIS Information on the Internet: A Content
Analysis.” Breast Cancer Research and Treatment 177 (2): 295–305. https:
//doi.org/10.1007/s10549-019-05315-8.
Bolchini, Cristiana, Luca Cassano, Antonio Miele, and Alessandro Toschi. 2023.
“Fast and Accurate Error Simulation for CNNs Against Soft Errors.” IEEE
Transactions on Computers 72 (4): 984–97. https://doi.org/10.1109/tc.2022.
3184274.
Bommasani, Rishi, Drew A. Hudson, Ehsan Adeli, Russ Altman, Simran Arora,
Sydney von Arx, Michael S. Bernstein, et al. 2021. “On the Opportunities
and Risks of Foundation Models.” arXiv Preprint arXiv:2108.07258, August.
http://arxiv.org/abs/2108.07258v3.
Bouri, Elie. 2015. “A Broadened Causality in Variance Approach to Assess the
Risk Dynamics Between Crude Oil Prices and the Jordanian Stock Market.”
Energy Policy 85 (October): 271–79. https://doi.org/10.1016/j.enpol.2015.0
6.001.
Bourtoule, Lucas, Varun Chandrasekaran, Christopher A. Choquette-Choo,
Hengrui Jia, Adelin Travers, Baiwu Zhang, David Lie, and Nicolas Papernot.
2021. “Machine Unlearning.” In 2021 IEEE Symposium on Security and Privacy
(SP), 141–59. IEEE; IEEE. https://doi.org/10.1109/sp40001.2021.00019.
Bradbury, James, Roy Frostig, Peter Hawkins, Matthew James Johnson, Chris
Leary, Dougal Maclaurin, George Necula, et al. 2018. “JAX: Composable
Transformations of Python+NumPy Programs.” http://github.com/googl
e/jax.
Brain, Google. 2020. “XLA: Optimizing Compiler for Machine Learning.”
TensorFlow Blog. https://tensorflow.org/xla.
———. 2022. TensorFlow Documentation. https://www.tensorflow.org/.
Brakerski, Zvika et al. 2022. “Federated Learning and the Rise of Edge Intelli-
gence: Challenges and Opportunities.” Communications of the ACM 65 (8):
54–63.
Breck, Eric, Shanqing Cai, Eric Nielsen, Mohamed Salib, and D. Sculley. 2020.
“The ML Test Score: A Rubric for ML Production Readiness and Technical
Debt Reduction.” IEEE Transactions on Big Data 6 (2): 347–61.
Breier, Jakub, Xiaolu Hou, Dirmanto Jap, Lei Ma, Shivam Bhasin, and Yang Liu.
2018. “DeepLaser: Practical Fault Attack on Deep Neural Networks.” ArXiv
References 1633
Chen, Tianqi, Bing Xu, Chiyuan Zhang, and Carlos Guestrin. 2016. “Training
Deep Nets with Sublinear Memory Cost.” CoRR abs/1604.06174 (April).
http://arxiv.org/abs/1604.06174v2.
Chen, Wei-Yu, Yen-Cheng Liu, Zsolt Kira, Yu-Chiang Frank Wang, and Jia-Bin
Huang. 2019. “A Closer Look at Few-Shot Classification.” In International
Conference on Learning Representations (ICLR).
Chen, Yu-Hsin, Joel Emer, and Vivienne Sze. 2017. “Eyeriss: A Spatial Archi-
tecture for Energy-EfÏcient Dataflow for Convolutional Neural Networks.”
IEEE Micro, 1–1. https://doi.org/10.1109/mm.2017.265085944.
Chen, Yu-Hsin, Tushar Krishna, Joel S. Emer, and Vivienne Sze. 2016. “Ey-
eriss: A Spatial Architecture for Energy-EfÏcient Dataflow for Convolu-
tional Neural Networks.” IEEE Journal of Solid-State Circuits 51 (1): 186–98.
https://doi.org/10.1109/JSSC.2015.2488709.
Chen, Zitao, Guanpeng Li, Karthik Pattabiraman, and Nathan DeBardeleben.
2019. “<I>BinFI</i>: An EfÏcient Fault Injector for Safety-Critical Machine
Learning Systems.” In Proceedings of the International Conference for High
Performance Computing, Networking, Storage and Analysis, 1–23. SC ’19. New
York, NY, USA: ACM. https://doi.org/10.1145/3295500.3356177.
Chen, Zitao, Niranjhana Narayanan, Bo Fang, Guanpeng Li, Karthik Pattabira-
man, and Nathan DeBardeleben. 2020. “TensorFI: A Flexible Fault Injection
Framework for TensorFlow Applications.” In 2020 IEEE 31st International
Symposium on Software Reliability Engineering (ISSRE), 426–35. IEEE; IEEE.
https://doi.org/10.1109/issre5003.2020.00047.
Cheng, Eric, Shahrzad Mirkhani, Lukasz G. Szafaryn, Chen-Yong Cher, Hyung-
min Cho, Kevin Skadron, Mircea R. Stan, et al. 2016. “CLEAR: <U>c</u>
Ross <u>-l</u> Ayer <u>e</u> Xploration for <u>a</u> Rchitecting
<u>r</u> Esilience - Combining Hardware and Software Techniques to
Tolerate Soft Errors in Processor Cores.” In Proceedings of the 53rd Annual
Design Automation Conference, 1–6. ACM. https://doi.org/10.1145/2897937.
2897996.
Cheng, Yu et al. 2022. “Memory-EfÏcient Deep Learning: Advances in Model
Compression and Sparsification.” ACM Computing Surveys.
Cheshire, David. 2021. “Circular Economy and Sustainable AI: Designing Out
Waste in the Tech Industry.” In The Handbook to Building a Circular Economy,
48–61. RIBA Publishing. https://doi.org/10.4324/9781003212775-8.
Chetlur, Sharan, Cliff Woolley, Philippe Vandermersch, Jonathan Cohen, John
Tran, Bryan Catanzaro, and Evan Shelhamer. 2014. “cuDNN: EfÏcient
Primitives for Deep Learning.” arXiv Preprint arXiv:1410.0759, October.
http://arxiv.org/abs/1410.0759v3.
Cho, Kyunghyun, Bart van Merrienboer, Dzmitry Bahdanau, and Yoshua Ben-
gio. 2014. “On the Properties of Neural Machine Translation: Encoder-
Decoder Approaches.” In Eighth Workshop on Syntax, Semantics and Structure
in Statistical Translation (SSST-8), 103–11. Association for Computational
Linguistics.
Choi, Jungwook, Zhuo Wang, Swagath Venkataramani, Pierce I-Jen Chuang,
Vijayalakshmi Srinivasan, and Kailash Gopalakrishnan. 2018. “PACT: Pa-
rameterized Clipping Activation for Quantized Neural Networks.” arXiv
Preprint, May. http://arxiv.org/abs/1805.06085v2.
References 1636
Choi, Sebin, and Sungmin Yoon. 2024. “GPT-Based Data-Driven Urban Building
Energy Modeling (GPT-UBEM): Concept, Methodology, and Case Studies.”
Energy and Buildings 325 (December): 115042. https://doi.org/10.1016/j.en
build.2024.115042.
Chollet, François et al. 2015. “Keras.” GitHub Repository. https://github.com/f
chollet/keras.
Chollet, François. 2018. “Introduction to Keras.” March 9th.
Choquette, Jack. 2023. “NVIDIA Hopper H100 GPU: Scaling Performance.”
IEEE Micro 43 (3): 9–17. https://doi.org/10.1109/mm.2023.3256796.
Choudhary, Tejalal, Vipul Mishra, Anurag Goswami, and Jagannathan Saranga-
pani. 2020. “A Comprehensive Survey on Model Compression and Acceler-
ation.” Artificial Intelligence Review 53 (7): 5113–55. https://doi.org/10.100
7/s10462-020-09816-7.
Chowdhery, Aakanksha, Anatoli Noy, Gaurav Misra, Zhuyun Dai, Quoc V. Le,
and Jeff Dean. 2021. “Edge TPU: An Edge-Optimized Inference Accelerator
for Deep Learning.” In International Symposium on Computer Architecture.
Christiano, Paul F., Jan Leike, Tom B. Brown, Miljan Martic, Shane Legg, and
Dario Amodei. 2017. “Deep Reinforcement Learning from Human Pref-
erences.” In Advances in Neural Information Processing Systems 30: Annual
Conference on Neural Information Processing Systems 2017, December 4-9, 2017,
Long Beach, CA, USA, edited by Isabelle Guyon, Ulrike von Luxburg, Samy
Bengio, Hanna M. Wallach, Rob Fergus, S. V. N. Vishwanathan, and Roman
Garnett, 4299–4307. https://proceedings.neurips.cc/paper/2017/hash/d
5e2c0adad503c91f91df240d0cd4e49-Abstract.html.
Chu, Grace, Okan Arikan, Gabriel Bender, Weijun Wang, Achille Brighton,
Pieter-Jan Kindermans, Hanxiao Liu, Berkin Akin, Suyog Gupta, and An-
drew Howard. 2021. “Discovering Multi-Hardware Mobile Models via
Architecture Search.” In 2021 IEEE/CVF Conference on Computer Vision and
Pattern Recognition Workshops (CVPRW), 3016–25. IEEE. https://doi.org/10
.1109/cvprw53098.2021.00337.
Chung, Jae-Won, Yile Gu, Insu Jang, Luoxi Meng, Nikhil Bansal, and Mosharaf
Chowdhury. 2023. “Reducing Energy Bloat in Large Model Training.” ArXiv
Preprint abs/2312.06902 (December). http://arxiv.org/abs/2312.06902v3.
Ciez, Rebecca E., and J. F. Whitacre. 2019. “Examining Different Recycling
Processes for Lithium-Ion Batteries.” Nature Sustainability 2 (2): 148–56.
https://doi.org/10.1038/s41893-019-0222-5.
Coleman, Cody, Edward Chou, Julian Katz-Samuels, Sean Culatana, Peter Bailis,
Alexander C. Berg, Robert Nowak, Roshan Sumbaly, Matei Zaharia, and
I. Zeki Yalniz. 2022. “Similarity Search for EfÏcient Active Learning and
Search of Rare Concepts.” Proceedings of the AAAI Conference on Artificial
Intelligence 36 (6): 6402–10. https://doi.org/10.1609/aaai.v36i6.20591.
Commission, European. 2023. “Sustainable Digital Markets Act: Environmental
Transparency in AI.”
Contro, Filippo, Marco Crosara, Mariano Ceccato, and Mila Dalla Preda. 2021.
“EtherSolve: Computing an Accurate Control-Flow Graph from Ethereum
Bytecode.” arXiv Preprint arXiv:2103.09113, March. http://arxiv.org/abs/
2103.09113v1.
References 1637
Cooper, Tom, Suzanne Fallender, Joyann Pafumi, Jon Dettling, Sebastien Hum-
bert, and Lindsay Lessard. 2011. “A Semiconductor Company’s Examination
of Its Water Footprint Approach.” In Proceedings of the 2011 IEEE Interna-
tional Symposium on Sustainable Systems and Technology, 1–6. IEEE; IEEE.
https://doi.org/10.1109/issst.2011.5936865.
Cope, Gord. 2009. “Pure Water, Semiconductors and the Recession.” Global
Water Intelligence 10 (10).
Corporation, Intel. 2021. oneDNN: Intel’s Deep Learning Neural Network Library.
https://github.com/oneapi-src/oneDNN.
Corporation, NVIDIA. 2017. “GPU-Accelerated Machine Learning and Deep
Learning.” Technical Report.
———. 2021. NVIDIA cuDNN: GPU Accelerated Deep Learning. https://develo
per.nvidia.com/cudnn.
Corporation, Thinking Machines. 1992. CM-5 Technical Summary. Thinking
Machines Corporation.
Costa, Tiago, Chen Shi, Kevin Tien, and Kenneth L. Shepard. 2019. “A CMOS
2D Transmit Beamformer with Integrated PZT Ultrasound Transducers
for Neuromodulation.” In 2019 IEEE Custom Integrated Circuits Conference
(CICC), 1–4. IEEE. https://doi.org/10.1109/cicc.2019.8780236.
Courbariaux, Matthieu, Yoshua Bengio, and Jean-Pierre David. 2016. “Bina-
ryConnect: Training Deep Neural Networks with Binary Weights During
Propagations.” Advances in Neural Information Processing Systems (NeurIPS)
28: 3123–31.
Courbariaux, Matthieu, Itay Hubara, Daniel Soudry, Ran El-Yaniv, and Yoshua
Bengio. 2016. “Binarized Neural Networks: Training Deep Neural Net-
works with Weights and Activations Constrained to +1 or -1.” arXiv Preprint
arXiv:1602.02830, February. http://arxiv.org/abs/1602.02830v3.
Crankshaw, Daniel, Xin Wang, Guilio Zhou, Michael J Franklin, Joseph E Gon-
zalez, and Ion Stoica. 2017. “Clipper: A {Low-Latency} Online Prediction
Serving System.” In 14th USENIX Symposium on Networked Systems Design
and Implementation (NSDI 17), 613–27.
Cui, Hongyi, Jiajun Li, and Peng et al. Xie. 2019. “A Survey on Machine
Learning Compilers: Taxonomy, Challenges, and Future Directions.” ACM
Computing Surveys 52 (4): 1–39.
Curnow, H. J. 1976. “A Synthetic Benchmark.” The Computer Journal 19 (1):
43–49. https://doi.org/10.1093/comjnl/19.1.43.
Cybenko, G. 1992. “Approximation by Superpositions of a Sigmoidal Function.”
Mathematics of Control, Signals, and Systems 5 (4): 455–55. https://doi.org/
10.1007/bf02134016.
Dally, William J., Stephen W. Keckler, and David B. Kirk. 2021. “Evolution
of the Graphics Processing Unit (GPU).” IEEE Micro 41 (6): 42–51. https:
//doi.org/10.1109/mm.2021.3113475.
Dao, Tri, Beidi Chen, Nimit Sohoni, Arjun Desai, Michael Poli, Jessica Gro-
gan, Alexander Liu, Aniruddh Rao, Atri Rudra, and Christopher Ré. 2022.
“Monarch: Expressive Structured Matrices for EfÏcient and Accurate Train-
ing,” April. http://arxiv.org/abs/2204.00595v1.
David, Robert, Jared Duke, Advait Jain, Vijay Janapa Reddi, Nat Jeffries, Jian
Li, Nick Kreeger, et al. 2021. “Tensorflow Lite Micro: Embedded Machine
References 1638
Gou, Jianping, Baosheng Yu, Stephen J. Maybank, and Dacheng Tao. 2021.
“Knowledge Distillation: A Survey.” International Journal of Computer Vision
129 (6): 1789–819. https://doi.org/10.1007/s11263-021-01453-z.
Gräfe, Ralf, Qutub Syed Sha, Florian Geissler, and Michael Paulitsch. 2023.
“Large-Scale Application of Fault Injection into PyTorch Models -an Exten-
sion to PyTorchFI for Validation EfÏciency.” In 2023 53rd Annual IEEE/I-
FIP International Conference on Dependable Systems and Networks - Supplemen-
tal Volume (DSN-s), 56–62. IEEE; IEEE. https://doi.org/10.1109/dsn-
s58398.2023.00025.
Graphcore. 2020. “The Colossus MK2 IPU Processor.” Graphcore Technical Paper.
Groeneveld, Dirk, Iz Beltagy, Pete Walsh, Akshita Bhagia, Rodney Kinney,
Oyvind Tafjord, Ananya Harsh Jha, et al. 2024. “OLMo: Accelerating the
Science of Language Models.” arXiv Preprint arXiv:2402.00838, February.
http://arxiv.org/abs/2402.00838v4.
Grossman, Elizabeth. 2007. High Tech Trash: Digital Devices, Hidden Toxics, and
Human Health. Island press.
Gu, Ivy. 2023. “Deep Learning Model Compression (Ii) by Ivy Gu Medium.”
https://ivygdy.medium.com/deep-learning-model-compression-ii-
546352ea9453.
Gudivada, Venkat N., Dhana Rao Rao, et al. 2017. “Data Quality Considerations
for Big Data and Machine Learning: Going Beyond Data Cleaning and
Transformations.” IEEE Transactions on Knowledge and Data Engineering.
Gujarati, Arpan, Reza Karimi, Safya Alzayat, Wei Hao, Antoine Kaufmann,
Ymir Vigfusson, and Jonathan Mace. 2020. “Serving DNNs Like Clock-
work: Performance Predictability from the Bottom Up.” In 14th USENIX
Symposium on Operating Systems Design and Implementation (OSDI 20), 443–62.
https://www.usenix.org/conference/osdi20/presentation/gujarati.
Gulshan, Varun, Lily Peng, Marc Coram, Martin C. Stumpe, Derek Wu, Arunacha-
lam Narayanaswamy, Subhashini Venugopalan, et al. 2016. “Develop-
ment and Validation of a Deep Learning Algorithm for Detection of Dia-
betic Retinopathy in Retinal Fundus Photographs.” JAMA 316 (22): 2402.
https://doi.org/10.1001/jama.2016.17216.
Guo, Yutao, Hao Wang, Hui Zhang, Tong Liu, Zhaoguang Liang, Yunlong
Xia, Li Yan, et al. 2019. “Mobile Photoplethysmographic Technology to
Detect Atrial Fibrillation.” Journal of the American College of Cardiology 74 (19):
2365–75. https://doi.org/10.1016/j.jacc.2019.08.019.
Gupta, Suyog, Ankur Agrawal, Kailash Gopalakrishnan, and Pritish Narayanan.
2015. “Deep Learning with Limited Numerical Precision.” In International
Conference on Machine Learning, 1737–46. PMLR.
Gupta, Udit, Mariam Elgamal, Gage Hills, Gu-Yeon Wei, Hsien-Hsin S. Lee,
David Brooks, and Carole-Jean Wu. 2022. “ACT: Designing Sustainable
Computer Systems with an Architectural Carbon Modeling Tool.” In Pro-
ceedings of the 49th Annual International Symposium on Computer Architecture,
784–99. ACM. https://doi.org/10.1145/3470496.3527408.
Gupta, Udit, Young Geun Kim, Sylvia Lee, Jordan Tse, Hsien-Hsin S Lee, Gu-
Yeon Wei, David Brooks, and Carole-Jean Wu. 2022. “Chasing Carbon: The
Elusive Environmental Footprint of Computing.” IEEE Micro 42 (6): 68–78.
https://doi.org/10.1109/MM.2022.3186575.
References 1644
Hinton, Geoffrey, Oriol Vinyals, and Jeff Dean. 2015a. “Distilling the Knowledge
in a Neural Network.” arXiv Preprint arXiv:1503.02531, March. http://arxiv.
org/abs/1503.02531v1.
———. 2015b. “Distilling the Knowledge in a Neural Network.” arXiv Preprint
arXiv:1503.02531, March. http://arxiv.org/abs/1503.02531v1.
Hirschberg, Julia, and Christopher D. Manning. 2015. “Advances in Natural
Language Processing.” Science 349 (6245): 261–66. https://doi.org/10.112
6/science.aaa8685.
Hochreiter, Sepp. 1998. “The Vanishing Gradient Problem During Learning
Recurrent Neural Nets and Problem Solutions.” International Journal of
Uncertainty, Fuzziness and Knowledge-Based Systems 06 (02): 107–16. https:
//doi.org/10.1142/s0218488598000094.
Hochreiter, Sepp, and Jürgen Schmidhuber. 1997. “Long Short-Term Memory.”
Neural Computation 9 (8): 1735–80. https://doi.org/10.1162/neco.1997.9.8.
1735.
Hoefler, Torsten, Dan Alistarh, Tal Ben-Nun, Nikoli Dryden, and Alexandra
Peste. 2021. “Sparsity in Deep Learning: Pruning and Growth for EfÏcient
Inference and Training in Neural Networks.” arXiv Preprint arXiv:2102.00554
22 (January): 1–124. http://arxiv.org/abs/2102.00554v1.
Hoefler, Torsten, Dan Alistarh, Tal Ben-Nun, Nikoli Dryden, and Alexandros
Nikolaos Ziogas. 2021. “Sparsity in Deep Learning: Pruning and Growth
for EfÏcient Inference and Training in Neural Networks.” Journal of Machine
Learning Research 22 (241): 1–124.
Hoffmann, Jordan, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya,
Trevor Cai, Eliza Rutherford, Diego de Las Casas, et al. 2022. “Training
Compute-Optimal Large Language Models.” arXiv Preprint arXiv:2203.15556,
March. http://arxiv.org/abs/2203.15556v1.
Hornik, Kurt, Maxwell Stinchcombe, and Halbert White. 1989. “Multilayer
Feedforward Networks Are Universal Approximators.” Neural Networks 2
(5): 359–66. https://doi.org/10.1016/0893-6080(89)90020-8.
Horowitz, Mark. 2014. “1.1 Computing’s Energy Problem (and What We Can
Do about It).” In 2014 IEEE International Solid-State Circuits Conference Digest
of Technical Papers (ISSCC). IEEE. https://doi.org/10.1109/isscc.2014.67573
23.
Hosseini, Hossein, Sreeram Kannan, Baosen Zhang, and Radha Poovendran.
2017. “Deceiving Google’s Perspective API Built for Detecting Toxic Com-
ments.” ArXiv Preprint abs/1702.08138 (February). http://arxiv.org/abs/
1702.08138v1.
Houlsby, Neil, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Chloé
de Laroussilhe, Andrea Gesmundo, Mohammad Attariyan, and Sylvain
Gelly. 2019. “Parameter-EfÏcient Transfer Learning for NLP.” In International
Conference on Machine Learning, 2790–99. PMLR.
Howard, Andrew G., Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Wei-
jun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. 2017a.
“MobileNets: EfÏcient Convolutional Neural Networks for Mobile Vision
Applications,” April. http://arxiv.org/abs/1704.04861v1.
References 1647
Jones, Nicholas P., Mark Johnson, and Claire Montgomery. 2021. “The Envi-
ronmental Impact of Data Centers: Challenges and Sustainable Solutions.”
Energy Reports 7: 4381–92.
Jordan, T. L. 1982. “A Guide to Parallel Computation and Some Cray-1 Experi-
ences.” In Parallel Computations, 1–50. Elsevier. https://doi.org/10.1016/b9
78-0-12-592101-5.50006-3.
Joulin, Armand, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov. 2017.
“Bag of Tricks for EfÏcient Text Classification.” In Proceedings of the 15th Con-
ference of the European Chapter of the Association for Computational Linguistics:
Volume 2, Short Papers, 18:1–42. Association for Computational Linguistics.
https://doi.org/10.18653/v1/e17-2068.
Jouppi, Norman P. et al. 2017. “In-Datacenter Performance Analysis of a Tensor
Processing Unit.” Proceedings of the 44th Annual International Symposium on
Computer Architecture (ISCA).
Jouppi, Norman P., Doe Hyun Yoon, Matthew Ashcraft, Mark Gottscho, Thomas
B. Jablin, George Kurian, James Laudon, et al. 2021b. “Ten Lessons from
Three Generations Shaped Google’s TPUv4i : Industrial Product.” In 2021
ACM/IEEE 48th Annual International Symposium on Computer Architecture
(ISCA), 64:1–14. 5. IEEE. https://doi.org/10.1109/isca52012.2021.00010.
———, et al. 2021a. “Ten Lessons from Three Generations Shaped Google’s
TPUv4i : Industrial Product.” In 2021 ACM/IEEE 48th Annual International
Symposium on Computer Architecture (ISCA), 1–14. IEEE. https://doi.org/10
.1109/isca52012.2021.00010.
Jouppi, Norman P., Doe Hyun Yoon, George Kurian, Sheng Li, Nishant Patil,
James Laudon, Cliff Young, and David Patterson. 2020. “A Domain-Specific
Supercomputer for Training Deep Neural Networks.” Communications of the
ACM 63 (7): 67–78. https://doi.org/10.1145/3360307.
Jouppi, Norman P., Cliff Young, Nishant Patil, David Patterson, Gaurav Agrawal,
Raminder Bajwa, Sarah Bates, et al. 2017a. “In-Datacenter Performance
Analysis of a Tensor Processing Unit.” In Proceedings of the 44th Annual
International Symposium on Computer Architecture, 1–12. ACM. https://doi.
org/10.1145/3079856.3080246.
———, et al. 2017c. “In-Datacenter Performance Analysis of a Tensor Processing
Unit.” In Proceedings of the 44th Annual International Symposium on Computer
Architecture, 1–12. ACM. https://doi.org/10.1145/3079856.3080246.
———, et al. 2017b. “In-Datacenter Performance Analysis of a Tensor Process-
ing Unit.” In Proceedings of the 44th Annual International Symposium on Com-
puter Architecture, 1–12. ACM. https://doi.org/10.1145/3079856.3080246.
Joye, Marc, and Michael Tunstall. 2012. Fault Analysis in Cryptography. Springer
Berlin Heidelberg. https://doi.org/10.1007/978-3-642-29656-7.
Kannan, Harish, Pradeep Dubey, and Mark Horowitz. 2023. “Chiplet-Based
Architectures: The Future of AI Accelerators.” IEEE Micro 43 (1): 46–55.
https://doi.org/10.1109/MM.2022.1234567.
Kaplan, Jared, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin
Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario
Amodei. 2020. “Scaling Laws for Neural Language Models.” ArXiv Preprint
abs/2001.08361 (January). http://arxiv.org/abs/2001.08361v1.
References 1651
Kawazoe Aguilera, Marcos, Wei Chen, and Sam Toueg. 1997. “Heartbeat:
A Timeout-Free Failure Detector for Quiescent Reliable Communication.”
In Distributed Algorithms, 126–40. Springer; Springer Berlin Heidelberg.
https://doi.org/10.1007/bfb0030680.
Kiela, Douwe, Max Bartolo, Yixin Nie, Divyansh Kaushik, Atticus Geiger,
Zhengxuan Wu, Bertie Vidgen, et al. 2021. “Dynabench: Rethinking Bench-
marking in NLP.” In Proceedings of the 2021 Conference of the North Ameri-
can Chapter of the Association for Computational Linguistics: Human Language
Technologies, 9:418–34. Online: Association for Computational Linguistics.
https://doi.org/10.18653/v1/2021.naacl-main.324.
Kim, Jungrae, Michael Sullivan, and Mattan Erez. 2015. “Bamboo ECC: Strong,
Safe, and Flexible Codes for Reliable Computer Memory.” In 2015 IEEE 21st
International Symposium on High Performance Computer Architecture (HPCA),
101–12. IEEE; IEEE. https://doi.org/10.1109/hpca.2015.7056025.
Kim, Sunju, Chungsik Yoon, Seunghon Ham, Jihoon Park, Ohun Kwon, Donguk
Park, Sangjun Choi, Seungwon Kim, Kwonchul Ha, and Won Kim. 2018.
“Chemical Use in the Semiconductor Manufacturing Industry.” International
Journal of Occupational and Environmental Health 24 (3-4): 109–18. https:
//doi.org/10.1080/10773525.2018.1519957.
Kingma, Diederik P., and Jimmy Ba. 2014. “Adam: A Method for Stochastic
Optimization.” ICLR, December. http://arxiv.org/abs/1412.6980v9.
Kirkpatrick, James, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume
Desjardins, Andrei A. Rusu, Kieran Milan, et al. 2017. “Overcoming Catas-
trophic Forgetting in Neural Networks.” Proceedings of the National Academy
of Sciences 114 (13): 3521–26. https://doi.org/10.1073/pnas.1611835114.
Kleppmann, Martin. 2016. Designing Data-Intensive Applications: The Big Ideas
Behind Reliable, Scalable, and Maintainable Systems. O’Reilly Media. http:
//shop.oreilly.com/product/0636920032175.do.
Ko, Yohan. 2021. “Characterizing System-Level Masking Effects Against Soft
Errors.” Electronics 10 (18): 2286. https://doi.org/10.3390/electronics10182
286.
Kocher, Paul, Jann Horn, Anders Fogh, Daniel Genkin, Daniel Gruss, Werner
Haas, Mike Hamburg, et al. 2019b. “Spectre Attacks: Exploiting Speculative
Execution.” In 2019 IEEE Symposium on Security and Privacy (SP), 1–19. IEEE.
https://doi.org/10.1109/sp.2019.00002.
———, et al. 2019a. “Spectre Attacks: Exploiting Speculative Execution.”
In 2019 IEEE Symposium on Security and Privacy (SP), 1–19. IEEE. https:
//doi.org/10.1109/sp.2019.00002.
Kocher, Paul, Joshua Jaffe, and Benjamin Jun. 1999. “Differential Power Analy-
sis.” In Advances in Cryptology — CRYPTO’ 99, 388–97. Springer; Springer
Berlin Heidelberg. https://doi.org/10.1007/3-540-48405-1/_25.
Kocher, Paul, Joshua Jaffe, Benjamin Jun, and Pankaj Rohatgi. 2011. “Introduc-
tion to Differential Power Analysis.” Journal of Cryptographic Engineering 1
(1): 5–27. https://doi.org/10.1007/s13389-011-0006-y.
Koh, Pang Wei, Thao Nguyen, Yew Siang Tang, Stephen Mussmann, Emma
Pierson, Been Kim, and Percy Liang. 2020. “Concept Bottleneck Models.” In
Proceedings of the 37th International Conference on Machine Learning, ICML 2020,
References 1652
Lin, Jiong, Qing Gao, Yungui Gong, Yizhou Lu, Chao Zhang, and Fengge Zhang.
2020. “Primordial Black Holes and Secondary Gravitational Waves from
k/g Inflation.” arXiv Preprint arXiv:2001.05909, January. http://arxiv.org/
abs/2001.05909v2.
Lin, Ji, Jiaming Tang, Haotian Tang, Shang Yang, Wei-Ming Chen, Wei-Chen
Wang, Guangxuan Xiao, Xingyu Dang, Chuang Gan, and Song Han. 2023.
“AWQ: Activation-Aware Weight Quantization for LLM Compression and
Acceleration.” arXiv Preprint arXiv:2306.00978 abs/2306.00978 (June). http:
//arxiv.org/abs/2306.00978v5.
Lin, Ji, Ligeng Zhu, Wei-Ming Chen, Wei-Chen Wang, and Song Han. 2023.
“Tiny Machine Learning: Progress and Futures [Feature].” IEEE Circuits and
Systems Magazine 23 (3): 8–34. https://doi.org/10.1109/mcas.2023.3302182.
Lin, Tsung-Yi, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva
Ramanan, Piotr Dollár, and C. Lawrence Zitnick. 2014. “Microsoft COCO:
Common Objects in Context.” In Computer Vision – ECCV 2014, 740–55.
Springer; Springer International Publishing. https://doi.org/10.1007/978-
3-319-10602-1/_48.
Lindgren, Simon. 2023. Handbook of Critical Studies of Artificial Intelligence.
Edward Elgar Publishing.
Lindholm, Andreas, Dave Zachariah, Petre Stoica, and Thomas B. Schon. 2019.
“Data Consistency Approach to Model Validation.” IEEE Access 7: 59788–96.
https://doi.org/10.1109/access.2019.2915109.
Lindholm, Erik, John Nickolls, Stuart Oberman, and John Montrym. 2008.
“NVIDIA Tesla: A Unified Graphics and Computing Architecture.” IEEE
Micro 28 (2): 39–55. https://doi.org/10.1109/mm.2008.31.
Liu, Chen, Guillaume Bellec, Bernhard Vogginger, David Kappel, Johannes
Partzsch, Felix Neumärker, Sebastian Höppner, et al. 2018. “Memory-
EfÏcient Deep Learning on a SpiNNaker 2 Prototype.” Frontiers in Neu-
roscience 12 (November): 840. https://doi.org/10.3389/fnins.2018.00840.
Liu, Yanan, Xiaoxia Wei, Jinyu Xiao, Zhijie Liu, Yang Xu, and Yun Tian. 2020.
“Energy Consumption and Emission Mitigation Prediction Based on Data
Center TrafÏc and PUE for Global Data Centers.” Global Energy Interconnec-
tion 3 (3): 272–82. https://doi.org/10.1016/j.gloei.2020.07.008.
Liu, Yingcheng, Guo Zhang, Christopher G. Tarolli, Rumen Hristov, Stella
Jensen-Roberts, Emma M. Waddell, Taylor L. Myers, et al. 2022. “Monitoring
Gait at Home with Radio Waves in Parkinson’s Disease: A Marker of Severity,
Progression, and Medication Response.” Science Translational Medicine 14
(663): eadc9669. https://doi.org/10.1126/scitranslmed.adc9669.
Lopez-Paz, David, and Marc’Aurelio Ranzato. 2017. “Gradient Episodic Mem-
ory for Continual Learning.” In NIPS, 30:6467–76. https://proceeding
s.neurips.cc/paper/2017/hash/f87522788a2be2d171666752f97ddebb-
Abstract.html.
Lu, Yucheng, Shivani Agrawal, Suvinay Subramanian, Oleg Rybakov, Christo-
pher De Sa, and Amir Yazdanbakhsh. 2023. “STEP: Learning n:m Struc-
tured Sparsity Masks from Scratch with Precondition,” February. http:
//arxiv.org/abs/2302.01172v1.
Luna, William Fernando Martı́nez. 2018a. “CONSUMER PROTECTION AGAINST
PLANNED OBSOLESCENCE. AN INTERNATIONAL PRIVATE LAW ANAL-
References 1656
McMahan, H Brendan, Eider Moore, Daniel Ramage, Seth Hampson, et al. 2017.
“Communication-EfÏcient Learning of Deep Networks from Decentralized
Data.” In Proceedings of the 20th International Conference on Artificial Intelligence
and Statistics (AISTATS), 1273–82.
Mellempudi, Naveen, Sudarshan Srinivasan, Dipankar Das, and Bharat Kaul.
2019. “Mixed Precision Training with 8-Bit Floating Point.” arXiv Preprint
arXiv:1905.12334, May. http://arxiv.org/abs/1905.12334v1.
Merity, Stephen, Caiming Xiong, James Bradbury, and Richard Socher. 2016.
“Pointer Sentinel Mixture Models.” arXiv Preprint arXiv:1609.07843, Septem-
ber. http://arxiv.org/abs/1609.07843v1.
Micikevicius, Paulius, Sharan Narang, Jonah Alben, Gregory Diamos, Erich
Elsen, David Garcia, Boris Ginsburg, et al. 2017b. “Mixed Precision Train-
ing.” arXiv Preprint arXiv:1710.03740, October. http://arxiv.org/abs/1710
.03740v3.
———, et al. 2017a. “Mixed Precision Training.” arXiv Preprint arXiv:1710.03740,
October. http://arxiv.org/abs/1710.03740v3.
Micikevicius, Paulius, Dusan Stosic, Neil Burgess, Marius Cornea, Pradeep
Dubey, Richard Grisenthwaite, Sangwon Ha, et al. 2022. “FP8 Formats for
Deep Learning.” arXiv Preprint arXiv:2209.05433, September. http://arxiv.
org/abs/2209.05433v2.
Miller, Charlie. 2019. “Lessons Learned from Hacking a Car.” IEEE Design
&Amp; Test 36 (6): 7–9. https://doi.org/10.1109/mdat.2018.2863106.
Miller, Charlie, and Chris Valasek. 2015. “The Antivirus Hacker’s Handbook.”
Black Hat USA. Wiley. https://doi.org/10.1002/9781119183525.ch15.
Mills, Andrew, and Stephen Le Hunte. 1997. “An Overview of Semiconductor
Photocatalysis.” Journal of Photochemistry and Photobiology A: Chemistry 108
(1): 1–35. https://doi.org/10.1016/s1010-6030(97)00118-4.
Mirhoseini, Azalia, et al. 2017. “Device Placement Optimization with Reinforce-
ment Learning.” International Conference on Machine Learning (ICML).
Mohanram, K., and N. A. Touba. n.d. “Partial Error Masking to Reduce Soft
Error Failure Rate in Logic Circuits.” In Proceedings. 16th IEEE Symposium
on Computer Arithmetic, 433–40. IEEE; IEEE Comput. Soc. https://doi.org/
10.1109/dftvs.2003.1250141.
Moore, Gordon. 2021. “Cramming More Components onto Integrated Circuits
(1965).” In Ideas That Created the Future, 261–66. The MIT Press. https:
//doi.org/10.7551/mitpress/12274.003.0027.
Mukherjee, S. S., J. Emer, and S. K. Reinhardt. n.d. “The Soft Error Problem:
An Architectural Perspective.” In 11th International Symposium on High-
Performance Computer Architecture, 243–47. IEEE; IEEE. https://doi.org/10.1
109/hpca.2005.37.
Nagel, Markus, Marios Fournarakis, Rana Ali Amjad, Yelysei Bondarenko,
Mart van Baalen, and Tijmen Blankevoort. 2021b. “A White Paper on
Neural Network Quantization.” arXiv Preprint arXiv:2106.08295, June. http:
//arxiv.org/abs/2106.08295v1.
———. 2021a. “A White Paper on Neural Network Quantization.” arXiv
Preprint arXiv:2106.08295, June. http://arxiv.org/abs/2106.08295v1.
Narang, Sharan, Hyung Won Chung, Yi Tay, Liam Fedus, Thibault Fevry,
Michael Matena, Karishma Malkan, et al. 2021. “Do Transformer Modifications Transfer Across Implementations and Applications?”
Parrish, Alicia, Hannah Rose Kirk, Jessica Quaye, Charvi Rastogi, Max Bartolo,
Oana Inel, Juan Ciro, et al. 2023. “Adversarial Nibbler: A Data-Centric
Challenge for Improving the Safety of Text-to-Image Models.” ArXiv Preprint
abs/2305.14384 (May). http://arxiv.org/abs/2305.14384v1.
Paszke, Adam, Sam Gross, Francisco Massa, et al. 2019. “PyTorch: An
Imperative Style, High-Performance Deep Learning Library.” Advances in
Neural Information Processing Systems (NeurIPS) 32: 8026–37.
Patel, Paresh D., Absar Lakdawala, Sajan Chourasia, and Rajesh N. Patel. 2016.
“Bio Fuels for Compression Ignition Engine: A Review on Engine Perfor-
mance, Emission and Life Cycle Analysis.” Renewable and Sustainable Energy
Reviews 65 (November): 24–43. https://doi.org/10.1016/j.rser.2016.06.010.
Patterson, David A., and John L. Hennessy. 2021a. Computer Architecture: A
Quantitative Approach. 6th ed. Morgan Kaufmann.
———. 2021b. Computer Organization and Design RISC-v Edition: The Hardware
Software Interface. 2nd ed. San Francisco, CA: Morgan Kaufmann.
———. 2021c. Computer Organization and Design: The Hardware/Software Interface.
5th ed. Morgan Kaufmann.
Patterson, David, Joseph Gonzalez, Urs Holzle, Quoc Le, Chen Liang, Lluis-
Miquel Munguia, Daniel Rothchild, David R. So, Maud Texier, and Jeff Dean.
2022. “The Carbon Footprint of Machine Learning Training Will Plateau,
Then Shrink.” Computer 55 (7): 18–28. https://doi.org/10.1109/mc.2022.31
48714.
Patterson, David, Joseph Gonzalez, Quoc Le, Chen Liang, Lluis-Miquel Munguia,
Daniel Rothchild, David So, Maud Texier, and Jeff Dean. 2021a. “Carbon
Emissions and Large Neural Network Training.” arXiv Preprint arXiv:2104.10350,
April. http://arxiv.org/abs/2104.10350v3.
———. 2021b. “Carbon Emissions and Large Neural Network Training.” arXiv
Preprint arXiv:2104.10350, April. http://arxiv.org/abs/2104.10350v3.
Patterson, David, Joseph Gonzalez, Quoc Le, Maud Texier, and Jeff Dean. 2022.
“Carbon-Aware Computing for Sustainable AI.” Communications of the ACM
65 (11): 50–58.
Penedo, Guilherme, Hynek Kydlíček, Loubna Ben Allal, Anton Lozhkov, Mar-
garet Mitchell, Colin Raffel, Leandro Von Werra, and Thomas Wolf. 2024.
“The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale.”
arXiv Preprint arXiv:2406.17557, June. http://arxiv.org/abs/2406.17557v2.
Peters, Dorian, Rafael A. Calvo, and Richard M. Ryan. 2018. “Designing for
Motivation, Engagement and Wellbeing in Digital Experience.” Frontiers in
Psychology 9 (May): 797. https://doi.org/10.3389/fpsyg.2018.00797.
Phillips, P. Jonathon, Carina A. Hahn, Peter C. Fontana, David A. Broniatowski,
and Mark A. Przybocki. 2020. “Four Principles of Explainable Artificial
Intelligence.” Gaithersburg, Maryland. National Institute of Standards; Tech-
nology (NIST). https://doi.org/10.6028/nist.ir.8312-draft.
Pineau, Joelle, Philippe Vincent-Lamarre, Koustuv Sinha, Vincent Larivière,
Alina Beygelzimer, Florence d’Alché-Buc, Emily Fox, and Hugo Larochelle.
2021. “Improving Reproducibility in Machine Learning Research (a Re-
port from the Neurips 2019 Reproducibility Program).” Journal of Machine
Learning Research 22 (164): 1–20.
Reagen, Brandon, Udit Gupta, Lillian Pentecost, Paul Whatmough, Sae Kyu
Lee, Niamh Mulholland, David Brooks, and Gu-Yeon Wei. 2018. “Ares: A
Framework for Quantifying the Resilience of Deep Neural Networks.” In
2018 55th ACM/ESDA/IEEE Design Automation Conference (DAC), 1–6. IEEE.
https://doi.org/10.1109/dac.2018.8465834.
Real, Esteban, Alok Aggarwal, Yanping Huang, and Quoc V. Le. 2019b. “Regu-
larized Evolution for Image Classifier Architecture Search.” Proceedings
of the AAAI Conference on Artificial Intelligence 33 (01): 4780–89. https:
//doi.org/10.1609/aaai.v33i01.33014780.
———. 2019a. “Regularized Evolution for Image Classifier Architecture Search.”
Proceedings of the AAAI Conference on Artificial Intelligence 33 (01): 4780–89.
https://doi.org/10.1609/aaai.v33i01.33014780.
Rebuffi, Sylvestre-Alvise, Hakan Bilen, and Andrea Vedaldi. 2017. “Learning
Multiple Visual Domains with Residual Adapters.” In Advances in Neural
Information Processing Systems. Vol. 30.
Reddi, Vijay Janapa, Christine Cheng, David Kanter, Peter Mattson, Guenther
Schmuelling, Carole-Jean Wu, Brian Anderson, et al. 2019. “MLPerf In-
ference Benchmark.” arXiv Preprint arXiv:1911.02549, November, 446–59.
https://doi.org/10.1109/isca45697.2020.00045.
Reddi, Vijay Janapa, and Meeta Sharma Gupta. 2013. Resilient Architecture
Design for Voltage Variation. Springer International Publishing. https://doi.
org/10.1007/978-3-031-01739-1.
Reis, G. A., J. Chang, N. Vachharajani, R. Rangan, and D. I. August. n.d. “SWIFT:
Software Implemented Fault Tolerance.” In International Symposium on Code
Generation and Optimization, 243–54. IEEE; IEEE. https://doi.org/10.1109/
cgo.2005.34.
Research, Microsoft. 2021. DeepSpeed: Extreme-Scale Model Training for Everyone.
Ribeiro, Marco Tulio, Sameer Singh, and Carlos Guestrin. 2016. “‘Why Should
I Trust You?’: Explaining the Predictions of Any Classifier.” In Proceedings
of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and
Data Mining, 1135–44.
Richter, Joel D., and Xinyu Zhao. 2021. “The Molecular Biology of FMRP: New
Insights into Fragile X Syndrome.” Nature Reviews Neuroscience 22 (4): 209–22.
https://doi.org/10.1038/s41583-021-00432-0.
Robertson, J., and M. Riley. 2018. “The Big Hack: How China Used a Tiny Chip
to Infiltrate U.S. Companies.” Bloomberg. https://www.bloomberg.com/news/features/2018-10-04/the-big-hack-how-china-used-a-tiny-chip-to-infiltrate-america-s-top-companies.
Rodio, Angelo, and Giovanni Neglia. 2024. “FedStale: Leveraging Stale Client
Updates in Federated Learning,” May. http://arxiv.org/abs/2405.04171v1.
Rolnick, David, Arun Ahuja, Jonathan Schwarz, Timothy Lillicrap, and Greg
Wayne. 2019. “Experience Replay for Continual Learning.” In Advances in
Neural Information Processing Systems (NeurIPS).
Rombach, Robin, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Bjorn
Ommer. 2022. “High-Resolution Image Synthesis with Latent Diffusion
Models.” In 2022 IEEE/CVF Conference on Computer Vision and Pattern Recog-
nition (CVPR), 10674–85. IEEE. https://doi.org/10.1109/cvpr52688.2022.0
1042.
Shan, Shawn, Wenxin Ding, Josephine Passananti, Stanley Wu, Haitao Zheng,
and Ben Y. Zhao. 2023. “Nightshade: Prompt-Specific Poisoning Attacks on
Text-to-Image Generative Models.” ArXiv Preprint abs/2310.13828 (October).
http://arxiv.org/abs/2310.13828v3.
Shang, J., G. Wang, and Y. Liu. 2018. “Accelerating Genomic Data Analysis
with Domain-Specific Architectures.” IEEE Transactions on Computers 67 (7):
965–78. https://doi.org/10.1109/TC.2018.2799212.
Sharma, Amit. 2020. “Industrial AI and Vendor Lock-in: The Hidden Costs of
Proprietary Ecosystems.” AI and Industry Review 8 (3): 55–70.
Shazeer, Noam, Youlong Cheng, Niki Parmar, Dustin Tran, Ashish Vaswani, Pen-
porn Koanantakool, Peter Hawkins, et al. 2018. “Mesh-TensorFlow: Deep
Learning for Supercomputers.” arXiv Preprint arXiv:1811.02084, November.
http://arxiv.org/abs/1811.02084v1.
Shazeer, Noam, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc
Le, Geoffrey Hinton, and Jeff Dean. 2017. “Outrageously Large Neural
Networks: The Sparsely-Gated Mixture-of-Experts Layer.” arXiv Preprint
arXiv:1701.06538, January. http://arxiv.org/abs/1701.06538v1.
Shazeer, Noam, Azalia Mirhoseini, Krzysztof Maziarz, et al. 2017. “Outrageously
Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer.” In
International Conference on Learning Representations.
Sheaffer, Jeremy W., David P. Luebke, and Kevin Skadron. 2007. “A Hardware
Redundancy and Recovery Mechanism for Reliable Scientific Computation
on Graphics Processors.” In Graphics Hardware, 2007:55–64. Citeseer. https:
//doi.org/10.2312/EGGH/EGGH07/055-064.
Shen, Sheng, Zhen Dong, Jiayu Ye, Linjian Ma, Zhewei Yao, Amir Gholami,
Michael W. Mahoney, and Kurt Keutzer. 2019. “Q-BERT: Hessian Based Ul-
tra Low Precision Quantization of BERT.” Proceedings of the AAAI Conference
on Artificial Intelligence 34 (05): 8815–21. https://doi.org/10.1609/aaai.v34
i05.6409.
Sheng, Victor S., and Jing Zhang. 2019. “Machine Learning with Crowd-
sourcing: A Brief Summary of the Past Research and Future Directions.”
Proceedings of the AAAI Conference on Artificial Intelligence 33 (01): 9837–43.
https://doi.org/10.1609/aaai.v33i01.33019837.
Shneiderman, Ben. 2020. “Bridging the Gap Between Ethics and Practice:
Guidelines for Reliable, Safe, and Trustworthy Human-Centered AI Sys-
tems.” ACM Transactions on Interactive Intelligent Systems 10 (4): 1–31. https:
//doi.org/10.1145/3419764.
———. 2022. Human-Centered AI. Oxford University Press.
Shoeybi, Mohammad, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared
Casper, and Bryan Catanzaro. 2019a. “Megatron-LM: Training Multi-Billion
Parameter Language Models Using Model Parallelism.” arXiv Preprint
arXiv:1909.08053, September. http://arxiv.org/abs/1909.08053v4.
———. 2019b. “Megatron-LM: Training Multi-Billion Parameter Language
Models Using Model Parallelism.” arXiv Preprint arXiv:1909.08053, Septem-
ber. http://arxiv.org/abs/1909.08053v4.
Shokri, Reza, Marco Stronati, Congzheng Song, and Vitaly Shmatikov. 2017.
“Membership Inference Attacks Against Machine Learning Models.” In
2017 IEEE Symposium on Security and Privacy (SP), 3–18. IEEE; IEEE. https:
//doi.org/10.1109/sp.2017.41.
Singh, Narendra, and Oladele A. Ogunseitan. 2022. “Disentangling the World-
wide Web of e-Waste and Climate Change Co-Benefits.” Circular Economy 1
(2): 100011. https://doi.org/10.1016/j.cec.2022.100011.
Skorobogatov, Sergei. 2009. “Local Heating Attacks on Flash Memory Devices.”
In 2009 IEEE International Workshop on Hardware-Oriented Security and Trust,
1–6. IEEE; IEEE. https://doi.org/10.1109/hst.2009.5225028.
Skorobogatov, Sergei P., and Ross J. Anderson. 2003. “Optical Fault Induction
Attacks.” In Cryptographic Hardware and Embedded Systems - CHES 2002, 2–12.
Springer; Springer Berlin Heidelberg. https://doi.org/10.1007/3-540-
36400-5_2.
Slade, Giles. 2007. Made to Break: Technology and Obsolescence in America. Har-
vard University Press. https://doi.org/10.4159/9780674043756.
Smith, Steven W. 1997. The Scientist and Engineer’s Guide to Digital Signal Process-
ing. California Technical Publishing. https://www.dspguide.com/.
Sodani, Avinash. 2015. “Knights Landing (KNL): 2nd Generation Intel® Xeon
Phi Processor.” In 2015 IEEE Hot Chips 27 Symposium (HCS), 1–24. IEEE.
https://doi.org/10.1109/hotchips.2015.7477467.
Sokolova, Marina, and Guy Lapalme. 2009. “A Systematic Analysis of Per-
formance Measures for Classification Tasks.” Information Processing &
Management 45 (4): 427–37. https://doi.org/10.1016/j.ipm.2009.03.002.
Stahel, Walter R. 2016. “The Circular Economy.” Nature 531 (7595): 435–38.
https://doi.org/10.1038/531435a.
Statista. 2022. “Number of Internet of Things (IoT) Connected Devices World-
wide from 2019 to 2030.” https://www.statista.com/statistics/802690/w
orldwide-connected-devices-by-access-technology/.
Stephens, Nigel, Stuart Biles, Matthias Boettcher, Jacob Eapen, Mbou Eyole,
Giacomo Gabrielli, Matt Horsnell, et al. 2017. “The ARM Scalable Vector
Extension.” IEEE Micro 37 (2): 26–39. https://doi.org/10.1109/mm.2017.35.
Strassen, Volker. 1969. “Gaussian Elimination Is Not Optimal.” Numerische
Mathematik 13 (4): 354–56. https://doi.org/10.1007/bf02165411.
Strickland, Eliza. 2019. “IBM Watson, Heal Thyself: How IBM Overpromised
and Underdelivered on AI Health Care.” IEEE Spectrum 56 (4): 24–31. https:
//doi.org/10.1109/mspec.2019.8678513.
Strubell, Emma, Ananya Ganesh, and Andrew McCallum. 2019a. “Energy
and Policy Considerations for Deep Learning in NLP.” In Proceedings of the
57th Annual Meeting of the Association for Computational Linguistics, 3645–50.
Association for Computational Linguistics. https://doi.org/10.18653/v1/
p19-1355.
———. 2019b. “Energy and Policy Considerations for Deep Learning in NLP.”
arXiv Preprint arXiv:1906.02243, June, 3645–50. https://doi.org/10.18653/v
1/p19-1355.
Sudhakar, Soumya, Vivienne Sze, and Sertac Karaman. 2023. “Data Centers on
Wheels: Emissions from Computing Onboard Autonomous Vehicles.” IEEE
Micro 43 (1): 29–39. https://doi.org/10.1109/mm.2022.3219803.
Sullivan, Gary J., Jens-Rainer Ohm, Woo-Jin Han, and Thomas Wiegand. 2012.
“Overview of the High Efficiency Video Coding (HEVC) Standard.” IEEE
Transactions on Circuits and Systems for Video Technology 22 (12): 1649–68.
Wang, Tianlu, Jieyu Zhao, Mark Yatskar, Kai-Wei Chang, and Vicente Ordonez.
2019. “Balanced Datasets Are Not Enough: Estimating and Mitigating
Gender Bias in Deep Image Representations.” In 2019 IEEE/CVF International
Conference on Computer Vision (ICCV), 5309–18. IEEE. https://doi.org/10.1
109/iccv.2019.00541.
Wang, Xin, Fisher Yu, Zi-Yi Dou, Trevor Darrell, and Joseph E. Gonzalez. 2018.
“SkipNet: Learning Dynamic Routing in Convolutional Networks.” In
Computer Vision – ECCV 2018, 420–36. Springer; Springer International
Publishing. https://doi.org/10.1007/978-3-030-01261-8_25.
Wang, Yaqing, Quanming Yao, James T. Kwok, and Lionel M. Ni. 2020. “Gen-
eralizing from a Few Examples: A Survey on Few-Shot Learning.” ACM
Computing Surveys 53 (3): 1–34. https://doi.org/10.1145/3386252.
Wang, Y., and P. Kanwar. 2019. “BFloat16: The Secret to High Performance on
Cloud TPUs.” Google Cloud Blog.
Wang, Yu Emma, Gu-Yeon Wei, and David Brooks. 2019. “Benchmarking TPU,
GPU, and CPU Platforms for Deep Learning.” arXiv Preprint arXiv:1907.10701,
July. http://arxiv.org/abs/1907.10701v4.
Warden, Pete. 2018. “Speech Commands: A Dataset for Limited-Vocabulary
Speech Recognition.” arXiv Preprint arXiv:1804.03209, April. http://arxiv.
org/abs/1804.03209v1.
Weicker, Reinhold P. 1984. “Dhrystone: A Synthetic Systems Programming
Benchmark.” Communications of the ACM 27 (10): 1013–30. https://doi.org/
10.1145/358274.358283.
Werchniak, Andrew, Roberto Barra Chicote, Yuriy Mishchenko, Jasha Droppo,
Jeff Condal, Peng Liu, and Anish Shah. 2021. “Exploring the Application of
Synthetic Audio in Training Keyword Spotters.” In ICASSP 2021 - 2021 IEEE
International Conference on Acoustics, Speech and Signal Processing (ICASSP),
7993–96. IEEE; IEEE. https://doi.org/10.1109/icassp39728.2021.9413448.
Wiener, Norbert. 1960. “Some Moral and Technical Consequences of Au-
tomation: As Machines Learn They May Develop Unforeseen Strategies
at Rates That Baffle Their Programmers.” Science 131 (3410): 1355–58.
https://doi.org/10.1126/science.131.3410.1355.
Wilkening, Mark, Vilas Sridharan, Si Li, Fritz Previlon, Sudhanva Gurumurthi,
and David R. Kaeli. 2014. “Calculating Architectural Vulnerability Factors
for Spatial Multi-Bit Transient Faults.” In 2014 47th Annual IEEE/ACM
International Symposium on Microarchitecture, 293–305. IEEE; IEEE. https:
//doi.org/10.1109/micro.2014.15.
Witten, Ian H., and Eibe Frank. 2002. “Data Mining: Practical Machine Learning
Tools and Techniques with Java Implementations.” ACM SIGMOD Record
31 (1): 76–77. https://doi.org/10.1145/507338.507355.
Wolpert, D. H., and W. G. Macready. 1997. “No Free Lunch Theorems for
Optimization.” IEEE Transactions on Evolutionary Computation 1 (1): 67–82.
https://doi.org/10.1109/4235.585893.
Wu, Bichen, Kurt Keutzer, Xiaoliang Dai, Peizhao Zhang, Yanghan Wang, Fei
Sun, Yiming Wu, Yuandong Tian, Peter Vajda, and Yangqing Jia. 2019. “FB-
Net: Hardware-Aware Efficient ConvNet Design via Differentiable Neural
Architecture Search.” In 2019 IEEE/CVF Conference on Computer Vision and
Pattern Recognition (CVPR). IEEE.
Yosinski, Jason, Jeff Clune, Yoshua Bengio, and Hod Lipson. 2014. “How
Transferable Are Features in Deep Neural Networks?” Advances in Neural
Information Processing Systems 27.
You, Jie, Jae-Won Chung, and Mosharaf Chowdhury. 2023. “Zeus: Understand-
ing and Optimizing GPU Energy Consumption of DNN Training.” In 20th
USENIX Symposium on Networked Systems Design and Implementation (NSDI
23), 119–39. Boston, MA: USENIX Association. https://www.usenix.org/c
onference/nsdi23/presentation/you.
Yu, Jun, Peng Li, and Zhenhua Wang. 2023. “Efficient Early Exiting Strategies
for Neural Network Acceleration.” IEEE Transactions on Neural Networks and
Learning Systems.
Zafrir, Ofir, Guy Boudoukh, Peter Izsak, and Moshe Wasserblat. 2019. “Q8BERT:
Quantized 8Bit BERT.” In 2019 Fifth Workshop on Energy Efficient Machine
Learning and Cognitive Computing - NeurIPS Edition (EMC2-NIPS), 36–39.
IEEE; IEEE. https://doi.org/10.1109/emc2-nips53020.2019.00016.
Zaharia, Matei, Andrew Chen, Aaron Davidson, Ali Ghodsi, Sue Ann Hong,
Andy Konwinski, Corey Murching, et al. 2018. “Accelerating the Machine
Learning Lifecycle with MLflow.” Databricks.
Zeghidour, Neil, Olivier Teboul, Félix de Chaumont Quitry, and Marco Tagliasac-
chi. 2021. “LEAF: A Learnable Frontend for Audio Classification.” arXiv
Preprint arXiv:2101.08596, January. http://arxiv.org/abs/2101.08596v1.
Zhan, Ruiting, Zachary Oldenburg, and Lei Pan. 2018. “Recovery of Active
Cathode Materials from Lithium-Ion Batteries Using Froth Flotation.” Sus-
tainable Materials and Technologies 17 (September): e00062. https://doi.org/
10.1016/j.susmat.2018.e00062.
Zhang, Chengliang, Minchen Yu, Wei Wang, and Feng Yan. 2019.
“MArk: Exploiting Cloud Services for Cost-Effective, SLO-Aware Machine
Learning Inference Serving.” In 2019 USENIX Annual Technical Conference
(USENIX ATC 19), 1049–62. https://www.usenix.org/conference/atc19/
presentation/zhang-chengliang.
Zhang, Jeff Jun, Tianyu Gu, Kanad Basu, and Siddharth Garg. 2018. “Analyzing
and Mitigating the Impact of Permanent Faults on a Systolic Array Based
Neural Network Accelerator.” In 2018 IEEE 36th VLSI Test Symposium (VTS),
1–6. IEEE; IEEE. https://doi.org/10.1109/vts.2018.8368656.
Zhang, Jeff, Kartheek Rangineni, Zahra Ghodsi, and Siddharth Garg. 2018.
“ThUnderVolt: Enabling Aggressive Voltage Underscaling and Timing Error
Resilience for Energy Efficient Deep Learning Accelerators.” In 2018 55th
ACM/ESDA/IEEE Design Automation Conference (DAC), 1–6. IEEE. https:
//doi.org/10.1109/dac.2018.8465918.
Zhang, Qingxue, Dian Zhou, and Xuan Zeng. 2017. “Highly Wearable Cuff-Less
Blood Pressure and Heart Rate Monitoring with Single-Arm Electrocardio-
gram and Photoplethysmogram Signals.” BioMedical Engineering OnLine 16
(1): 23. https://doi.org/10.1186/s12938-017-0317-z.
Zhang, Xitong, Jialin Song, and Dacheng Tao. 2020. “Efficient Task-Specific
Adaptation for Deep Models.” In International Conference on Learning Repre-
sentations (ICLR).
Zhang, Yi, Jianlei Yang, Linghao Song, Yiyu Shi, Yu Wang, and Yuan Xie. 2021.
“Learning-Based Efficient Sparsity and Quantization for Neural Network