Machine Learning Systems
Vijay Janapa Reddi
Table of contents
Preface i
Why We Wrote This Book . . . . . . . . . . . . . . . . . . . . i
What You’ll Need to Know . . . . . . . . . . . . . . . . . . . ii
Content Transparency Statement . . . . . . . . . . . . . . . . ii
Want to Help Out? . . . . . . . . . . . . . . . . . . . . . . . . ii
Get in Touch . . . . . . . . . . . . . . . . . . . . . . . . . . . ii
Contributors . . . . . . . . . . . . . . . . . . . . . . . . . . . ii
Copyright . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii
Acknowledgements v
Funding Agencies and Companies . . . . . . . . . . . . . . . v
Contributors . . . . . . . . . . . . . . . . . . . . . . . . . . . vi
1 Introduction 1
2 ML Systems 33
2.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . 34
2.2 Cloud ML . . . . . . . . . . . . . . . . . . . . . . . . . 36
2.2.1 Characteristics . . . . . . . . . . . . . . . . . . . 37
2.2.2 Benefits . . . . . . . . . . . . . . . . . . . . . . . 39
2.2.3 Challenges . . . . . . . . . . . . . . . . . . . . . 40
2.2.4 Example Use Cases . . . . . . . . . . . . . . . . 42
2.3 Edge ML . . . . . . . . . . . . . . . . . . . . . . . . . . 43
2.3.1 Characteristics . . . . . . . . . . . . . . . . . . . 43
2.3.2 Benefits . . . . . . . . . . . . . . . . . . . . . . . 45
2.3.3 Challenges . . . . . . . . . . . . . . . . . . . . . 45
2.3.4 Example Use Cases . . . . . . . . . . . . . . . . 45
2.4 Tiny ML . . . . . . . . . . . . . . . . . . . . . . . . . . 46
2.4.1 Characteristics . . . . . . . . . . . . . . . . . . . 46
2.4.2 Benefits . . . . . . . . . . . . . . . . . . . . . . . 48
2.4.3 Challenges . . . . . . . . . . . . . . . . . . . . . 48
3 DL Primer 55
3.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . 56
3.1.1 Definition and Importance . . . . . . . . . . . . 56
3.1.2 Brief History of Deep Learning . . . . . . . . . . 57
3.1.3 Applications of Deep Learning . . . . . . . . . . 59
3.1.4 Relevance to Embedded AI . . . . . . . . . . . . 60
3.2 Neural Networks . . . . . . . . . . . . . . . . . . . . . 60
3.2.1 Perceptrons . . . . . . . . . . . . . . . . . . . . . 60
3.2.2 Multilayer Perceptrons . . . . . . . . . . . . . . 62
3.2.3 Training Process . . . . . . . . . . . . . . . . . . 63
3.2.4 Model Architectures . . . . . . . . . . . . . . . . 65
3.2.5 Traditional ML vs Deep Learning . . . . . . . . 69
3.2.6 Choosing Traditional ML vs. DL . . . . . . . . . 70
3.2.7 Making an Informed Choice . . . . . . . . . . . 72
3.3 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . 73
3.4 Resources . . . . . . . . . . . . . . . . . . . . . . . . . 73
4 AI Workflow 75
4.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . 76
4.2 Traditional vs. Embedded AI . . . . . . . . . . . . . . 77
4.2.1 Resource Optimization . . . . . . . . . . . . . . 78
4.2.2 Real-time Processing . . . . . . . . . . . . . . . 79
4.2.3 Data Management and Privacy . . . . . . . . . . 79
4.2.4 Hardware-Software Integration . . . . . . . . . 79
4.3 Roles & Responsibilities . . . . . . . . . . . . . . . . . 79
4.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . 81
4.5 Resources . . . . . . . . . . . . . . . . . . . . . . . . . 81
5 Data Engineering 83
5.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . 84
5.2 Problem Definition . . . . . . . . . . . . . . . . . . . . 86
5.3 Data Sourcing . . . . . . . . . . . . . . . . . . . . . . . 91
5.3.1 Pre-existing datasets . . . . . . . . . . . . . . . . 91
5.3.2 Web Scraping . . . . . . . . . . . . . . . . . . . . 92
5.3.3 Crowdsourcing . . . . . . . . . . . . . . . . . . 95
5.3.4 Synthetic Data . . . . . . . . . . . . . . . . . . . 96
5.4 Data Storage . . . . . . . . . . . . . . . . . . . . . . . . 98
5.5 Data Processing . . . . . . . . . . . . . . . . . . . . . . 102
5.6 Data Labeling . . . . . . . . . . . . . . . . . . . . . . . 105
6 AI Frameworks 121
6.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . 122
6.2 Framework Evolution . . . . . . . . . . . . . . . . . . . 124
6.3 Deep Dive into TensorFlow . . . . . . . . . . . . . . . 127
6.3.1 TF Ecosystem . . . . . . . . . . . . . . . . . . . 127
6.3.2 Static Computation Graph . . . . . . . . . . . . 130
6.3.3 Usability & Deployment . . . . . . . . . . . . . 131
6.3.4 Architecture Design . . . . . . . . . . . . . . . . 131
6.3.5 Built-in Functionality & Keras . . . . . . . . . . 131
6.3.6 Limitations and Challenges . . . . . . . . . . . . 132
6.3.7 PyTorch vs. TensorFlow . . . . . . . . . . . . . . 133
6.4 Basic Framework Components . . . . . . . . . . . . . . 134
6.4.1 Tensor data structures . . . . . . . . . . . . . . . 135
6.4.2 PyTorch . . . . . . . . . . . . . . . . . . . . . . . 137
6.4.3 TensorFlow . . . . . . . . . . . . . . . . . . . . . 137
6.4.4 Computational graphs . . . . . . . . . . . . . . 138
6.4.5 Data Pipeline Tools . . . . . . . . . . . . . . . . 144
6.4.6 Data Augmentation . . . . . . . . . . . . . . . . 145
6.4.7 Loss Functions and Optimization Algorithms . 146
6.4.8 Model Training Support . . . . . . . . . . . . . 147
6.4.9 Validation and Analysis . . . . . . . . . . . . . . 148
6.4.10 Differentiable programming . . . . . . . . . . . 149
6.4.11 Hardware Acceleration . . . . . . . . . . . . . . 150
6.5 Advanced Features . . . . . . . . . . . . . . . . . . . . 151
6.5.1 Distributed training . . . . . . . . . . . . . . . . 152
6.5.2 Model Conversion . . . . . . . . . . . . . . . . . 152
6.5.3 AutoML, No-Code/Low-Code ML . . . . . . . . 153
6.5.4 Advanced Learning Methods . . . . . . . . . . . 153
6.6 Framework Specialization . . . . . . . . . . . . . . . . 156
6.6.1 Cloud . . . . . . . . . . . . . . . . . . . . . . . . 156
6.6.2 Edge . . . . . . . . . . . . . . . . . . . . . . . . 156
6.6.3 Embedded . . . . . . . . . . . . . . . . . . . . . 157
7 AI Training 175
7.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . 176
7.2 Mathematics of Neural Networks . . . . . . . . . . . . 178
7.2.1 Neural Network Notation . . . . . . . . . . . . . 179
7.2.2 Loss Function as a Measure of Goodness of Fit
against Training Data . . . . . . . . . . . . . . . 182
7.2.3 Training Neural Networks with Gradient Descent 183
7.2.4 Backpropagation . . . . . . . . . . . . . . . . . . 184
7.3 Differentiable Computation Graphs . . . . . . . . . . . 188
7.4 Training Data . . . . . . . . . . . . . . . . . . . . . . . 188
7.4.1 Dataset Splits . . . . . . . . . . . . . . . . . . . . 190
7.4.2 Common Pitfalls and Mistakes . . . . . . . . . . 190
7.5 Optimization Algorithms . . . . . . . . . . . . . . . . 198
7.5.1 Optimizations . . . . . . . . . . . . . . . . . . . 198
7.5.2 Tradeoffs . . . . . . . . . . . . . . . . . . . . . . 199
7.5.3 Benchmarking Algorithms . . . . . . . . . . . . 200
7.6 Hyperparameter Tuning . . . . . . . . . . . . . . . . . 201
7.6.1 Search Algorithms . . . . . . . . . . . . . . . . . 202
7.6.2 System Implications . . . . . . . . . . . . . . . . 204
7.6.3 Auto Tuners . . . . . . . . . . . . . . . . . . . . 205
7.7 Regularization . . . . . . . . . . . . . . . . . . . . . . . 207
7.7.1 L1 and L2 . . . . . . . . . . . . . . . . . . . . . . 208
7.7.2 Dropout . . . . . . . . . . . . . . . . . . . . . . 210
8 Efficient AI 233
8.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . 234
8.2 The Need for Efficient AI . . . . . . . . . . . . . . . . . 235
8.3 Efficient Model Architectures . . . . . . . . . . . . . . 236
8.4 Efficient Model Compression . . . . . . . . . . . . . . 236
8.5 Efficient Inference Hardware . . . . . . . . . . . . . . 238
8.6 Efficient Numerics . . . . . . . . . . . . . . . . . . . . 240
8.6.1 Numerical Formats . . . . . . . . . . . . . . . . 240
8.6.2 Efficiency Benefits . . . . . . . . . . . . . . . . . 244
8.7 Evaluating Models . . . . . . . . . . . . . . . . . . . . 244
8.7.1 Efficiency Metrics . . . . . . . . . . . . . . . . . 245
8.7.2 Efficiency Comparisons . . . . . . . . . . . . . . 246
8.8 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . 247
8.9 Resources . . . . . . . . . . . . . . . . . . . . . . . . . 248
10 AI Acceleration 321
10.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . 322
10.2 Background and Basics . . . . . . . . . . . . . . . . . . 323
10.2.1 Historical Background . . . . . . . . . . . . . . 323
10.2.2 The Need for Acceleration . . . . . . . . . . . . 324
10.2.3 General Principles . . . . . . . . . . . . . . . . . 325
10.3 Accelerator Types . . . . . . . . . . . . . . . . . . . . . 328
10.3.1 Application-Specific Integrated Circuits (ASICs) 329
10.3.2 Field-Programmable Gate Arrays (FPGAs) . . . 333
10.3.3 Digital Signal Processors (DSPs) . . . . . . . . . 336
10.3.4 Graphics Processing Units (GPUs) . . . . . . . . 339
10.3.5 Central Processing Units (CPUs) . . . . . . . . . 342
10.3.6 Comparison . . . . . . . . . . . . . . . . . . . . 345
10.4 Hardware-Software Co-Design . . . . . . . . . . . . . 347
10.4.1 The Need for Co-Design . . . . . . . . . . . . . 347
10.4.2 Principles of Hardware-Software Co-Design . . 349
10.4.3 Challenges . . . . . . . . . . . . . . . . . . . . . 351
11 Benchmarking AI 383
11.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . 384
11.2 Historical Context . . . . . . . . . . . . . . . . . . . . . 386
11.2.1 Performance Benchmarks . . . . . . . . . . . . . 386
11.2.2 Energy Benchmarks . . . . . . . . . . . . . . . . 387
11.2.3 Custom Benchmarks . . . . . . . . . . . . . . . 388
11.2.4 Community Consensus . . . . . . . . . . . . . . 389
11.3 AI Benchmarks: System, Model, and Data . . . . . . . 390
11.3.1 System Benchmarks . . . . . . . . . . . . . . . . 390
11.3.2 Model Benchmarks . . . . . . . . . . . . . . . . 390
11.3.3 Data Benchmarks . . . . . . . . . . . . . . . . . 390
11.4 System Benchmarking . . . . . . . . . . . . . . . . . . 391
11.4.1 Granularity . . . . . . . . . . . . . . . . . . . . . 391
11.4.2 Benchmark Components . . . . . . . . . . . . . 395
11.4.3 Training Benchmarks . . . . . . . . . . . . . . . 397
13 ML Operations 485
13.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . 486
13.2 Historical Context . . . . . . . . . . . . . . . . . . . . . 487
13.2.1 DevOps . . . . . . . . . . . . . . . . . . . . . . . 487
13.2.2 MLOps . . . . . . . . . . . . . . . . . . . . . . . 488
13.3 Key Components of MLOps . . . . . . . . . . . . . . . 490
13.3.1 Data Management . . . . . . . . . . . . . . . . . 491
13.3.2 CI/CD Pipelines . . . . . . . . . . . . . . . . . . 493
13.3.3 Model Training . . . . . . . . . . . . . . . . . . 494
13.3.4 Model Evaluation . . . . . . . . . . . . . . . . . 495
13.3.5 Model Deployment . . . . . . . . . . . . . . . . 496
13.3.6 Model Serving . . . . . . . . . . . . . . . . . . . 497
13.3.7 Infrastructure Management . . . . . . . . . . . . 499
13.3.8 Monitoring . . . . . . . . . . . . . . . . . . . . . 499
13.3.9 Governance . . . . . . . . . . . . . . . . . . . . 500
13.3.10 Communication & Collaboration . . . . . . . . . 501
13.4 Hidden Technical Debt in ML Systems . . . . . . . . . 502
13.4.1 Model Boundary Erosion . . . . . . . . . . . . . 502
13.4.2 Entanglement . . . . . . . . . . . . . . . . . . . 503
13.4.3 Correction Cascades . . . . . . . . . . . . . . . . 503
13.4.4 Undeclared Consumers . . . . . . . . . . . . . . 504
13.4.5 Data Dependency Debt . . . . . . . . . . . . . . 504
13.4.6 Analysis Debt from Feedback Loops . . . . . . . 505
13.4.7 Pipeline Jungles . . . . . . . . . . . . . . . . . . 505
13.4.8 Configuration Debt . . . . . . . . . . . . . . . . 506
15 Responsible AI 615
15.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . 616
15.2 Definition . . . . . . . . . . . . . . . . . . . . . . . . . 617
15.3 Principles and Concepts . . . . . . . . . . . . . . . . . 618
15.3.1 Transparency and Explainability . . . . . . . . . 618
15.3.2 Fairness, Bias, and Discrimination . . . . . . . . 618
15.3.3 Privacy and Data Governance . . . . . . . . . . 618
15.3.4 Safety and Robustness . . . . . . . . . . . . . . . 619
15.3.5 Accountability and Governance . . . . . . . . . 619
15.4 Cloud, Edge & Tiny ML . . . . . . . . . . . . . . . . . 620
15.4.1 Summary . . . . . . . . . . . . . . . . . . . . . . 620
15.4.2 Explainability . . . . . . . . . . . . . . . . . . . 620
15.4.3 Fairness . . . . . . . . . . . . . . . . . . . . . . . 621
15.4.4 Safety . . . . . . . . . . . . . . . . . . . . . . . . 622
15.4.5 Accountability . . . . . . . . . . . . . . . . . . . 622
15.4.6 Governance . . . . . . . . . . . . . . . . . . . . 622
15.4.7 Privacy . . . . . . . . . . . . . . . . . . . . . . . 623
15.5 Technical Aspects . . . . . . . . . . . . . . . . . . . . . 623
15.5.1 Detecting and Mitigating Bias . . . . . . . . . . 623
15.5.2 Preserving Privacy . . . . . . . . . . . . . . . . . 627
15.5.3 Machine Unlearning . . . . . . . . . . . . . . . . 628
16 Sustainable AI 645
16.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . 646
16.2 Social and Ethical Responsibility . . . . . . . . . . . . 647
16.2.1 Ethical Considerations . . . . . . . . . . . . . . 647
16.2.2 Long-term Sustainability . . . . . . . . . . . . . 648
16.2.3 AI for Environmental Good . . . . . . . . . . . . 650
16.2.4 Case Study . . . . . . . . . . . . . . . . . . . . . 651
16.3 Energy Consumption . . . . . . . . . . . . . . . . . . . 651
16.3.1 Understanding Energy Needs . . . . . . . . . . 651
16.3.2 Data Centers and Their Impact . . . . . . . . . . 654
16.3.3 Energy Optimization . . . . . . . . . . . . . . . 657
16.4 Carbon Footprint . . . . . . . . . . . . . . . . . . . . . 657
16.4.1 Definition and Significance . . . . . . . . . . . . 658
16.4.2 The Need for Awareness and Action . . . . . . . 659
16.4.3 Estimating the AI Carbon Footprint . . . . . . . 660
16.5 Beyond Carbon Footprint . . . . . . . . . . . . . . . . 662
16.5.1 Water Usage and Stress . . . . . . . . . . . . . . 663
16.5.2 Hazardous Chemicals Usage . . . . . . . . . . . 664
16.5.3 Resource Depletion . . . . . . . . . . . . . . . . 664
16.5.4 Hazardous Waste Generation . . . . . . . . . . . 665
16.5.5 Biodiversity Impacts . . . . . . . . . . . . . . . . 666
16.6 Life Cycle Analysis . . . . . . . . . . . . . . . . . . . . 667
16.6.1 Stages of an AI System’s Life Cycle . . . . . . . 668
16.6.2 Environmental Impact at Each Stage . . . . . . . 668
16.7 Challenges in LCA . . . . . . . . . . . . . . . . . . . . 669
16.7.1 Lack of Consistency and Standards . . . . . . . 669
16.7.2 Data Gaps . . . . . . . . . . . . . . . . . . . . . 670
16.7.3 Rapid Pace of Evolution . . . . . . . . . . . . . . 671
17 Robust AI 697
17.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . 698
17.2 Real-World Examples . . . . . . . . . . . . . . . . . . . 700
17.2.1 Cloud . . . . . . . . . . . . . . . . . . . . . . . . 700
17.2.2 Edge . . . . . . . . . . . . . . . . . . . . . . . . 701
17.2.3 Embedded . . . . . . . . . . . . . . . . . . . . . 703
17.3 Hardware Faults . . . . . . . . . . . . . . . . . . . . . 705
17.3.1 Transient Faults . . . . . . . . . . . . . . . . . . 706
17.3.2 Permanent Faults . . . . . . . . . . . . . . . . . 710
17.3.3 Intermittent Faults . . . . . . . . . . . . . . . . . 714
18 Generative AI 783
20 Conclusion 801
20.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . 801
20.2 Knowing the Importance of ML Datasets . . . . . . . . 802
I LABS 815
Overview 819
Learning Objectives . . . . . . . . . . . . . . . . . . . . . . . 819
Target Audience . . . . . . . . . . . . . . . . . . . . . . . . . 820
Supported Devices . . . . . . . . . . . . . . . . . . . . . . . . 820
Lab Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . 821
Recommended Lab Sequence . . . . . . . . . . . . . . . . . . 821
Troubleshooting and Support . . . . . . . . . . . . . . . . . . 822
Credits . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 822
Setup 831
Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 831
Hardware . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 832
Two Parallel Cores . . . . . . . . . . . . . . . . . . . . . 832
Memory . . . . . . . . . . . . . . . . . . . . . . . . . . . 833
Sensors . . . . . . . . . . . . . . . . . . . . . . . . . . . 833
Arduino IDE Installation . . . . . . . . . . . . . . . . . . . . 834
Testing the Microphone . . . . . . . . . . . . . . . . . . 835
Testing the IMU . . . . . . . . . . . . . . . . . . . . . . 835
Testing the ToF (Time of Flight) Sensor . . . . . . . . . 836
Testing the Camera . . . . . . . . . . . . . . . . . . . . . 838
Installing the OpenMV IDE . . . . . . . . . . . . . . . . . . . 838
Connecting the Nicla Vision to Edge Impulse Studio . . . . . 846
Expanding the Nicla Vision Board (optional) . . . . . . . . . 849
Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . 853
Resources . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 854
Post-processing . . . . . . . . . . . . . . . . . . . . . . . 960
Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . 960
Case Applications . . . . . . . . . . . . . . . . . . . . . 960
Nicla 3D case . . . . . . . . . . . . . . . . . . . . . . . . 962
Resources . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 963
Setup 969
Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 969
Installing the XIAO ESP32S3 Sense on Arduino IDE . . . . . 971
Testing the board with BLINK . . . . . . . . . . . . . . . . . 973
Connecting Sense module (Expansion Board) . . . . . . . . . 974
Microphone Test . . . . . . . . . . . . . . . . . . . . . . . . . 975
Testing the Camera . . . . . . . . . . . . . . . . . . . . . . . . 978
Testing WiFi . . . . . . . . . . . . . . . . . . . . . . . . . . . 979
Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . 987
Resources . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 987
IV Raspberry Pi 1109
Pre-requisites . . . . . . . . . . . . . . . . . . . . . . . . . . . 1111
Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1112
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1112
Setup 1113
Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1114
Key Features . . . . . . . . . . . . . . . . . . . . . . . . 1114
Raspberry Pi Models (covered in this book) . . . . . . . 1114
Engineering Applications . . . . . . . . . . . . . . . . . 1115
Hardware Overview . . . . . . . . . . . . . . . . . . . . . . . 1116
Raspberry Pi Zero 2W . . . . . . . . . . . . . . . . . . . 1116
Raspberry Pi 5 . . . . . . . . . . . . . . . . . . . . . . . 1116
Installing the Operating System . . . . . . . . . . . . . . . . 1117
The Operating System (OS) . . . . . . . . . . . . . . . . 1117
Installation . . . . . . . . . . . . . . . . . . . . . . . . . 1118
Initial Configuration . . . . . . . . . . . . . . . . . . . . 1121
Remote Access . . . . . . . . . . . . . . . . . . . . . . . . . . 1121
SSH Access . . . . . . . . . . . . . . . . . . . . . . . . . 1121
To shut down the Raspi via terminal: . . . . . . . . . . 1122
Transfer Files between the Raspi and a computer . . . . 1122
Increasing SWAP Memory . . . . . . . . . . . . . . . . . . . 1126
Installing a Camera . . . . . . . . . . . . . . . . . . . . . . . 1127
Installing a USB WebCam . . . . . . . . . . . . . . . . . 1128
Installing a Camera Module on the CSI port . . . . . . . 1132
Running the Raspi Desktop remotely . . . . . . . . . . . . . 1135
Updating and Installing Software . . . . . . . . . . . . . . . 1139
Model-Specific Considerations . . . . . . . . . . . . . . . . . 1139
Raspberry Pi Zero (Raspi-Zero) . . . . . . . . . . . . . . 1139
Raspberry Pi 4 or 5 (Raspi-4 or Raspi-5) . . . . . . . . . 1140
VI REFERENCES 1353
References 1355
Preface
Get in Touch
Do you have questions or feedback? Feel free to e-mail Prof. Vijay
Janapa Reddi directly, or you are welcome to start a discussion thread
on GitHub.
Contributors
A big thanks to everyone who’s helped make this book what it is!
You can see the full list of individual contributors here and additional
GitHub style details here. Join us as a contributor!
Copyright
This book is open-source and developed collaboratively through
GitHub. Unless otherwise stated, this work is licensed under the
Creative Commons Attribution-NonCommercial-ShareAlike 4.0 Inter-
national (CC BY-NC-SA 4.0). You can find the full text
of the license here.
Contributors to this project have dedicated their contributions to
the public domain or under the same open license as the original
project. While the contributions are collaborative, each contributor
retains copyright in their respective contributions.
For details on authorship, contributions, and how to contribute,
please see the project repository on GitHub.
All trademarks and registered trademarks mentioned in this book
are the property of their respective owners.
The information provided in this book is believed to be accurate and
reliable. However, the authors, editors, and publishers cannot be held
liable for any damages caused or alleged to be caused either directly or
indirectly by the information contained in this book.
Acknowledgements
This book, inspired by the TinyML edX course and CS294r at Harvard
University, is the result of years of hard work and collaboration with
many students, researchers, and practitioners. We are deeply indebted
to the folks whose groundbreaking work laid its foundation.
As our understanding of machine learning systems deepened, we re-
alized that fundamental principles apply across scales, from tiny em-
bedded systems to large-scale deployments. This realization shaped
the book’s expansion into an exploration of machine learning systems
with the aim of providing a foundation applicable across the spectrum
of implementations.
Contributors
We express our sincere gratitude to the open-source community of
learners, educators, and contributors. Each contribution, whether a
chapter section or a single-word correction, has significantly enhanced
the quality of this resource. We also acknowledge those who have
shared insights, identified issues, and provided valuable feedback be-
hind the scenes.
A comprehensive list of all GitHub contributors, automatically up-
dated with each new contribution, is available below. For those inter-
ested in contributing further, please consult our GitHub page for more
information.
Vijay Janapa Reddi
jasonjabbour
Ikechukwu Uchendu
Naeem Khoshnevis
Marcelo Rovai
Sara Khosravi
Douwe den Blanken
shanzehbatool
Kai Kleinbard
Elias Nuwara
Matthew Stewart
Jared Ping
Itai Shapira
Maximilian Lam
Jayson Lin
Sophia Cho
Andrea
Jeffrey Ma
Alex Rodriguez
Korneel Van den Berghe
Colby Banbury
Zishen Wan
Abdulrahman Mahmoud
Srivatsan Krishnan
Divya Amirtharaj
Emeka Ezike
Aghyad Deeb
Haoran Qiu
marin-llobet
Emil Njor
Aditi Raju
Jared Ni
Michael Schnebly
oishib
ELSuitorHarvard
Henry Bae
Jae-Won Chung
Yu-Shun Hsiao
Mark Mazumder
Marco Zennaro
Eura Nofshin
Andrew Bass
Pong Trairatvorakul
Jennifer Zhou
Shvetank Prakash
Alex Oesterling
Arya Tschand
Bruno Scaglione
Gauri Jain
Allen-Kuang
Fin Amin
Fatima Shah
The Random DIY
gnodipac886
Sercan Aygün
Baldassarre Cesarano
Abenezer
Bilge Acun
yanjingl
Yang Zhou
abigailswallow
Jason Yik
happyappledog
Curren Iyer
Emmanuel Rassou
Sonia Murthy
Shreya Johri
Jessica Quaye
Vijay Edupuganti
Costin-Andrei Oncescu
Annie Laurie Cook
Jothi Ramaswamy
Batur Arslan
Fatima Shah
a-saraf
songhan
Zishen
Overview
Welcome to this collaborative textbook, developed as part of the
CS249r Machine Learning Systems class at Harvard University. Our
goal is to provide a comprehensive resource for educators and stu-
dents seeking to understand machine learning systems. This book is
continually updated to incorporate the latest insights and effective
teaching strategies.
Topics Explored
This textbook offers a comprehensive exploration of various aspects
of machine learning systems, covering the entire end-to-end workflow.
Starting with foundational concepts, it progresses through essential
areas such as data engineering, AI frameworks, and model training.
To enhance the learning experience, we included a diverse array
of supplementary materials. These resources consist of slides that
summarize key concepts, videos providing detailed explanations
and demonstrations, exercises designed to reinforce understanding,
and labs that offer hands-on experience with the discussed tools and
techniques.
Readers will gain insights into optimizing models for efficiency, de-
ploying AI across different hardware platforms, and benchmarking
performance. The book also delves into advanced topics, including
security, privacy, responsible and sustainable AI, robust AI, and gener-
ative AI. Additionally, it examines the social impact of AI, concluding
with an emphasis on the positive contributions AI can make to society.
Chapter-by-Chapter Insights
Here’s a closer look at what each chapter covers. We have structured
the book into six main sections: Fundamentals, Workflow, Training,
Deployment, Advanced Topics, and Impact. These sections closely
reflect the major components of a typical machine learning pipeline,
from understanding the basic concepts to deploying and maintaining
AI systems in real-world applications. By organizing the content in
this manner, we aim to provide a logical progression that mirrors the
actual process of developing and implementing AI systems.
Fundamentals
In the Fundamentals section, we lay the groundwork for understand-
ing AI. This is far from being a thorough deep dive into the algorithms,
but we aim to introduce key concepts, provide an overview of machine
learning systems, and dive into the principles and algorithms of deep
learning that power AI applications in their associated systems. This
section equips you with the essential knowledge needed to grasp the
subsequent chapters.
Workflow
The Workflow section guides you through the practical aspects of
building AI models. We break down the AI workflow, discuss data
engineering best practices, and review popular AI frameworks. By
the end of this section, you’ll have a clear understanding of the
steps involved in developing proficient AI applications and the tools
available to streamline the process.
Training
In the Training section, we explore techniques for training efficient
and reliable AI models. We cover strategies for achieving efficiency,
model optimizations, and the role of specialized hardware in AI accel-
eration. This section empowers you with the knowledge to develop
high-performing models that can be seamlessly integrated into AI sys-
tems.
7. AI Training: This chapter examines model training, exploring
techniques for developing efficient and reliable models.
8. Efficient AI: Here, we discuss strategies for achieving efficiency
in AI applications, from computational resource optimization to
performance enhancement.
Deployment
The Deployment section focuses on the challenges and solutions for de-
ploying AI models. We discuss benchmarking methods to evaluate AI
system performance, techniques for on-device learning to improve effi-
ciency and privacy, and the processes involved in ML operations. This
section equips you with the skills to effectively deploy and maintain
AI functionalities in AI systems.
11. Benchmarking AI: This chapter focuses on how to evaluate AI
systems through systematic benchmarking methods.
12. On-Device Learning: We explore techniques for localized learn-
ing, which enhances both efficiency and privacy.
13. ML Operations: This chapter looks at the processes involved
in the seamless integration, monitoring, and maintenance of AI
functionalities.
Advanced Topics
In the Advanced Topics section, we will study the critical issues sur-
rounding AI. We address privacy and security concerns, explore the
Social Impact
The Impact section highlights the transformative potential of AI in
various domains. We showcase real-world applications of TinyML
in healthcare, agriculture, conservation, and other areas where AI is
making a positive difference. This section inspires you to leverage the
power of AI for societal good and to contribute to the development of
impactful solutions.
19. AI for Good: We highlight positive applications of TinyML in
areas like healthcare, agriculture, and conservation.
Closing
In the Closing section, we reflect on the key learnings from the book
and look ahead to the future of AI. We synthesize the concepts covered,
discuss emerging trends, and provide guidance on continuing your
learning journey in this rapidly evolving field. This section leaves you
with a comprehensive understanding of AI and the excitement to apply
your knowledge in innovative ways.
20. Conclusion: The book concludes with a reflection on the key
learnings and future directions in the field of AI.
Tailored Learning
We understand that readers have diverse interests; some may wish to
grasp the fundamentals, while others are eager to delve into advanced
topics like hardware acceleration or AI ethics. To help you navigate the
book more effectively, we’ve created a persona-based reading guide
tailored to your specific interests and goals. This guide assists you in
identifying the reader persona that best matches your interests. Each
persona represents a distinct reader profile with specific objectives. By
selecting the persona that resonates with you, you can focus on the
chapters and sections most relevant to your needs.
Chapter 1
Introduction
Rule-based (1980s):
IF contains("viagra") OR contains("winner") THEN spam
Statistical (1990s):
P(spam|word) = (frequency in spam emails) / (total frequency)
Combined using Naive Bayes:
P(spam|email) ∝ P(spam) × ∏ P(word|spam)
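To make the contrast concrete, the sketch below implements both ideas in Python. It is illustrative only: the trigger words, smoothing value, and word probabilities are made up rather than drawn from a real spam corpus.

```python
# Minimal sketch of the two approaches described above (illustrative values only).

def rule_based_is_spam(email: str) -> bool:
    """1980s-style rule: flag emails containing hand-picked trigger words."""
    triggers = ("viagra", "winner")
    text = email.lower()
    return any(word in text for word in triggers)

def naive_bayes_spam_score(email: str, p_spam: float,
                           p_word_given_spam: dict) -> float:
    """1990s-style statistical score: P(spam) times the product of P(word|spam).

    Unknown words fall back to a small smoothing probability.
    """
    score = p_spam
    for word in email.lower().split():
        score *= p_word_given_spam.get(word, 1e-3)
    return score

# Example usage with made-up word statistics.
stats = {"winner": 0.6, "claim": 0.4, "meeting": 0.01}
print(rule_based_is_spam("You are a WINNER"))                  # True
print(naive_bayes_spam_score("winner claim prize", 0.3, stats))
```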
Aspect             | Symbolic AI                       | Expert Systems               | Shallow / Statistical Learning    | Deep Learning
Best Use Case      | Well-defined, rule-based problems | Specific domain problems     | Various structured data problems  | Complex, unstructured data problems
Data Handling      | Minimal data needed               | Domain knowledge-based       | Moderate data required            | Large-scale data processing
Adaptability       | Fixed rules                       | Domain-specific adaptability | Adaptable to various domains      | Highly adaptable to diverse tasks
Problem Complexity | Simple, logic-based               | Complicated, domain-specific | Complex, structured               | Highly complex, unstructured
The table serves as a bridge between the early approaches we’ve dis-
cussed and the more recent developments in shallow and deep learn-
ing that we’ll explore next. It sets the stage for understanding why
certain approaches gained prominence in different eras and how each
new paradigm built upon and addressed the limitations of its predeces-
sors. Moreover, it illustrates how the strengths of earlier approaches
continue to influence and enhance modern AI techniques, particularly
in the era of foundation models.
What made this era distinct was its hybrid approach: human-
engineered features combined with statistical learning. These methods had
strong mathematical foundations (researchers could prove why they
worked). They performed well even with limited data. They were
computationally efÏcient. They produced reliable, reproducible
results.
Take the example of face detection, where the Viola-Jones algorithm
(2001) achieved real-time performance using simple rectangular fea-
tures and a cascade of classifiers. This algorithm powered digital cam-
era face detection for nearly a decade.
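For a sense of how accessible this approach remains, the sketch below runs a pretrained Haar cascade, OpenCV's packaged implementation of this style of detector. The image path is a placeholder, and opencv-python must be installed.

```python
# Minimal sketch of Viola-Jones-style face detection using OpenCV's
# bundled Haar cascade. "photo.jpg" is a placeholder path.
import cv2

cascade_path = cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
face_detector = cv2.CascadeClassifier(cascade_path)

image = cv2.imread("photo.jpg")
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

# Scan the image at multiple scales with the cascade of classifiers.
faces = face_detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
print(f"Detected {len(faces)} face(s)")
```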
frontiers and explore the vast unknowns of the universe, their discov-
eries are only possible because of the complex engineering systems
supporting them—the rockets that lift them into space, the life sup-
port systems that keep them alive, and the communication networks
that keep them connected to Earth. Similarly, while AI researchers
push the boundaries of what’s possible with learning algorithms, their
breakthroughs only become practical reality through careful systems
engineering. Modern AI systems need robust infrastructure to collect
and manage data, powerful computing systems to train models, and
reliable deployment platforms to serve millions of users.
This emergence of machine learning systems engineering as an im-
portant discipline reflects a broader reality: turning AI algorithms into
real-world systems requires bridging the gap between theoretical pos-
sibilities and practical implementation. It’s not enough to have a bril-
liant algorithm if you can't efficiently collect and process the data it
needs, distribute its computation across hundreds of machines, serve
it reliably to millions of users, or monitor its performance in produc-
tion.
Understanding this interplay between algorithms and engineer-
ing has become fundamental for modern AI practitioners. While
researchers continue to push the boundaries of what’s algorithmically
possible, engineers are tackling the complex challenge of making
these algorithms work reliably and efficiently in the real world. This
brings us to a fundamental question: what exactly is a machine learn-
ing system, and what makes it different from traditional software
systems?
Unlike traditional software systems, machine learning systems derive their behavior
from patterns in data. This shift from code to data as the primary
driver of system behavior introduces new complexities.
As illustrated in Figure 1.6, the ML lifecycle consists of intercon-
nected stages from data collection through model monitoring, with
feedback loops for continuous improvement when performance de-
grades or models need enhancement.
Data Aspects
The data ecosystem in FarmBeats is diverse and distributed. Sensors
deployed across fields collect real-time data on soil moisture, temper-
ature, and nutrient levels. Drones equipped with multispectral cam-
eras capture high-resolution imagery of crops, providing insights into
plant health and growth patterns. Weather stations contribute local cli-
mate data, while historical farming records offer context for long-term
trends. The challenge lies not just in collecting this heterogeneous data,
but in managing its flow from dispersed, often remote locations with
limited connectivity. FarmBeats employs innovative data transmission
techniques, such as using TV white spaces (unused broadcasting fre-
quencies) to extend internet connectivity to far-flung sensors. This ap-
proach to data collection and transmission embodies the principles of
edge computing we discussed earlier, where data processing begins
at the source to reduce bandwidth requirements and enable real-time
decision making.
Algorithm/Model Aspects
FarmBeats uses a variety of ML algorithms tailored to agricultural
applications. For soil moisture prediction, it uses temporal neural net-
works that can capture the complex dynamics of water movement in
soil. Computer vision algorithms process drone imagery to detect crop
stress, pest infestations, and yield estimates. These models must be
robust to noisy data and capable of operating with limited computa-
tional resources. Machine learning methods such as “transfer learn-
ing” allow models to learn on data-rich farms to be adapted for use
in areas with limited historical data. The system also incorporates a
mixture of methods that combine outputs from multiple algorithms
to improve prediction accuracy and reliability. A key challenge Farm-
Beats addresses is model personalization—adapting general models
to the specific conditions of individual farms, which may have unique
soil compositions, microclimates, and farming practices.
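FarmBeats' actual models are not reproduced here, but the transfer-learning idea mentioned above can be sketched generically: freeze a feature extractor pretrained on data-rich sources and fine-tune only a small head on farm-specific data. The dataset arrays (x_local, y_local) and the crop-stress task below are hypothetical.

```python
# Generic transfer-learning sketch (not FarmBeats' actual code).
import tensorflow as tf

# Feature extractor pretrained on a large, generic image corpus.
base = tf.keras.applications.MobileNetV2(
    input_shape=(96, 96, 3), include_top=False, weights="imagenet")
base.trainable = False  # keep the general-purpose features fixed

model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(1, activation="sigmoid"),  # hypothetical crop-stress label
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Fine-tune on the small dataset available for one specific farm.
# x_local, y_local are hypothetical NumPy arrays of local imagery and labels.
# model.fit(x_local, y_local, epochs=5, batch_size=32)
```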
Computing Infrastructure Aspects
FarmBeats exemplifies the edge computing paradigm we explored
in our discussion of the ML system spectrum. At the lowest level, em-
bedded ML models run directly on IoT devices and sensors, perform-
ing basic data filtering and anomaly detection. Edge devices, such as
ruggedized field gateways, aggregate data from multiple sensors and
run more complex models for local decision-making. These edge de-
vices operate in challenging conditions, requiring robust hardware de-
signs and efficient power management to function reliably in remote
agricultural settings. The system employs a hierarchical architecture,
with more computationally intensive tasks offloaded to on-premises
servers or the cloud. This tiered approach allows FarmBeats to balance
the need for real-time processing with the benefits of centralized data
analysis and model training. The infrastructure also includes mecha-
nisms for over-the-air model updates, ensuring that edge devices can
receive improved models as more data becomes available and algo-
rithms are refined.
Impact and Future Implications
FarmBeats shows how ML systems can be deployed in resource-
constrained, real-world environments to drive significant improvements.
Data Aspects
The data underpinning AlphaFold’s success is vast and multifaceted.
The primary dataset is the Protein Data Bank (PDB), which contains the
dards for what data to collect. Some records might have missing infor-
mation, while others might contain errors or inconsistencies that need
to be cleaned up before the data can be useful.
As ML systems grow, they often need to handle increasingly large
amounts of data. A video streaming service like Netflix, for example,
needs to process billions of viewer interactions to power its recommen-
dation system. This scale introduces new challenges in how to store,
process, and manage such large datasets efficiently.
Another critical challenge is how data changes over time. This phe-
nomenon, known as “data drift,” occurs when the patterns in new data
begin to differ from the patterns the system originally learned from.
For example, many predictive models struggled during the COVID-19
pandemic because consumer behavior changed so dramatically that
historical patterns became less relevant. ML systems need ways to de-
tect when this happens and adapt accordingly.
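One simple way to detect such drift, sketched below under the assumption that a single numeric feature is being monitored, is to compare the distribution of recent production data against the training data with a two-sample statistical test; the threshold and synthetic data are illustrative.

```python
# Minimal drift-detection sketch: compare training-time and production-time
# distributions of one feature with a Kolmogorov-Smirnov test.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
train_feature = rng.normal(loc=0.0, scale=1.0, size=5000)  # historical data
live_feature = rng.normal(loc=0.4, scale=1.0, size=1000)   # recent, shifted data

statistic, p_value = ks_2samp(train_feature, live_feature)
if p_value < 0.01:
    print(f"Possible data drift (KS statistic={statistic:.3f}, p={p_value:.2e})")
else:
    print("No significant drift detected")
```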
sign and deployment. Throughout this book, we’ll explore these chal-
lenges in detail and examine strategies for addressing them effectively.
As illustrated in Figure 1.9, the five pillars central to the frame-
work are:
Chapter 2
ML Systems
Learning Objectives
2.1 Overview
ML is rapidly evolving, with new paradigms reshaping how models
are developed, trained, and deployed. The field is experiencing signif-
icant innovation driven by advancements in hardware, software, and
algorithmic techniques. These developments are enabling machine
learning to be applied in diverse settings, from large-scale cloud in-
frastructures to edge devices and even tiny, resource-constrained envi-
ronments.
Modern machine learning systems span a spectrum of deployment
options, each with its own set of characteristics and use cases. At one
end, we have cloud-based ML, which leverages powerful centralized
computing resources for complex, data-intensive tasks. Moving along
the spectrum, we encounter edge ML, which brings computation closer
to the data source for reduced latency and improved privacy. At the
far end, we find TinyML, which enables machine learning on extremely
low-power devices with severe memory and processing constraints.
This chapter explores the landscape of contemporary machine learn-
ing systems, covering three key approaches: Cloud ML, Edge ML, and
TinyML. Figure 2.2 illustrates the spectrum of distributed intelligence
across these approaches, providing a visual comparison of their charac-
teristics. We will examine the unique characteristics, advantages, and
challenges of each approach, as depicted in the figure. Additionally,
we will discuss the emerging trends and technologies that are shap-
ing the future of machine learning deployment, considering how they
might influence the balance between these three paradigms.
Each of these paradigms has its own strengths and is suited to dif-
ferent use cases:
2.2 Cloud ML
Cloud ML leverages powerful servers in the cloud for training and run-
ning large, complex ML models and relies on internet connectivity. Fig-
ure 2.4 provides an overview of Cloud ML’s capabilities which we will
discuss in greater detail throughout this section.
2.2.1 Characteristics
Definition of Cloud ML
Cloud Machine Learning (Cloud ML) is a subfield of machine learn-
ing that leverages the power and scalability of cloud computing infras-
tructure to develop, train, and deploy machine learning models. By uti-
lizing the vast computational resources available in the cloud, Cloud
ML enables the efficient handling of large-scale datasets and complex
machine learning algorithms.
Centralized Infrastructure
One of the key characteristics of Cloud ML is its centralized infras-
tructure. Figure 2.5 illustrates this concept with an example from
Google’s Cloud TPU data center. Cloud service providers offer a
virtual platform that consists of high-capacity servers, expansive
storage solutions, and robust networking architectures, all housed
in data centers distributed across the globe. As shown in the figure,
these centralized facilities can be massive in scale, housing rows
upon rows of specialized hardware. This centralized setup allows for
the pooling and efficient management of computational resources,
making it easier to scale machine learning projects as needed.
Scalable Data Processing and Model Training
Cloud ML excels in its ability to process and analyze massive volumes of data.
2.2.2 Benefits
Cloud ML offers several significant benefits that make it a powerful
choice for machine learning projects:
Immense Computational Power
One of the key advantages of Cloud ML is its ability to provide vast
computational resources. The cloud infrastructure is designed to han-
dle complex algorithms and process large datasets efficiently. This is
particularly beneficial for machine learning models that require signif-
icant computational power, such as deep learning networks or models
trained on massive datasets. By leveraging the cloud’s computational
capabilities, organizations can overcome the limitations of local hard-
ware setups and scale their machine learning projects to meet demand-
ing requirements.
Dynamic Scalability
Cloud ML offers dynamic scalability, allowing organizations to eas-
ily adapt to changing computational needs. As the volume of data
grows or the complexity of machine learning models increases, the
cloud infrastructure can seamlessly scale up or down to accommodate
these changes. This flexibility ensures consistent performance and en-
ables organizations to handle varying workloads without the need for
extensive hardware investments. With Cloud ML, resources can be al-
located on-demand, providing a cost-effective and efficient solution for
managing machine learning projects.
Access to Advanced Tools and Algorithms
Cloud ML platforms provide access to a wide range of advanced
tools and algorithms specifically designed for machine learning.
These tools often include pre-built libraries, frameworks, and APIs
that simplify the development and deployment of machine learning
models. Developers can leverage these resources to accelerate the
building, training, and optimization of sophisticated models. By
utilizing the latest advancements in machine learning algorithms and
techniques, organizations can stay at the forefront of innovation and
achieve better results in their machine learning projects.
Collaborative Environment
Cloud ML fosters a collaborative environment that enables teams
to work together seamlessly. The centralized nature of the cloud
infrastructure allows multiple users to access and contribute to the
same machine learning projects simultaneously. This collaborative
approach facilitates knowledge sharing, promotes cross-functional
2.2.3 Challenges
While Cloud ML offers numerous benefits, it also comes with certain
challenges that organizations need to consider:
Latency Issues
One of the main challenges of Cloud ML is the potential for latency
issues, especially in applications that require real-time responses.
Since data needs to be sent from the data source to centralized cloud
servers for processing and then back to the application, there can be
delays introduced by network transmission. This latency can be a
significant drawback in time-sensitive scenarios, such as autonomous
vehicles, real-time fraud detection, or industrial control systems,
where immediate decision-making is critical. Developers need to care-
fully design their systems to minimize latency and ensure acceptable
response times.
Data Privacy and Security Concerns
Centralizing data processing and storage in the cloud can raise con-
cerns about data privacy and security. When sensitive data is trans-
mitted and stored in remote data centers, it becomes vulnerable to potential
breaches and unauthorized access.
2.3 Edge ML
2.3.1 Characteristics
Definition of Edge ML
Edge Machine Learning (Edge ML) runs machine learning algo-
rithms directly on endpoint devices or closer to where the data is
generated rather than relying on centralized cloud servers. This
approach brings computation closer to the data source, reducing the
need to send large volumes of data over networks, often resulting
in lower latency and improved data privacy. Figure 2.6 provides an
overview of this section.
Decentralized Data Processing
In Edge ML, data processing happens in a decentralized fashion, as
illustrated in Figure 2.7. Instead of sending data to remote servers, the
data is processed locally on devices like smartphones, tablets, or Inter-
net of Things (IoT) devices. The figure showcases various examples of
these edge devices, including wearables, industrial sensors, and smart
home appliances. This local processing allows devices to make quick
decisions based on the data they collect without relying heavily on a
central server.
2.3.2 Benefits
Reduced Latency
One of Edge ML’s main advantages is the significant latency reduc-
tion compared to Cloud ML. This reduced latency can be a critical ben-
efit in situations where milliseconds count, such as in autonomous ve-
hicles, where quick decision-making can mean the difference between
safety and an accident.
Enhanced Data Privacy
Edge ML also offers improved data privacy, as data is primarily
stored and processed locally. This minimizes the risk of data breaches
that are more common in centralized data storage solutions. Sensitive
information can be kept more secure, as it’s not sent over networks
that could be intercepted.
Lower Bandwidth Usage
Operating closer to the data source means less data must be sent over
networks, reducing bandwidth usage. This can result in cost savings
and efficiency gains, especially in environments where bandwidth is
limited or costly.
2.3.3 Challenges
Limited Computational Resources Compared to Cloud ML
However, Edge ML has its challenges. One of the main concerns
is the limited computational resources compared to cloud-based so-
lutions. Endpoint devices may have a different processing power or
storage capacity than cloud servers, limiting the complexity of the ma-
chine learning models that can be deployed.
Complexity in Managing Edge Nodes
Managing a network of edge nodes can introduce complexity, espe-
cially regarding coordination, updates, and maintenance. Ensuring all
nodes operate seamlessly and are up-to-date with the latest algorithms
and security protocols can be a logistical challenge.
Security Concerns at the Edge Nodes
While Edge ML offers enhanced data privacy, edge nodes can some-
times be more vulnerable to physical and cyber-attacks. Developing
robust security protocols that protect data at each node without com-
promising the system's efficiency remains a significant challenge in de-
ploying Edge ML solutions.
2.4 Tiny ML
2.4.1 Characteristics
Definition of TinyML
TinyML sits at the crossroads of embedded systems and machine
learning, representing a burgeoning field that brings smart algorithms
directly to tiny microcontrollers and sensors. These microcontrollers
operate under severe resource constraints, particularly regarding mem-
ory, storage, and computational power. Figure 2.8 encapsulates the
key aspects of TinyML discussed in this section.
On-Device Machine Learning
2.4.2 Benefits
Extremely Low Latency
One of the standout benefits of TinyML is its ability to offer ultra-low
latency. Since computation occurs directly on the device, the time re-
quired to send data to external servers and receive a response is elim-
inated. This is crucial in applications requiring immediate decision-
making, enabling quick responses to changing conditions.
High Data Security
TinyML inherently enhances data security. Because data processing
and analysis happen on the device, the risk of data interception dur-
ing transmission is virtually eliminated. This localized approach to
data management ensures that sensitive information stays on the de-
vice, strengthening user data security.
Energy Efficiency
TinyML operates within an energy-efficient framework, a necessity
given its resource-constrained environments. By employing lean algo-
rithms and optimized computational methods, TinyML ensures that
devices can execute complex tasks without rapidly depleting battery
life, making it a sustainable option for long-term deployments.
2.4.3 Challenges
Limited Computational Capabilities
However, the shift to TinyML comes with its set of hurdles. The pri-
mary limitation is the devices’ constrained computational capabilities.
The need to operate within such limits means that deployed models
must be simplified, which could affect the accuracy and sophistication
of the solutions.
Complex Development Cycle
TinyML also introduces a complicated development cycle. Craft-
ing lightweight and effective models demands a deep understanding
of machine learning principles and expertise in embedded systems.
This complexity calls for a collaborative development approach, where
multi-domain expertise is essential for success.
Model Optimization and Compression
A central challenge in TinyML is model optimization and compres-
sion. Creating machine learning models that can operate effectively
within the limited memory and computational power of microcon-
trollers requires innovative approaches to model design. Developers
often face the challenge of striking a delicate balance and optimiz-
ing models to maintain effectiveness while fitting within stringent
resource constraints.
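As one concrete example of such optimization, post-training quantization shrinks a trained model by storing weights in lower precision. The sketch below uses TensorFlow Lite and assumes `model` is an already-trained Keras model; it is a minimal illustration, not a complete microcontroller deployment flow.

```python
# Minimal post-training quantization sketch with TensorFlow Lite.
# `model` is assumed to be a trained tf.keras.Model.
import tensorflow as tf

def quantize_for_tinyml(model):
    """Convert a Keras model into a quantized TensorFlow Lite flatbuffer."""
    converter = tf.lite.TFLiteConverter.from_keras_model(model)
    converter.optimizations = [tf.lite.Optimize.DEFAULT]  # enable default quantization
    return converter.convert()

# tflite_bytes = quantize_for_tinyml(model)
# with open("model.tflite", "wb") as f:
#     f.write(tflite_bytes)  # typically kilobytes rather than megabytes
```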
2.5 Comparison
Let’s bring together the different ML variants we’ve explored individu-
ally for a comprehensive view. Figure 2.10 illustrates the relationships
and overlaps between Cloud ML, Edge ML, and TinyML using a Venn
diagram. This visual representation effectively highlights the unique
characteristics of each approach while also showing areas of common-
ality. Each ML paradigm has its own distinct features, but there are
also intersections where these approaches share certain attributes or
capabilities. This diagram helps us understand how these variants re-
late to each other in the broader landscape of machine learning imple-
mentations.
Table 2.1: Comparison of feature aspects across Cloud ML, Edge ML,
and TinyML.
Aspect               | Cloud ML                                              | Edge ML                                            | TinyML
Processing Location  | Centralized servers (data centers)                    | Local devices (closer to data sources)             | On-device (microcontrollers, embedded systems)
Latency              | High (depends on internet connectivity)               | Moderate (reduced latency compared to Cloud ML)    | Low (immediate processing without network delay)
Data Privacy         | Moderate (data transmitted over networks)             | High (data remains on local networks)              | Very High (data processed on-device, not transmitted)
Computational Power  | High (utilizes powerful data center infrastructure)   | Moderate (utilizes local device capabilities)      | Low (limited to the power of the embedded system)
Energy Consumption   | High (data centers consume significant energy)        | Moderate (less than data centers, more than TinyML)| Low (highly energy-efficient, designed for low power)
Scalability          | High (easy to scale with additional server resources) | Moderate (depends on local device capabilities)    | Low (limited by the hardware resources of the device)
Cost                 | High (recurring costs for server usage, maintenance)  | Variable (depends on the complexity of local setup)| Low (primarily upfront costs for hardware components)
Connectivity         | High (requires stable internet connectivity)          | Low (can operate with intermittent connectivity)   | Very Low (can operate without any network connectivity)
Real-time Processing | Moderate (can be affected by network latency)         | High (capable of real-time processing locally)     | Very High (immediate processing with minimal latency)
Application Examples | Big Data Analysis, Virtual Assistants                 | Autonomous Vehicles, Smart Homes                   | Wearables, Sensor Networks
2.6 Conclusion
In this chapter, we’ve offered a panoramic view of the evolving
landscape of machine learning, covering cloud, edge, and tiny ML
paradigms. Cloud-based machine learning leverages the immense
computational resources of cloud platforms to enable powerful and
accurate models but comes with limitations, including latency and
privacy concerns. Edge ML mitigates these limitations by bringing
inference directly to edge devices, offering lower latency and reduced
connectivity needs. TinyML takes this further by miniaturizing
ML models to run directly on highly resource-constrained devices,
opening up a new category of intelligent applications.
Each approach has its tradeoffs, including model complexity,
latency, privacy, and hardware costs. Over time, we anticipate
converging these embedded ML approaches, with cloud pre-training
facilitating more sophisticated edge and tiny ML implementations.
Advances like federated learning and on-device learning will enable
embedded devices to refine their models by learning from real-world
data.
The embedded ML landscape is rapidly evolving and poised to en-
able intelligent applications across a broad spectrum of devices and
use cases. This chapter serves as a snapshot of the current state of em-
bedded ML. As algorithms, hardware, and connectivity continue to
improve, we can expect embedded devices of all sizes to become in-
creasingly capable, unlocking transformative new applications for ar-
tificial intelligence.
2.7 Resources
Here is a curated list of resources to support students and instructors
in their learning and teaching journeys. We are continuously working
on expanding this collection and will be adding new exercises soon.
Slides
• Embedded ML software.
• Embedded Inference.
• TinyML on Microcontrollers.
Videos
• Coming soon.
Exercises
• Coming soon.
Chapter 3
DL Primer
Learning Objectives
3.1 Overview
3.1.1 Definition and Importance
Deep learning, a specialized area within machine learning and artifi-
cial intelligence (AI), utilizes algorithms modeled after the structure
and function of the human brain, known as artificial neural networks.
This field is a foundational element in AI, driving progress in diverse
sectors such as computer vision, natural language processing, and
self-driving vehicles. Its significance in embedded AI systems is
highlighted by its capability to handle intricate calculations and
predictions, optimizing the limited resources in embedded settings.
Figure 3.2 provides a visual representation of how deep learning fits
within the broader context of AI and machine learning. The diagram
illustrates the chronological development and relative segmentation of
these three interconnected fields, showcasing deep learning as a spe-
cialized subset of machine learning, which in turn is a subset of AI.
As shown in the figure, AI represents the overarching field, encom-
passing all computational methods that mimic human cognitive func-
tions. Machine learning, shown as a subset of AI, includes algorithms
capable of learning from data. Deep learning, the smallest subset in
the diagram, specifically involves neural networks that are able to learn
more complex patterns from large volumes of data.
3.2.1 Perceptrons
The Perceptron is the basic unit or node that forms the foundation for
more complex structures. It functions by taking multiple inputs, each
representing a feature of the object under analysis, such as the charac-
teristics of a home for predicting its price or the attributes of a song to
forecast its popularity in music streaming services. These inputs are
denoted as 𝑥1 , 𝑥2 , ..., 𝑥𝑛 . A perceptron can be configured to perform
either regression or classification tasks. For regression, the actual nu-
merical output 𝑦 ̂ is used. For classification, the output depends on
$$z = \sum_i (x_i \cdot w_{ij})$$
To this intermediate calculation, a bias term 𝑏 is added, allowing the
model to better fit the data by shifting the linear output function up
or down. Thus, the intermediate linear combination computed by the
perceptron including the bias becomes:
$$z = \sum_i (x_i \cdot w_{ij}) + b$$
Figure 3.6 illustrates an example where data exhibit a nonlinear pattern that could not be adequately modeled with a linear approach. The
activation function, such as sigmoid, tanh, or ReLU, transforms the
linear input sum into a non-linear output. The primary objective of
this function is to introduce non-linearity into the model, enabling it
to learn and perform more sophisticated tasks. Thus, the final output
of the perceptron, including the activation function, can be expressed
as:
$$\hat{y} = \sigma(z)$$
The forward pass is the initial phase where data moves through the
network from the input to the output layer, as illustrated in Figure 3.8.
At the start of training, the network’s weights are randomly initial-
ized, setting the initial conditions for learning. During the forward
pass, each layer performs specific computations on the input data us-
ing these weights and biases, and the results are then passed to the
subsequent layer. The final output of this phase is the network’s pre-
diction. This prediction is compared to the actual target values present
in the dataset to calculate the loss, which can be thought of as the dif-
ference between the predicted outputs and the target values. The loss
quantifies the network’s performance at this stage, providing a crucial
metric for the subsequent adjustment of weights during the backward
pass.
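To make the forward pass concrete, here is a minimal PyTorch sketch; the layer sizes, random data, and mean-squared-error loss are illustrative assumptions, not details from the text:
import torch
import torch.nn as nn

# Hypothetical two-layer network with randomly initialized weights
model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 1))

x = torch.randn(16, 4)        # a batch of 16 inputs with 4 features each
y_true = torch.randn(16, 1)   # the corresponding target values

y_pred = model(x)                              # forward pass: data flows from input to output
loss = nn.functional.mse_loss(y_pred, y_true)  # loss compares predictions to targets
print(loss.item())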
After completing the forward pass and computing the loss, which mea-
sures how far the model’s predictions deviate from the actual target
values, the next step is to improve the model’s performance by adjust-
ing the network’s weights. Since we cannot control the inputs to the
model, adjusting the weights becomes our primary method for refin-
ing the model.
We determine how to adjust the weights of our model through a
key algorithm called backpropagation. Backpropagation uses the cal-
culated loss to determine the gradient of each weight. These gradients
describe the direction and magnitude in which the weights should be
adjusted. By tuning the weights based on these gradients, the model gradually reduces the loss and improves its predictions.
https://www.youtube.com/watch?v=IHZwWFHWa-w&list=PLZHQObOWTQDNU6R1_67000Dx_ZCJB-3pi&index=2
Important 3: Backpropagation
https://www.youtube.com/watch?v=Ilg3gGewQ5U&list=PLZHQObOWTQDNU6R1_67000Dx_ZCJB-3pi&index=3
• Image Classification: Discover how to build a network to understand the famous MNIST handwritten digit dataset.
• Real-world medical diagnosis: Use deep learning to tackle the important task of breast cancer classification.
CNNs are mainly used in image and video recognition tasks. This ar-
chitecture consists of two main parts: the convolutional base and the
fully connected layers. In the convolutional base, convolutional layers
filter input data to identify features like edges, corners, and textures.
Following each convolutional layer, a pooling layer can be applied to
reduce the spatial dimensions of the data, thereby decreasing compu-
tational load and concentrating the extracted features. Unlike MLPs,
which treat input features as flat, independent entities, CNNs main-
tain the spatial relationships between pixels, making them particularly
effective for image and video data. The extracted features from the con-
volutional base are then passed into the fully connected layers, similar
to those used in MLPs, which perform classification based on the fea-
tures extracted by the convolution layers. CNNs have proven highly
effective in image recognition, object detection, and other computer vi-
sion applications.
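As an illustration, a small CNN with this convolutional-base-plus-classifier shape can be sketched in PyTorch; the channel counts, kernel sizes, and ten-class output are arbitrary assumptions:
import torch
import torch.nn as nn

# Convolutional base: filters extract features, pooling shrinks spatial dimensions
conv_base = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
)
# Fully connected layers classify using the extracted features
classifier = nn.Sequential(nn.Flatten(), nn.Linear(32 * 8 * 8, 10))

x = torch.randn(1, 3, 32, 32)        # one 32x32 RGB image
logits = classifier(conv_base(x))    # output shape: (1, 10)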
Video 4 explains how neural networks work using handwritten digit
recognition as an example application. It also touches on the math
underlying neural nets.
https://www.youtube.com/embed/aircAruvnKk?si=ZRj8jf4yx7ZMe8EK
CNNs are crucial for image and video recognition tasks, where real-
time processing is often needed. They can be optimized for embedded
systems using techniques like quantization and pruning to minimize
memory usage and computational demands, enabling efficient object
detection and facial recognition functionalities in devices with limited
computational resources.
RNNs are suitable for sequential data analysis, like time series fore-
casting and natural language processing. In this architecture, connec-
tions between nodes form a directed graph along a temporal sequence,
allowing information to be carried across sequences through hidden
state vectors. Variants of RNNs include Long Short-Term Memory
(LSTM) and Gated Recurrent Units (GRU), designed to capture longer
dependencies in sequence data.
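For instance, an LSTM that carries a hidden state across time steps can be sketched as follows; the sequence length, feature size, and hidden size are made-up values:
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=8, hidden_size=16, batch_first=True)
x = torch.randn(4, 20, 8)        # 4 sequences, 20 time steps, 8 features each
outputs, (h_n, c_n) = lstm(x)    # the hidden state carries information across the sequence
print(outputs.shape)             # torch.Size([4, 20, 16])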
These networks can be used in voice recognition systems, predictive
maintenance, or IoT devices where sequential data patterns are com-
mon. Optimizations specific to embedded platforms can assist in man-
aging their typically high computational and memory requirements.
3.2.4.5 Autoencoders
Autoencoders are neural networks for data compression and noise re-
duction (Bank, Koenigstein, and Giryes 2023). They are structured to
encode input data into a lower-dimensional representation and then
decode it back to its original form. Variants like Variational Autoen-
coders (VAEs) introduce probabilistic layers that allow for generative
properties, finding applications in image generation and anomaly de-
tection.
Using autoencoders can help in efficient data transmission and stor-
age, improving the overall performance of embedded systems with
limited computational and memory resources.
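A bare-bones autoencoder that compresses inputs into a lower-dimensional code and reconstructs them might look like the sketch below; the 784-dimensional input and 32-dimensional code are illustrative assumptions:
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Linear(784, 32), nn.ReLU())     # encode into a 32-dimensional code
decoder = nn.Sequential(nn.Linear(32, 784), nn.Sigmoid())  # decode back to the original size

x = torch.rand(8, 784)                             # a batch of flattened inputs
reconstruction = decoder(encoder(x))
loss = nn.functional.mse_loss(reconstruction, x)   # reconstruction error to minimize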
As shown in the figure, deep learning models can process raw data
directly and automatically extract relevant features, while traditional
machine learning often requires manual feature engineering. The fig-
ure also illustrates how deep learning models can handle more com-
plex tasks and larger datasets compared to traditional machine learn-
ing approaches.
To further highlight the differences, Table 3.1 provides a more de-
tailed comparison of the contrasting characteristics between traditional
ML and deep learning. This table complements the visual representa-
tion in Figure 3.9 by offering specific points of comparison across vari-
ous aspects of these two approaches.
3.2.6.5 Interpretability
3.3 Conclusion
Deep learning has become a potent set of techniques for addressing in-
tricate pattern recognition and prediction challenges. Starting with an
overview, we outlined the fundamental concepts and principles gov-
erning deep learning, laying the groundwork for more advanced stud-
ies.
Central to deep learning, we explored the basic ideas of neural net-
works, powerful computational models inspired by the human brain’s
interconnected neuron structure. This exploration allowed us to ap-
preciate neural networks’ capabilities and potential in creating sophis-
ticated algorithms capable of learning and adapting from data.
Understanding the role of libraries and frameworks was a key part
of our discussion. We offered insights into the tools that can facilitate
developing and deploying deep learning models. These resources ease
the implementation of neural networks and open avenues for innova-
tion and optimization.
Next, we tackled the challenges one might face when embedding
deep learning algorithms within embedded systems, providing a crit-
ical perspective on the complexities and considerations of bringing AI
to edge devices.
Furthermore, we examined deep learning’s limitations. Through
discussions, we unraveled the challenges faced in deep learning ap-
plications and outlined scenarios where traditional machine learning
might outperform deep learning. These sections are crucial for foster-
ing a balanced view of deep learning’s capabilities and limitations.
In this primer, we have equipped you with the knowledge to make
informed choices between deploying traditional machine learning or
deep learning techniques, depending on the unique demands and con-
straints of a specific problem.
As we conclude this chapter, we hope you are now well-equipped
with the basic “language” of deep learning and prepared to go deeper
into the subsequent chapters with a solid understanding and critical
perspective. The journey ahead is filled with exciting opportunities
and challenges in embedding AI within systems.
3.4 Resources
Here is a curated list of resources to support students and instructors
in their learning and teaching journeys. We are continuously working
on expanding this collection and will be adding new exercises soon.
Slides
• Intro to Convolutions.
Videos
• Video 4
• Video 2
• Video 3
Exercises
• Exercise 2
• Exercise 3
Chapter 4
AI Workflow
Learning Objectives
4.1 Overview
Because the nature of machine learning models depends on the data they con-
sume, the models are unique and vary with different applications, ne-
cessitating extensive experimentation. Machine learning researchers
and engineers drive this experimental phase through continuous test-
ing, validation, and iteration to achieve optimal performance.
The deployment phase often requires specialized hardware and in-
frastructure, as machine learning models can be resource-intensive,
demanding high computational power and efficient resource manage-
ment. This necessitates collaboration with hardware engineers to en-
sure that the infrastructure can support the computational demands
of model training and inference.
As models make decisions that can impact individuals and society,
ethical and legal aspects of machine learning are becoming increas-
ingly important. Ethicists and legal advisors are needed to ensure com-
pliance with ethical standards and legal regulations.
Understanding the various roles involved in an ML project is crucial
for its successful completion. Table 4.1 provides a general overview
of these typical roles, although it’s important to note that the lines be-
tween them can sometimes blur. Let’s examine this breakdown:
• Operations and Maintenance Personnel: Monitor and maintain the deployed system.
• Security Specialists: Ensure system security.
4.4 Conclusion
This chapter has laid the foundation for understanding the machine
learning workflow, a structured approach crucial for the development,
deployment, and maintenance of ML models. We explored the unique
challenges faced in ML workflows, where resource optimization, real-
time processing, data management, and hardware-software integra-
tion are paramount. These distinctions underscore the importance of
tailoring workflows to meet the specific demands of the application
environment.
Moreover, we emphasized the significance of multidisciplinary col-
laboration in ML projects. By examining the diverse roles involved,
from data scientists to software engineers, we gained an overview of
the teamwork necessary to navigate the experimental and resource-
intensive nature of ML development. This understanding is crucial
for fostering effective communication and collaboration across differ-
ent domains of expertise.
As we move forward to more detailed discussions in subsequent
chapters, this high-level overview equips us with a holistic perspective
on the ML workflow and the various roles involved. This foundation
will prove important as we dive into specific aspects of machine learn-
ing, which will allow us to contextualize advanced concepts within the
broader framework of ML development and deployment.
4.5 Resources
Here is a curated list of resources to support students and instructors
in their learning and teaching journeys. We are continuously working
on expanding this collection and will be adding new exercises soon.
Slides
• ML Workflow.
• ML Lifecycle.
Videos
• Coming soon.
Exercises
• Coming soon.
Chapter 5
Data Engineering
Data is the lifeblood of AI systems. Without good data, even the most
advanced machine-learning algorithms will not succeed. However,
TinyML models operate on devices with limited processing power and
memory. This section explores the intricacies of building high-quality
datasets to fuel our AI models. Data engineering involves collecting,
storing, processing, and managing data to train machine learning mod-
els.
Learning Objectives
5.1 Overview
Imagine a world where AI can diagnose diseases with unprecedented
accuracy, but only if the data used to train it is unbiased and reliable.
This is where data engineering comes in. While over 90% of the world’s
data has been created in the past two decades, this vast amount of infor-
mation only becomes useful for building effective AI models after proper processing and preparation. Data engineering bridges this gap by trans-
forming raw data into a high-quality format that fuels AI innovation.
In today’s data-driven world, protecting user privacy is paramount.
Whether mandated by law or driven by user concerns, anonymization
techniques like differential privacy and aggregation are vital in mitigat-
ing privacy risks. However, careful implementation is crucial to ensure
these methods don’t compromise data utility. Dataset creators face
complex privacy and representation challenges when building high-
quality training data, especially for sensitive domains like healthcare.
Legally, creators may need to remove direct identifiers like names and
ages. Even without legal obligations, removing such information can
help build user trust. However, excessive anonymization can compro-
mise dataset utility. Techniques like differential privacy, aggregation,
and reducing detail provide alternatives to balance privacy and utility
but have downsides. Creators must strike a thoughtful balance based
on the use case.
While privacy is paramount, ensuring fair and robust AI models re-
quires addressing representation gaps in the data. It is crucial yet insufficient to ensure diversity across individual variables like gender, race, and accent; gaps can also arise in combinations of these variables. Such combinations, sometimes called higher-order gaps, can significantly impact model performance. For example, a medical dataset could have balanced gender, age, and diagnosis data individually, yet lack enough cases capturing older women with a specific condition. Such higher-order gaps are not immediately obvious but
can critically impact model performance.
Creating useful, ethical training data requires holistic consid-
eration of privacy risks and representation gaps. Elusive perfect
solutions necessitate conscientious data engineering practices like
anonymization, aggregation, under-sampling of overrepresented
groups, and synthesized data generation to balance competing needs.
This facilitates models that are both accurate and socially responsible.
Cross-functional collaboration and external audits can also strengthen
training data. The challenges are multifaceted but surmountable with
thoughtful effort.
We begin by discussing data collection: Where do we source data,
and how do we gather it? Options range from scraping the web, access-
ing APIs, and utilizing sensors and IoT devices to conducting surveys
and gathering user input. These methods reflect real-world practices.
Next, we dive into data labeling, including considerations for human
involvement. We’ll discuss the trade-offs and limitations of human
labeling and explore emerging methods for automated labeling. Fol-
lowing that, we’ll address data cleaning and preprocessing, a crucial
yet frequently undervalued step in preparing raw data for AI model
training. Data augmentation comes next, a strategy for enhancing lim-
ited datasets by generating synthetic samples. This is particularly per-
tinent for embedded systems, as many use cases lack extensive data repositories readily available for curation. Synthetic data generation
emerges as a viable alternative with advantages and disadvantages.
We’ll also touch upon dataset versioning, emphasizing the importance
of tracking data modifications over time. Data is ever-evolving; hence,
it’s imperative to devise strategies for managing and storing expansive
datasets. By the end of this section, you'll possess a comprehensive understanding of these data engineering practices.
Web scraping can effectively gather large datasets for training ma-
chine learning models, particularly when human-labeled data is scarce.
For computer vision research, web scraping enables the collection of
massive volumes of images and videos. Researchers have used this
technique to build influential datasets like ImageNet and OpenImages.
For example, one could scrape e-commerce sites to amass product pho-
tos for object recognition or social media platforms to collect user up-
loads for facial analysis. Even before ImageNet, Stanford’s LabelMe
project scraped Flickr for over 63,000 annotated images covering hun-
dreds of object categories.
Beyond computer vision, web scraping supports gathering textual
data for natural language tasks. Researchers can scrape news sites for
sentiment analysis data, forums and review sites for dialogue systems
research, or social media for topic modeling. For example, the training
data for chatbot ChatGPT was obtained by scraping much of the public
Internet. GitHub repositories were scraped to train GitHub’s Copilot
AI coding assistant.
Web scraping can also collect structured data, such as stock prices,
weather data, or product information, for analytical applications. Once
data is scraped, it is essential to store it in a structured manner, often
using databases or data warehouses. Proper data management ensures
the usability of the scraped data for future analysis and applications.
However, while web scraping offers numerous advantages, there are
significant limitations and ethical considerations to bear. Not all web-
sites permit scraping, and violating these restrictions can lead to le-
gal repercussions. Scraping copyrighted material or private communi-
cations is also unethical and potentially illegal. Ethical web scraping
mandates adherence to a website’s ‘robots.txt’ file, which outlines the
sections of the site that can be accessed and scraped by automated bots.
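As a rough illustration of these practices, the sketch below consults a site's robots.txt before fetching a page and pauses between requests; the URL, user agent, and delay are placeholders, not recommendations from the text:
import time
import urllib.robotparser
import requests

rp = urllib.robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")   # hypothetical target site
rp.read()

url = "https://example.com/products"
if rp.can_fetch("my-research-bot", url):        # respect the site's crawling rules
    response = requests.get(url, headers={"User-Agent": "my-research-bot"})
    time.sleep(2)                                # simple rate limiting between requests
else:
    print("Scraping disallowed by robots.txt")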
To deter automated scraping, many websites implement rate limits.
If a bot sends too many requests in a short period, it might be tem-
porarily blocked, restricting the speed of data access. Additionally,
the dynamic nature of web content means that data scraped at differ-
ent intervals may be inconsistent, posing challenges for lon-
gitudinal studies. However, there are emerging trends like Web Navi-
gation where machine learning algorithms can automatically navigate
the website to access the dynamic content.
The volume of pertinent data available for scraping might be limited
for niche subjects. For example, while scraping for common topics like
images of cats and dogs might yield abundant data, searching for rare
medical conditions might be less fruitful. Moreover, the data obtained
through scraping is often unstructured and noisy, necessitating thor-
ough preprocessing and cleaning. It is crucial to understand that not all scraped data is directly usable without such cleanup.
5.3.3 Crowdsourcing
Crowdsourcing for datasets is the practice of obtaining data using the
services of many people, either from a specific community or the gen-
eral public, typically via the Internet. Instead of relying on a small
team or specific organization to collect or label data, crowdsourcing
leverages the collective effort of a vast, distributed group of partici-
pants. Services like Amazon Mechanical Turk enable the distribution
of annotation tasks to a large, diverse workforce. This facilitates the
collection of labels for complex tasks like sentiment analysis or image
recognition requiring human judgment.
Crowdsourcing has emerged as an effective approach for data col-
lection and problem-solving. One major advantage of crowdsourcing
is scalability—by distributing tasks to a large, global pool of contrib-
utors on digital platforms, projects can process huge volumes of data
quickly. This makes crowdsourcing ideal for large-scale data labeling,
collection, and analysis.
In addition, crowdsourcing taps into a diverse group of participants,
bringing a wide range of perspectives, cultural insights, and language
abilities that can enrich data and enhance creative problem-solving in
ways that a more homogenous group may not. Because crowdsourc-
ing draws from a large audience beyond traditional channels, it is more
cost-effective than conventional methods, especially for simpler micro-
tasks.
Crowdsourcing platforms also allow for great flexibility, as task pa-
rameters can be adjusted in real time based on initial results. This cre-
ates a feedback loop for iterative improvements to the data collection
process. Complex jobs can be broken down into microtasks and dis-
tributed to multiple people, with results cross-validated by assigning
redundant versions of the same task. When thoughtfully managed,
crowdsourcing enables community engagement around a collabora-
tive project, where participants find reward in contributing.
However, while crowdsourcing offers numerous advantages, it’s es-
sential to approach it with a clear strategy. While it provides access to a
diverse set of annotators, it also introduces variability in the quality of
annotations. Additionally, platforms like Mechanical Turk might not
always capture a complete demographic spectrum; often, tech-savvy
individuals are overrepresented, while children and older people may be underrepresented.
• Scale: databases handle small to large volumes of data; data warehouses handle large volumes of integrated data; data lakes handle large volumes of diverse data.
• Examples: databases (MySQL); data warehouses (Google BigQuery, Amazon Redshift, Microsoft Azure Synapse); data lakes (Google Cloud Storage, AWS S3, Azure Data Lake Storage).
to safeguard data quality and monitor its utilization and related risks.
dataset is intended for academic study and business uses in areas like
keyword identification and speech-based search. It is openly licensed
under Creative Commons Attribution 4.0 for broad usage.
Here are some examples of how AI-assisted annotation has been pro-
posed to be useful:
With data version control in place, we can track the changes shown
in Figure 5.14, reproduce previous results by reverting to older ver-
sions, and collaborate safely by branching off and isolating the changes.
Popular Data Version Control Systems
DVC: Short for Data Version Control, it is an open-source, lightweight tool that works on top of Git and supports all
5.10 Licensing
Many high-quality datasets either come from proprietary sources or
contain copyrighted information. This introduces licensing as a chal-
lenging legal domain. Companies eager to train ML systems must en-
gage in negotiations to obtain licenses that grant legal access to these
datasets. Furthermore, licensing terms can impose restrictions on data
applications and sharing methods. Failure to comply with these li-
censes can have severe consequences.
For instance, ImageNet, one of the most extensively utilized datasets
for computer vision research, is a case in point. Most of its images
were procured from public online sources without explicit permission,
sparking ethical concerns (Prabhu and Birhane, 2020). Accessing the
ImageNet dataset for corporations requires registration and adherence
to its terms of use, which restricts commercial usage (ImageNet, 2021).
Major players like Google and Microsoft invest significantly in licens-
ing datasets to improve their ML vision systems. However, the cost fac-
tor restricts accessibility for researchers from smaller companies with
constrained budgets.
The legal domain of data licensing has seen major cases that help
define fair use parameters. A prominent example is Authors Guild,
Inc. v. Google, Inc. This 2005 lawsuit alleged that Google’s book scan-
ning project infringed copyrights by displaying snippets without per-
mission. However, the courts ultimately ruled in Google’s favor, up-
holding fair use based on the transformative nature of creating a search-
able index and showing limited text excerpts. This precedent provides
some legal grounds for arguing fair use protections apply to indexing
datasets and generating representative samples for machine learning.
However, license restrictions remain binding, so a comprehensive anal-
ysis of licensing terms is critical. The case demonstrates why negotia-
tions with data providers are important to enable legal usage within
acceptable bounds.
New Data Regulations and Their Implications
New data regulations also impact licensing practices. The legislative
landscape is evolving with regulations like the EU’s Artificial Intelli-
gence Act, which is poised to regulate AI system development and use
within the European Union (EU). This legislation:
removing it from the original dataset may not fully eliminate its impact
on the model’s behavior. New research is needed around the effects of
data removal on already-trained models and whether full retraining
is necessary to avoid retaining artifacts of deleted data. This presents
an important consideration when balancing data licensing obligations
with efÏciency and practicality in an evolving, deployed ML system.
Dataset licensing is a multifaceted domain that intersects tech-
nology, ethics, and law. Understanding these intricacies becomes
paramount for anyone building datasets during data engineering as
the world evolves.
5.11 Conclusion
Data is the fundamental building block of AI systems. Without qual-
ity data, even the most advanced machine learning algorithms will
fail. Data engineering encompasses the end-to-end process of collect-
ing, storing, processing, and managing data to fuel the development of
machine learning models. It begins with clearly defining the core prob-
lem and objectives, which guides effective data collection. Data can be
sourced from diverse means, including existing datasets, web scrap-
ing, crowdsourcing, and synthetic data generation. Each approach in-
volves tradeoffs between cost, speed, privacy, and specificity. Once
data is collected, thoughtful labeling through manual or AI-assisted an-
notation enables the creation of high-quality training datasets. Proper
storage in databases, warehouses, or lakes facilitates easy access and
analysis. Metadata provides contextual details about the data. Data
processing transforms raw data into a clean, consistent format for ma-
chine learning model development. Throughout this pipeline, trans-
parency through documentation and provenance tracking is crucial for
ethics, auditability, and reproducibility. Data licensing protocols also
govern legal data access and use. Key challenges in data engineering
include privacy risks, representation gaps, legal restrictions around
proprietary data, and the need to balance competing constraints like
speed versus quality. By thoughtfully engineering high-quality train-
ing data, machine learning practitioners can develop accurate, robust,
and responsible AI systems, including embedded and TinyML appli-
cations.
5.12 Resources
Here is a curated list of resources to support students and instructors
in their learning and teaching journeys. We are continuously working
on expanding this collection and will be adding new exercises soon.
Slides
• Feature engineering.
• Data Standards: Speech Commands.
• Crowdsourcing Data for the Long Tail.
Videos
• Coming soon.
Exercises
• Exercise 4
• Exercise 5
• Exercise 6
• Exercise 7
• Exercise 8
Chapter 6
AI Frameworks
Learning Objectives
6.1 Overview
Machine learning frameworks provide the tools and infrastructure
to efficiently build, train, and deploy machine learning models. In
this chapter, we will explore the evolution and key capabilities of
major frameworks like TensorFlow (TF), PyTorch, and specialized
frameworks for embedded devices. We will dive into the components
like computational graphs, optimization algorithms, and hardware acceleration.
6.3.1 TF Ecosystem
1. TensorFlow Core: primary package that most developers engage
with. It provides a comprehensive, flexible platform for defining,
training, and deploying machine learning models. It includes
tf.keras as its high-level API.
versions can address some of these concerns, they may still be limited
in resource-constrained environments.
Tensors offer a flexible structure that can represent data in higher di-
mensions. Figure 6.6 illustrates how this concept applies to image data.
As shown in the figure, images are not represented by just one matrix
of pixel values. Instead, they typically have three channels, where each
channel is a matrix containing pixel values that represent the intensity
of red, green, or blue. Together, these channels create a colored image.
Without tensors, storing all this information from multiple matrices
can be complex. However, as Figure 6.6 illustrates, tensors make it
easy to contain image data in a single 3-dimensional structure, with
each number representing a certain color value at a specific location in
the image.
You don’t have to stop there. If we wanted to store a series of im-
ages, we could use a 4-dimensional tensor, where the new dimension
represents different images. This means you are storing multiple im-
ages, each having three matrices that represent the three color channels.
This gives you an idea of the usefulness of tensors when dealing with
multi-dimensional data efficiently.
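A short sketch makes these dimensions concrete; the image resolution and batch size are arbitrary choices for illustration:
import torch

image = torch.rand(3, 224, 224)      # one RGB image: 3 channels of 224x224 pixel values
batch = torch.rand(32, 3, 224, 224)  # a 4-dimensional tensor: 32 such images stacked together
print(image.ndim, batch.ndim)        # 3 4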
Tensors also have a unique attribute that enables frameworks to auto-
matically compute gradients, simplifying the implementation of com-
plex models and optimization algorithms. In machine learning, as discussed earlier, training relies on repeatedly computing such gradients. For example, for $y = x^2$:

$$\frac{dy}{dx} = 2x$$

When $x = 2$:

$$\frac{dy}{dx} = 2 \times 2 = 4$$

The gradient of $y$ with respect to $x$, at $x = 2$, is 4.
CHAPTER 6. AI FRAMEWORKS 137
6.4.2 PyTorch
A minimal sketch of this computation (assuming the same example, y = x² evaluated at x = 2):
import torch
x = torch.tensor(2.0, requires_grad=True)
y = x ** 2
y.backward()   # autograd computes dy/dx
print(x.grad)
# Output
tensor(4.0)
6.4.3 TensorFlow
A corresponding sketch in TensorFlow (same assumed example):
import tensorflow as tf
x = tf.Variable(2.0)
with tf.GradientTape() as tape:
    y = x ** 2
print(tape.gradient(y, x))   # dy/dx at x = 2
# Output
tf.Tensor(4.0, shape=(), dtype=float32)
• Layers contain states like weights and biases. Tensors are state-
less, just holding data.
So, while tensors are a core data structure that layers consume and
produce, layers have additional functionality for defining parameter-
ized operations and training. While a layer configures tensor opera-
tions under the hood, the layer remains distinct from the tensor ob-
jects. The layer abstraction makes building and training neural net-
works much more intuitive. This abstraction enables developers to
build models by stacking these layers together without implementing
the layer logic. For example, calling tf.keras.layers.Conv2D in Ten-
sorFlow creates a convolutional layer. The framework handles comput-
ing the convolutions, managing parameters, etc. This simplifies model
development, allowing developers to focus on architecture rather than
low-level implementations. Layer abstractions use highly optimized
implementations for performance. They also enable portability, as the
same architecture can run on different hardware backends like GPUs
and TPUs.
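For example, a developer can stack layer abstractions without writing any of the underlying tensor math; a minimal Keras sketch with made-up layer sizes looks like this:
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(16, 3, activation="relu", input_shape=(32, 32, 3)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(10),   # the framework manages this layer's weights and biases
])
model.summary()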
In addition, computational graphs include activation functions
like ReLU, sigmoid, and tanh that are essential to neural networks,
and many frameworks provide these as standard abstractions. These
functions introduce non-linearities that enable models to approximate
complex functions. Frameworks provide these as simple, predefined
operations that can be used when constructing models, for example,
tf.nn.relu in TensorFlow. This abstraction enables flexibility, as de-
velopers can easily swap activation functions for tuning performance.
Predefined activations are also optimized by the framework for faster
execution.
In recent years, models like ResNets and MobileNets have emerged
as popular architectures, with current frameworks pre-packaging
these as computational graphs. Rather than worrying about the fine
details, developers can use them as a starting point, customizing
as needed by substituting layers. This simplifies and speeds up
model development, avoiding reinventing architectures from scratch.
Predefined models include well-tested, optimized implementations
that ensure good performance. Their modular design also enables
transferring learned features to new tasks via transfer learning. These
predefined architectures provide high-performance building blocks
to create robust models quickly.
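As a brief sketch of this idea, a pre-packaged architecture can be loaded and adapted to a new task in a few lines; this example assumes Keras' bundled MobileNetV2 weights and a hypothetical ten-class problem:
import tensorflow as tf

# Reuse the pre-trained convolutional base and attach a new classification head
base = tf.keras.applications.MobileNetV2(input_shape=(224, 224, 3),
                                         include_top=False, weights="imagenet")
base.trainable = False            # freeze the learned features for transfer learning
model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(10),    # new task-specific output layer
])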
These layer abstractions, activation functions, and predefined
architectures the frameworks provide constitute a computa-
tional graph. When a user defines a layer in a framework (e.g.,
tf.keras.layers.Dense()), the framework configures computa-
tional graph nodes and edges to represent that layer. The layer
parameters like weights and biases become variables in the graph.
The layer computations become operation nodes (such as the x and
models were defined in a separate context, and then a session was cre-
ated to run them. The benefit of static graphs is they allow more aggres-
sive optimization since the framework can see the full graph. However,
it also tends to be less flexible for research and interactivity. Changes
to the graph require re-declaring the full model.
For example, TensorFlow 1.x declared the graph ahead of time using placeholders:
x = tf.placeholder(tf.float32)
y = tf.matmul(x, weights) + biases
By contrast, dynamic (define-by-run) frameworks like PyTorch build the graph as the code executes:
x = torch.randn(4, 784)
y = torch.matmul(x, weights) + biases
Dynamic (define-by-run) execution graphs:
• Pros: intuitive imperative style like Python code; graph building can be interleaved with execution; graphs are easy to modify; debugging fits seamlessly into the workflow.
• Cons: harder to optimize without the full graph; possible slowdowns from building the graph during execution; can require more memory.
At the core of these pipelines are data loaders, which handle reading
training examples from sources like files, databases, and object storage.
Data loaders facilitate efficient data loading and preprocessing, crucial
for deep learning models. For instance, TensorFlow’s tf.data dataload-
ing pipeline is designed to manage this process. Depending on the
application, deep learning models require diverse data formats such
as CSV files or image folders. Some popular formats include:
• AUC-ROC - Area under ROC curve. They are used for classifica-
tion threshold analysis.
• Confusion Matrix - Matrix that shows the true positives, true neg-
atives, false positives, and false negatives. Provides a more de-
tailed view of classification performance.
6.4.9.2 Visualization
• Loss curves - Plot training and validation loss over time to spot
overfitting (see the sketch below).
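A minimal matplotlib sketch of such a plot is shown below; the per-epoch loss values are placeholders standing in for a real training history:
import matplotlib.pyplot as plt

train_loss = [0.9, 0.6, 0.4, 0.3, 0.25]   # hypothetical training loss per epoch
val_loss = [0.95, 0.7, 0.55, 0.5, 0.52]   # validation loss turning upward hints at overfitting

plt.plot(train_loss, label="training loss")
plt.plot(val_loss, label="validation loss")
plt.xlabel("epoch")
plt.ylabel("loss")
plt.legend()
plt.show()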
larger grid sizes, leading to many errors. This leads to automatic dif-
ferentiation, which exploits the primitive functions that computers use
to represent operations to obtain an exact derivative. With automatic
differentiation, the computational complexity of computing the gra-
dient is proportional to computing the function itself. Intricacies of
automatic differentiation are not dealt with by end users now, but re-
sources to learn more are widely available. Today's
automatic differentiation and differentiable programming are ubiqui-
tous and are done efficiently and automatically by modern machine
learning frameworks.
ing frameworks (primarily TensorFlow here since the TPU directly in-
tegrates with it) are not supported by TPUs. They also cannot support
custom operations from the machine learning frameworks, and the net-
work design must closely align with the hardware capabilities.
Today, NVIDIA GPUs dominate training, aided by software libraries
like CUDA, cuDNN, and TensorRT. Frameworks also include optimiza-
tions to maximize performance on these hardware types, such as prun-
ing unimportant connections and fusing layers. Combining these tech-
niques with hardware acceleration provides greater efficiency. For in-
ference, hardware is increasingly moving towards optimized ASICs
and SoCs. Google’s TPUs accelerate models in data centers, while Ap-
ple, Qualcomm, the NVIDIA Jetson family, and others now produce
AI-focused mobile chips.
6.6.1 Cloud
Cloud-based AI frameworks assume access to ample computational
power, memory, and storage resources in the cloud. They generally
support both training and inference. Cloud-based AI frameworks are
suited for applications where data can be sent to the cloud for process-
ing, such as cloud-based AI services, large-scale data analytics, and
web applications. Popular cloud AI frameworks include the ones we
mentioned earlier, such as TensorFlow, PyTorch, MXNet, Keras, etc.
These frameworks utilize GPUs, TPUs, distributed training, and Au-
toML to deliver scalable AI. Concepts like model serving, MLOps, and
AIOps relate to the operationalization of AI in the cloud. Cloud AI
powers services like Google Cloud AI and enables transfer learning
using pre-trained models.
6.6.2 Edge
Edge AI frameworks are tailored to deploy AI models on IoT devices,
smartphones, and edge servers. Edge AI frameworks are optimized
for devices with moderate computational resources, balancing power
and performance. Edge AI frameworks are ideal for applications re-
quiring real-time or near-real-time processing, including robotics, au-
tonomous vehicles, and smart devices. Key edge AI frameworks in-
clude TensorFlow Lite, PyTorch Mobile, CoreML, and others. They
6.6.3 Embedded
TinyML frameworks are specialized for deploying AI models on
extremely resource-constrained devices, specifically microcontrollers
and sensors within the IoT ecosystem. TinyML frameworks are
designed for devices with limited resources, emphasizing minimal
memory and power consumption. TinyML frameworks are special-
ized for use cases on resource-constrained IoT devices for predictive
maintenance, gesture recognition, and environmental monitoring ap-
plications. Major TinyML frameworks include TensorFlow Lite Micro,
uTensor, and ARM NN. They optimize complex models to fit within
kilobytes of memory through techniques like quantization-aware
training and reduced precision. TinyML allows intelligent sensing
across battery-powered devices, enabling collaborative learning via
federated learning. The choice of framework involves balancing model
performance and computational constraints of the target platform,
whether cloud, edge, or TinyML. Table 6.3 compares the major AI
frameworks across cloud, edge, and TinyML environments:
Table 6.3: Comparison of framework types for Cloud AI, Edge AI, and TinyML.
• Cloud AI: Examples include TensorFlow, PyTorch, MXNet, and Keras. Key technologies: GPUs, TPUs, distributed training, AutoML, MLOps. Use cases: cloud services, web apps, big data analytics.
• Edge AI: Examples include TensorFlow Lite, PyTorch Mobile, and Core ML. Key technologies: model optimization, compression, quantization, efficient NN architectures. Use cases: mobile apps, autonomous systems, real-time processing.
• TinyML: Examples include TensorFlow Lite Micro, uTensor, and ARM NN. Key technologies: quantization-aware training, reduced precision, neural architecture search. Use cases: IoT sensors, wearables, predictive maintenance, gesture recognition.
Key differences:
1. Model Size: Modern AI models are often far too large to fit on embedded and IoT
devices. This necessitates model compression techniques, such
as quantization, pruning, and knowledge distillation. Addition-
ally, as we will see, many of the frameworks used by developers
for AI development have large amounts of overhead and built-in
libraries that embedded systems can’t support.
6.7.3 Challenges
While embedded systems present an enormous opportunity for de-
ploying machine learning to enable intelligent capabilities at the edge,
these resource-constrained environments pose significant challenges.
Unlike typical cloud or desktop environments rich with computa-
tional resources, embedded devices introduce severe constraints
around memory, processing power, energy efficiency, and specialized
hardware. As a result, existing machine learning techniques and
frameworks designed for server clusters with abundant resources do
not directly translate to embedded systems. This section uncovers
some of the challenges and opportunities for embedded systems and
ML frameworks.
6.7.3.7 Summary
6.8 Examples
Machine learning deployment on microcontrollers and other em-
bedded devices often requires specially optimized software libraries
and frameworks to work within tight memory, compute, and power
constraints. Several options exist for performing inference on such
resource-limited hardware, each with its approach to optimizing
model execution. This section will explore the key characteristics and
design principles behind TFLite Micro, TinyEngine, and CMSIS-NN,
providing insight into how each framework tackles the complex
problem of high-accuracy yet efficient neural network execution
on microcontrollers. It will also showcase different approaches for
implementing efficient TinyML frameworks.
Table 6.4 summarizes the key differences and similarities between
these three specialized machine-learning inference frameworks for em-
bedded systems and microcontrollers.
6.8.1 Interpreter
TensorFlow Lite Micro (TFLM) is a machine learning inference framework designed to run on resource-constrained microcontrollers.
6.8.2 Compiler-based
TinyEngine is an ML inference framework designed specifically for
resource-constrained microcontrollers. It employs several optimiza-
tions to enable high-accuracy neural network execution within the tight memory and compute budgets of these devices.
6.8.3 Library
CMSIS-NN, standing for Cortex Microcontroller Software Interface
Standard for Neural Networks, is a software library devised by
ARM. It offers a standardized interface for deploying neural network
inference on microcontrollers and embedded systems, focusing on
optimization for ARM Cortex-M processors (Lai, Suda, and Chandra
2018a).
Neural Network Kernels: CMSIS-NN has highly efficient kernels
that handle fundamental neural network operations such as convolu-
tion, pooling, fully connected layers, and activation functions. It caters
to a broad range of neural network models by supporting floating and
fixed-point arithmetic. The latter is especially beneficial for resource-
constrained devices as it curtails memory and computational require-
ments (Quantization).
Hardware Acceleration: CMSIS-NN harnesses the power of Single
Instruction, Multiple Data (SIMD) instructions available on many
Cortex-M processors. This allows for parallel processing of multiple
data elements within a single instruction, thereby boosting com-
putational efficiency. Certain Cortex-M processors feature Digital
Signal Processing (DSP) extensions that CMSIS-NN can exploit for
accelerated neural network execution. The library also incorporates
assembly-level optimizations tailored to specific microcontroller
architectures to improve performance further.
Standardized API: CMSIS-NN offers a consistent and abstracted
API that protects developers from the complexities of low-level hard-
ware details. This makes the integration of neural network models
into applications simpler. It may also encompass tools or utilities for
converting popular neural network model formats into a format that
is compatible with CMSIS-NN.
Memory Management: CMSIS-NN provides functions for efficient
memory allocation and management, which is vital in embedded sys-
tems where memory resources are scarce. It ensures optimal memory
usage during inference and, in some instances, allows in-place opera-
tions to decrease memory overhead.
Portability: CMSIS-NN is designed for portability across vari-
ous Cortex-M processors. This enables developers to write code
that can operate on different microcontrollers without significant
modifications.
Low Latency: CMSIS-NN minimizes inference latency, making it an
ideal choice for real-time applications where swift decision-making is
paramount.
Energy Efficiency: The library is designed with a focus on energy efficiency, making it suitable for battery-powered and energy-constrained devices.
6.9.1 Model
Figure 6.14 illustrates the key differences between TensorFlow variants,
particularly in terms of supported operations (ops) and features. Ten-
sorFlow supports significantly more operations than TensorFlow Lite
and TensorFlow Lite Micro, as it is typically used for research or cloud
deployment, which require a large number of and more flexibility with
operators.
The figure clearly demonstrates this difference in op support across
the frameworks. TensorFlow Lite supports select ops for on-device
training, whereas TensorFlow Micro does not. Additionally, the
figure shows that TensorFlow Lite supports dynamic shapes and
6.9. Choosing the Right Framework 168
6.9.2 Software
As shown in Figure 6.15, TensorFlow Lite Micro does not have OS sup-
port, while TensorFlow and TensorFlow Lite do. This design choice for
TensorFlow Lite Micro helps reduce memory overhead, make startup
times faster, and consume less energy. Instead, TensorFlow Lite Micro
can be used in conjunction with real-time operating systems (RTOS)
like FreeRTOS, Zephyr, and Mbed OS.
The figure also highlights an important memory management fea-
ture: TensorFlow Lite and TensorFlow Lite Micro support model mem-
ory mapping, allowing models to be directly accessed from flash stor-
age rather than loaded into RAM. In contrast, TensorFlow does not
offer this capability.
6.9.3 Hardware
TensorFlow Lite and TensorFlow Lite Micro have significantly smaller
base binary sizes and memory footprints than TensorFlow (see
6.9.4.1 Performance
6.9.4.2 Scalability
that cater to TinyML needs. TensorFlow Lite Micro is the most popular
and has the most community support.
This has led to vertical (i.e., between abstraction levels) and hori-
zontal (i.e., library-driven vs. compilation-driven approaches to ten-
6.11 Conclusion
In summary, selecting the optimal machine learning framework re-
quires a thorough evaluation of various options against criteria such
as usability, community support, performance, hardware compatibil-
ity, and model conversion capabilities. There is no one-size-fits-all so-
lution, as the right framework depends on specific constraints and use
cases.
We first introduced the necessity of machine learning frameworks
like TensorFlow and PyTorch. These frameworks offer features such
as tensors for handling multi-dimensional data, computational graphs
for defining and optimizing model operations, and a suite of tools in-
cluding loss functions, optimizers, and data loaders that streamline
model development.
6.12 Resources
Here is a curated list of resources to support students and instructors
in their learning and teaching journeys. We are continuously working
on expanding this collection and will add new exercises soon.
Slides
• Frameworks overview.
• Embedded systems software.
• Inference engines: TF vs. TFLite.
• TF flavors: TF vs. TFLite vs. TFLite Micro.
6.12. Resources 174
• TFLite Micro:
– TFLite Micro Big Picture.
– TFLite Micro Interpreter.
– TFLite Micro Model Format.
– TFLite Micro Memory Allocation.
– TFLite Micro NN Operations.
Videos
• Coming soon.
Exercises
• Exercise 9
• Exercise 10
• Exercise 11
Chapter 7
AI Training
Learning Objectives
7.1 Overview
Training is critical for developing accurate and useful AI systems using
machine learning. The training creates a machine learning model that
can generalize to new, unseen data rather than memorizing the training
examples. This is done by feeding training data into algorithms that
learn patterns from these examples by adjusting internal parameters.
The algorithms minimize a loss function, which compares their pre-
dictions on the training data to the known labels or solutions, guiding
the learning. Effective training often requires high-quality, representative training data.
We will walk you through these details in the rest of the sections. Un-
derstanding how to effectively leverage data, algorithms, parameter
optimization, and generalization through thorough training is essen-
tial for developing capable, deployable AI systems that work robustly
in the real world.
work reliably minimizes the loss, indicating it has learned the patterns
in the data.
How is this process defined mathematically? Formally, neural net-
works are mathematical models that consist of alternating linear and
nonlinear operations, parameterized by a set of learnable weights that
are trained to minimize some loss function. This loss function mea-
sures how good our model is concerning fitting our training data, and
it produces a numerical value when evaluated on our model against
the training data. Training neural networks involves repeatedly evalu-
ating the loss function on many different data points to measure how
good our model is, then continuously tweaking the weights of our
model using backpropagation so that the loss decreases, ultimately op-
timizing the model to fit our data.
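Put together, this loop of evaluating the loss and tweaking the weights with backpropagation can be sketched in a few lines of PyTorch; the model, data, learning rate, and epoch count below are illustrative assumptions:
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
x, y = torch.randn(64, 10), torch.randn(64, 1)     # stand-in training data

for epoch in range(100):
    y_pred = model(x)                              # evaluate the model on the training data
    loss = nn.functional.mse_loss(y_pred, y)       # measure how well it fits
    optimizer.zero_grad()
    loss.backward()                                # backpropagation computes the gradients
    optimizer.step()                               # tweak the weights to decrease the loss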
Where:

$$W_j \in \mathbb{R}^{d_j \times d_{j-1}}$$

$$y = g\left(\sum_{j=1}^{M} w_{jk} A_j\right)$$

Where:

$$L: \mathbb{R}^{d_n} \times \mathbb{R}^{d_n} \longrightarrow \mathbb{R}$$

$$L_{full} = \frac{1}{M} \sum_{j=1}^{M} L(N(x_j; W_1, \ldots, W_n), y_j)$$

$$= \min_{W_1, \ldots, W_n} \frac{1}{M} \sum_{j=1}^{M} L(N(x_j; W_1, \ldots, W_n), y_j)$$

$$W_i := W_i - \lambda \frac{\partial L_{full}}{\partial W_i} \quad \text{for } i = 1 \ldots n$$
7.2.4 Backpropagation
Training neural networks involve repeated applications of the gradient
descent algorithm, which involves computing the derivative of the loss
function with respect to the 𝑊𝑖 s. How do we compute the loss deriva-
tive concerning the 𝑊𝑖 s, given that the 𝑊𝑖 s are nested functions of each
other in a deep neural network? The trick is to leverage the chain rule:
we can compute the derivative of the loss concerning the 𝑊𝑖 s by re-
peatedly applying the chain rule in a complete process known as back-
propagation. Specifically, we can calculate the gradients by computing
the derivative of the loss concerning the outputs of the last layer, then
progressively use this to compute the derivative of the loss concerning
each prior layer to the input layer. This process starts from the end of
the network (the layer closest to the output) and progresses backwards,
and hence gets its name backpropagation.
Let’s break this down. We can compute the derivative of the loss con-
cerning the outputs of each layer of the neural network by using repeated
applications of the chain rule.
Example of Backpropagation

Suppose we have a two-layer neural network

$$L_1 = W_1 A_0$$
$$A_1 = ReLU(L_1)$$
$$L_2 = W_2 A_1$$
$$A_2 = ReLU(L_2)$$
$$NN(x) = \text{Let } A_0 = x \text{, then output } A_2$$

where $W_1 \in \mathbb{R}^{30 \times 100}$ and $W_2 \in \mathbb{R}^{1 \times 30}$. Furthermore, suppose we use the MSE loss function:

$$L(x, y) = (x - y)^2$$

We wish to compute

$$\frac{\partial L(NN(x), y)}{\partial W_i} \quad \text{for } i = 1, 2$$

Note the following:

$$\frac{\partial L(x, y)}{\partial x} = 2 \times (x - y)$$

$$\frac{\partial ReLU(x)}{\partial x} \delta = \begin{cases} 0 & \text{for } x \leq 0 \\ 1 & \text{for } x \geq 0 \end{cases} \odot \delta$$

$$\frac{\partial WA}{\partial A} \delta = W^T \delta$$

$$\frac{\partial WA}{\partial W} \delta = \delta A^T$$

Then we have

$$\frac{\partial L(NN(x), y)}{\partial W_2} = \frac{\partial L_2}{\partial W_2} \frac{\partial A_2}{\partial L_2} \frac{\partial L(NN(x), y)}{\partial A_2} = \left(2(NN(x) - y) \odot ReLU'(L_2)\right) A_1^T$$

and

$$\frac{\partial L(NN(x), y)}{\partial W_1} = \frac{\partial L_1}{\partial W_1} \frac{\partial A_1}{\partial L_1} \frac{\partial L_2}{\partial A_1} \frac{\partial A_2}{\partial L_2} \frac{\partial L(NN(x), y)}{\partial A_2} = \left[ReLU'(L_1) \odot \left(W_2^T \left[2(NN(x) - y) \odot ReLU'(L_2)\right]\right)\right] A_0^T$$
Tip: Double-check your work by making sure that the shapes are correct!
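One way to double-check a derivation like this is to compare it against automatic differentiation. The sketch below mirrors the example's shapes (W1 of size 30x100, W2 of size 1x30) with random placeholder values:
import torch

x = torch.randn(100, 1)                        # A0 = x
y = torch.randn(1, 1)                          # target value
W1 = torch.randn(30, 100, requires_grad=True)
W2 = torch.randn(1, 30, requires_grad=True)

A1 = torch.relu(W1 @ x)                        # L1 = W1 A0, A1 = ReLU(L1)
A2 = torch.relu(W2 @ A1)                       # L2 = W2 A1, A2 = ReLU(L2)
loss = ((A2 - y) ** 2).mean()                  # MSE loss

loss.backward()                                # autograd applies the chain rule for us
print(W1.grad.shape, W2.grad.shape)            # torch.Size([30, 100]) torch.Size([1, 30])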
to train the model parameters. The validation set evaluates the model
during training to tune hyperparameters and prevent overfitting. The
test set provides an unbiased final evaluation of the trained model’s
performance.
Maintaining clear splits between train, validation, and test sets with
representative data is crucial to properly training, tuning, and evalu-
ating models to achieve the best real-world performance. To this end,
we will learn about the common pitfalls or mistakes people make when
creating these data splits.
Table 7.1 compares the differences between training, validation, and
test data splits:
The training set is used to train the model. It is the largest subset, typi-
cally 60-80% of the total data. The model sees and learns from the train-
ing data to make predictions. A sufficiently large and representative
training set is required for the model to learn the underlying patterns
effectively.
The validation set evaluates the model during training, usually after
each epoch. Typically, 20% of the data is allocated for the validation
set. The model does not learn or update its parameters based on the
validation data. It is used to tune hyperparameters and make other
tweaks to improve training. Monitoring metrics like loss and accuracy
on the validation set prevents overfitting on just the training data.
The test set acts as a completely unseen dataset that the model did not
see during training. It is used to provide an unbiased evaluation of the
final trained model. Typically, 20% of the data is reserved for testing.
Maintaining a hold-out test set is vital for obtaining an accurate esti-
mate of how the trained model would perform on real-world unseen
data. Data leakage from the test set must be avoided at all costs.
The relative proportions of the training, validation, and test sets can
vary based on data size and application. However, following the gen-
eral guidelines for a 60/20/20 split is a good starting point. Careful
data splitting ensures models are properly trained, tuned, and evalu-
ated to achieve the best performance.
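One common way to realize such a 60/20/20 split is two successive calls to scikit-learn's train_test_split; the toy dataset below is a placeholder:
import numpy as np
from sklearn.model_selection import train_test_split

X, y = np.random.rand(1000, 5), np.random.randint(0, 2, 1000)   # stand-in dataset

# First carve off 40% of the data, then split that portion evenly into validation and test
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.4, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)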
Video 5 explains how to properly split the dataset into training, val-
idation, and testing sets, ensuring an optimal training process.
https://www.youtube.com/watch?v=1waHlpKiNyY
Allocating too little data to the training set is a common mistake when
splitting data that can severely impact model performance. If the train-
ing set is too small, the model will not have enough samples to effec-
tively learn the true underlying patterns in the data. This leads to high
variance and causes the model to fail to generalize well to new data.
For example, if you train an image classification model to recognize
handwritten digits, providing only 10 or 20 images per digit class
would be completely inadequate. The model would need more
examples to capture the wide variances in writing styles, rotations,
stroke widths, and other variations.
As a rule of thumb, the training set size should be at least hundreds
or thousands of examples for most machine learning algorithms to
work effectively. Due to the large number of parameters, the training
set often needs to be in the tens or hundreds of thousands for deep
neural networks, especially those using convolutional layers.
Insufficient training data typically manifests in symptoms like high
error rates on validation/test sets, low model accuracy, high variance,
and overfitting on small training set samples. Collecting more quality
training data is the solution. Data augmentation techniques can also
help virtually increase the size of training data for images, audio, etc.
Carefully factoring in the model complexity and problem difficulty
when allocating training samples is important to ensure sufficient data
is available for the model to learn successfully. Following guidelines
on minimum training set sizes for different algorithms is also recom-
mended. Sufficient training data underpins the overall success of any machine learning application.
Consider Figure 7.5 where we try to classify/split datapoints into
two categories (here, by color): On the left, overfitting is depicted by a
model that has learned the nuances in the training data too well (either
the dataset was too small or we ran the model for too long), causing it
to follow the noise along with the signal, as indicated by the line’s ex-
cessive curves. The right side shows underfitting, where the model’s
simplicity prevents it from capturing the dataset’s underlying struc-
ture, resulting in a line that does not fit the data well. The center graph
represents an ideal fit, where the model balances well between general-
ization and fitting, capturing the main trend of the data without being
swayed by outliers. Although the model is not a perfect fit (it misses
some points), we care more about its ability to recognize general pat-
terns rather than idiosyncratic outliers.
Figure 7.6 illustrates the process of fitting the data over time. When
training, we search for the “sweet spot” between underfitting and over-
fitting. At first, when the model hasn't had enough time to learn the patterns, it underfits the data.
Important 6: Bias/Variance
https://www.youtube.com/watch?v=SjQyLhQIXSM
real data distribution. However, this can make model selection and
tuning more challenging.
For example, if the validation set only contains 100 samples, the met-
rics calculated will have a high variance. Due to noise, the accuracy
may fluctuate up to 5-10% between epochs. This makes it difficult to
know if a drop in validation accuracy is due to overfitting or natural
variance. With a larger validation set, say 1000 samples, the metrics
will be much more stable.
Additionally, if the validation set is not representative, perhaps miss-
ing certain subclasses, the estimated skill of the model may be inflated.
This could lead to poor hyperparameter choices or premature training
stops. Models selected based on such biased validation sets do not gen-
eralize well to real data.
A good rule of thumb is that the validation set size should be at
least several hundred samples and up to 10-20% of the training set,
while still leaving sufficient samples for training. The splits should
also be stratified, meaning that the class proportions in the validation
set should match those in the full dataset, especially if working with
imbalanced datasets. A larger validation set representing the original
data characteristics is essential for proper model selection and tuning.
The validation set should be used for all parameter tuning, model
selection, early stopping, and similar tasks. It’s important to reserve a
portion, such as 20-30% of the full dataset, solely for the final model
evaluation. This data should not be used for validation, tuning, or
model selection during development.
Failing to keep an unseen hold-out set for final validation risks opti-
mizing results and overlooking potential failures before model release.
Having some fresh data provides a final sanity check on real-world
efficacy. Maintaining the complete separation of training/validation
from the test set is essential to obtain accurate estimates of model per-
formance. Even minor deviations from a single use of the test set could
positively bias results and metrics, providing an overly optimistic view
of real-world efficacy.
When splitting data into training, validation, and test sets, failing to
stratify the splits can result in an uneven representation of the target
classes across the splits and introduce sampling bias. This is especially
problematic for imbalanced datasets.
Stratified splitting involves sampling data points such that the pro-
portion of output classes is approximately preserved in each split. For
example, if performing a 70/30 train-test split on a dataset with 60%
negative and 40% positive samples, stratification ensures ~60% nega-
tive and ~40% positive examples in both training and test sets.
Without stratification, random chance could result in the training
split having 70% positive samples while the test has 30% positive sam-
ples. The model trained on this skewed training distribution will not
generalize well. Class imbalance also compromises model metrics like
accuracy.
Stratification works best when done using labels, though proxies like
clustering can be used for unsupervised learning. It becomes essential
for highly skewed datasets with rare classes that could easily be omit-
ted from splits.
Libraries like Scikit-Learn have stratified splitting methods built into
them. Failing to use them could inadvertently introduce sampling bias
and hurt model performance on minority groups. After performing
the splits, the overall class balance should be examined to ensure even
representation across the splits.
Stratification provides a balanced dataset for both model training
and evaluation. Though simple random splitting is easy, being mindful
of stratification needs, especially for real-world imbalanced data, results
in more robust model development and evaluation.
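To make this concrete, the short sketch below uses scikit-learn’s train_test_split with its stratify argument; the 60/40 label mix and the 70/30 split fraction are hypothetical values chosen to mirror the example above.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 60% negative (0), 40% positive (1)
X = list(range(1000))
y = [0] * 600 + [1] * 400

# Stratified 70/30 split preserves the class proportions in both subsets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)
print(Counter(y_train), Counter(y_test))  # roughly 60/40 in each split
```

After splitting, checking the class counts as in the last line is a quick sanity check that the proportions were in fact preserved.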
A common mistake when splitting data is failing to set aside some por-
tion of the data just for the final evaluation of the completed model.
All of the data is used for training, validation, and test sets during de-
velopment.
This leaves no unseen data to get an unbiased estimate of how the
final tuned model would perform in the real world. The metrics on
the test set used during development may only partially reflect actual
model skills.
For example, choices like early stopping and hyperparameter tun-
ing are often optimized based on test set performance. This couples
the model to the test data. An unseen dataset is needed to break this
coupling and get true real-world metrics.
Best practice is to reserve a portion, such as 20-30% of the full dataset,
solely for final model evaluation. This data should not be used for val-
idation, tuning, or model selection during development.
Saving some unseen data allows for evaluating the completely
trained model as a black box on real-world data. This provides
reliable metrics to decide whether the model is ready for production
deployment.
Failing to keep an unseen hold-out set for final validation risks opti-
mizing results and overlooking potential failures before model release.
Having some fresh data provides a final sanity check on real-world
efficacy.
The validation set is meant to guide the model training process, not
serve as additional training data. Overoptimizing the validation set to
maximize performance metrics treats it more like a secondary training
set, leading to inflated metrics and poor generalization.
7.5.1 Optimizations
Over the years, various optimizations have been proposed to acceler-
ate and improve vanilla SGD. Ruder (2016) gives an excellent overview
of the different optimizers. Briefly, commonly used SGD optimization
techniques include momentum, Nesterov Accelerated Gradient (NAG),
Adagrad, Adadelta, RMSProp, Adam, and AMSGrad, which are com-
pared in Table 7.2.
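As a rough illustration, the sketch below shows how these variants are typically exposed in PyTorch; the learning rates, momentum value, and toy model are assumptions for demonstration, not recommended settings.

```python
import torch

# A toy model purely for illustration; real layer sizes would differ
model = torch.nn.Linear(10, 1)

# SGD variants from Table 7.2, as exposed by PyTorch's optim package
optimizers = {
    "momentum": torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9),
    "nesterov": torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9, nesterov=True),
    "adagrad": torch.optim.Adagrad(model.parameters(), lr=0.01),
    "adadelta": torch.optim.Adadelta(model.parameters()),
    "rmsprop": torch.optim.RMSprop(model.parameters(), lr=0.001),
    "adam": torch.optim.Adam(model.parameters(), lr=0.001),
    "amsgrad": torch.optim.Adam(model.parameters(), lr=0.001, amsgrad=True),
}
```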
7.5.2 Tradeoffs
Table 7.2 is a pros and cons table for some of the main optimization
algorithms for neural network training:
Table 7.2: Comparing the pros and cons of different optimization algorithms.

| Algorithm | Pros | Cons |
|---|---|---|
| Momentum | Faster convergence due to acceleration along gradients; less oscillation than vanilla SGD | Requires tuning of momentum parameter |
| Nesterov Accelerated Gradient (NAG) | Faster than standard momentum in some cases; anticipatory updates prevent overshooting | More complex to understand intuitively |
| Adagrad | Eliminates need to tune learning rates manually; performs well on sparse gradients | Learning rate may decay too quickly on dense gradients |
| Adadelta | Less aggressive learning rate decay than Adagrad | Still sensitive to initial learning rate value |
| RMSProp | Automatically adjusts learning rates; works well in practice | No major downsides |
| Adam | Combination of momentum and adaptive learning rates; efficient and fast convergence | Slightly worse generalization performance in some cases |
| AMSGrad | Improvement to Adam addressing generalization issue | Not as extensively used/tested as Adam |
Practitioners rely on systematic search algorithms to explore the vast
space of possible model configurations. Some of the most prominent
hyperparameter search algorithms include:
• Grid Search: The most basic search method, where you man-
ually define a grid of values to check for each hyperparameter.
For example, checking learning rates = [0.01, 0.1, 1] and
batch sizes = [32, 64, 128]. The key advantage is simplicity,
but it can lead to an exponential explosion in search space, mak-
ing it time-consuming. It’s best suited for fine-tuning a small
number of parameters (a minimal sketch is shown below).
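The snippet below is a minimal sketch of the grid search idea using the example values above; train_and_validate is a hypothetical stand-in for a real training routine and here simply returns a random score.

```python
import itertools
import random

def train_and_validate(lr, batch_size):
    # Placeholder for a real training run; returns a random "validation score"
    return random.random()

learning_rates = [0.01, 0.1, 1]
batch_sizes = [32, 64, 128]

best_score, best_config = float("-inf"), None
for lr, bs in itertools.product(learning_rates, batch_sizes):
    score = train_and_validate(lr=lr, batch_size=bs)
    if score > best_score:
        best_score, best_config = score, (lr, bs)

print("Best configuration:", best_config)
```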
7.6.3.1 BigML
7.6.3.2 TinyML
Important 7: Hyperparameter
https://www.youtube.com/watch?v=AXDByU3D1hA&list=PLkDaE6sCZn6Hn0vK8co82zjQtt3T2Nkqc&index=24
7.7 Regularization
Regularization is a critical technique for improving the performance
and generalizability of machine learning models in applied settings. It
refers to mathematically constraining or penalizing model complexity
to avoid overfitting the training data. Without regularization, complex
ML models are prone to overfitting the dataset and memorizing pecu-
liarities and noise in the training set rather than learning meaningful
patterns. They may achieve high training accuracy but perform poorly
when evaluating new unseen inputs.
Regularization helps address this problem by placing constraints
that favor simpler, more generalizable models that don’t latch onto
sampling errors. Techniques like L1/L2 regularization directly penal-
ize large parameter values during training, forcing the model to use
the smallest parameters that can adequately explain the signal. Early
stopping rules halt training when validation set performance stops im-
proving - before the model starts overfitting.
Appropriate regularization is crucial when deploying models to new
user populations and environments where distribution shifts are likely.
For example, an unregularized fraud detection model trained at a bank
may work initially but accrue technical debt over time as new fraud
patterns emerge.
Regularizing complex neural networks also offers computational
advantages—smaller models require less data augmentation, compute
power, and data storage. Regularization also allows for more efficient
AI systems, where accuracy, robustness, and resource management
are thoughtfully balanced against training set limitations.
Several powerful regularization techniques are commonly used to
improve model generalization. Architecting the optimal strategy re-
quires understanding how each method affects model learning and
complexity.
7.7.1 L1 and L2
Two of the most widely used regularization forms are L1 and L2 reg-
ularization. Both penalize model complexity by adding an extra term
to the cost function optimized during training. This term grows larger
as model parameters increase.
L2 regularization, also known as ridge regression, adds the sum of
squared magnitudes of all parameters multiplied by a coefÏcient α.
This quadratic penalty curtails extreme parameter values more aggres-
sively than L1 techniques. Implementation requires only changing the
cost function and tuning α.
$$R_{L2}(\Theta) = \alpha \sum_{i=1}^{n} \theta_i^2$$

Where:
• $\theta_i$ - the $i$-th model parameter
• $n$ - the number of parameters
• $\alpha$ - the regularization coefficient controlling the strength of the penalty
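A minimal sketch of how this penalty might be added in PyTorch is shown below; the toy model, data, and the value of α are illustrative assumptions, and the weight_decay argument at the end is the framework’s built-in equivalent of the same idea.

```python
import torch

# Toy model and data purely for illustration
model = torch.nn.Linear(10, 1)
inputs, targets = torch.randn(32, 10), torch.randn(32, 1)
alpha = 1e-4  # regularization coefficient (assumed value)

# Data-fit term plus the quadratic penalty R_L2 from the formula above
mse = torch.nn.functional.mse_loss(model(inputs), targets)
l2_penalty = alpha * sum((p ** 2).sum() for p in model.parameters())
loss = mse + l2_penalty

# Many optimizers expose the same idea as "weight decay"
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, weight_decay=alpha)
```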
Important 8: Regularization
https://www.youtube.com/watch?v=6g0t3Phly2M&list=PLkDaE6sCZn6Hn0vK8co82zjQtt3T2Nkqc&index=4
https://www.youtube.com/watch?v=NyG-7nRpsW8&list=PLkDaE6sCZn6Hn0vK8co82zjQtt3T2Nkqc&index=5
7.7.2 Dropout
Another widely adopted regularization method is dropout (Srivastava
et al. 2014). During training, dropout randomly sets a fraction $p$ of
node outputs or hidden activations to zero. This encourages greater
information distribution across more nodes rather than reliance on a
small number of nodes. At prediction time, the full neural network
is used, with intermediate activations scaled by $1 - p$ to maintain out-
put magnitudes. GPU optimizations in frameworks like PyTorch and
TensorFlow make it straightforward to implement dropout efficiently.
$$\tilde{a}_i = r_i \odot a_i$$

Where:
• $a_i$ - output of node $i$
• $\tilde{a}_i$ - output of node $i$ after dropout
• $r_i$ - independent Bernoulli random variable with probability $1 - p$ of being 1
• $\odot$ - elementwise multiplication

At test time, no nodes are dropped; instead, activations are scaled:

$$a_i^{\text{test}} = (1 - p)\, a_i$$

Where:
• $a_i^{\text{test}}$ - node output at test time
• $p$ - the probability of dropping a node.
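For illustration, the sketch below uses PyTorch’s built-in dropout layer with an assumed drop probability of 0.5. Note that PyTorch implements “inverted” dropout, rescaling surviving activations by $1/(1-p)$ during training rather than scaling by $1-p$ at test time; the two conventions have the same effect on expected activations.

```python
import torch

drop = torch.nn.Dropout(p=0.5)  # p is the drop probability
x = torch.ones(4, 8)

drop.train()       # training mode: activations are randomly zeroed
y_train = drop(x)  # survivors are rescaled by 1/(1-p) in this framework

drop.eval()        # evaluation mode: dropout becomes the identity
y_test = drop(x)
```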
https://www.youtube.com/watch?v=ARq74QuavAo&list=PLkDaE6sCZn6Hn0vK8co82zjQtt3T2Nkqc&index=7
https://www.youtube.com/watch?v=BOCLq2gpcGU&list=PLkDaE6sCZn6Hn0vK8co82zjQtt3T2Nkqc&index=8
7.8.1 Sigmoid
The sigmoid activation applies a squashing S-shaped curve tightly
binding the output between 0 and 1. It has the mathematical form:
$$\text{sigmoid}(x) = \frac{1}{1 + e^{-x}}$$
The exponentiation transform allows the function to smoothly transi-
tion from near 0 towards near 1 as the input moves from very negative
to very positive. The monotonic rise covers the full (0,1) range.
7.8.2 Tanh
Tanh or hyperbolic tangent also assumes an S-shape but is zero-
centered, meaning the average output value is 0.
$$\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$$
The numerator/denominator transform shifts the range from (0,1)
in Sigmoid to (-1, 1) in tanh.
Most pros/cons are shared with Sigmoid, but Tanh avoids some out-
put saturation issues by being centered. However, it still suffers from
vanishing gradients with many layers.
7.8.3 ReLU
The Rectified Linear Unit (ReLU) introduces a simple thresholding be-
havior with its mathematical form:

$$\text{ReLU}(x) = \max(0, x)$$
7.8.4 Softmax
The softmax activation function is generally used as the last layer for
classification tasks to normalize the activation value vector so that its
elements sum to 1. This is useful for classification tasks where we want
to learn to predict class-specific probabilities for a particular input.
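For reference, a standard form of the softmax over $K$ classes is:

$$\text{softmax}(x)_i = \frac{e^{x_i}}{\sum_{j=1}^{K} e^{x_j}}$$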
Table 7.4: Comparing the pros and cons of different activation functions.

| Activation | Pros | Cons |
|---|---|---|
| Sigmoid | Smooth gradient for backprop; output bounded between 0 and 1 | Saturation kills gradients; not zero-centered |
| Tanh | Smoother gradient than sigmoid; zero-centered output [-1, 1] | Still suffers vanishing gradient issue |
| ReLU | Computationally efficient; introduces sparsity; avoids vanishing gradients | “Dying ReLU” units; not bounded |
7.9.3 He Initialization
Get your neural network off to a strong start with weight initial-
ization! How you set those initial weights can make or break
your model’s training. Think of it like tuning the instruments in
an orchestra before the concert. In this Colab notebook, you’ll
learn that the right initialization strategy can save time, improve
model performance, and make your deep-learning journey much
smoother.
https://www.youtube.com/watch?v=s2coXdufOzE&list=PLkDaE6sCZn6Hn0vK8co82zjQtt3T2Nkqc&index=11
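As a brief illustration of He initialization, a PyTorch sketch might look like the following; the layer dimensions are arbitrary, and kaiming_normal_ is the framework’s name for the He scheme suited to ReLU activations.

```python
import torch

# A linear layer with hypothetical dimensions, initialized with He/Kaiming init
layer = torch.nn.Linear(256, 128)
torch.nn.init.kaiming_normal_(layer.weight, nonlinearity="relu")
torch.nn.init.zeros_(layer.bias)
```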
the size of the layer dimensions. The total computations across $L$ lay-
ers can be expressed as $\sum_{l=1}^{L-1} O(N^{(l)} \cdot M^{(l-1)})$, where the computation
required for each layer depends on the product of the input and
output dimensions of the matrices being multiplied.
Now, comparing the matrix multiplication to the activation function,
which requires only $O(N) = 1000$ element-wise nonlinearities for $N =
1000$ outputs, we can see that the linear transformations dominate the
computation relative to the activations.
These large matrix multiplications impact hardware choices,
inference latency, and power constraints for real-world neural net-
work applications. For example, a typical DNN layer may require
500,000 multiply-accumulates vs. only 1000 nonlinear activations,
demonstrating a 500x increase in mathematical operations.
When training neural networks, we typically use mini-batch gradi-
ent descent, operating on small batches of data simultaneously. Con-
sidering a batch size of 𝐵 training examples, the input to the matrix
multiplication becomes a 𝑀 × 𝐵 matrix, while the output is an 𝑁 × 𝐵
matrix.
7.10.1.2 Mini-batch
The batch size used during neural network training and inference sig-
nificantly impacts whether matrix multiplication poses more of a com-
putational or memory bottleneck. Concretely, the batch size refers to
the number of samples propagated through the network together in
one forward/backward pass; larger batch sizes translate into larger
matrices in each multiplication.
Specifically, let’s look at the arithmetic intensity of matrix multiplica-
tion during neural network training. This measures the ratio between
computational operations and memory transfers. The matrix multi-
ply of two matrices of size 𝑁 × 𝑀 and 𝑀 × 𝐵 requires 𝑁 × 𝑀 × 𝐵
multiply-accumulate operations, but only transfers of 𝑁 × 𝑀 + 𝑀 × 𝐵
matrix elements.
As we increase the batch size $B$, the number of arithmetic operations
grows faster than the memory transfers. For example, with a batch
size of 1, we need $N \times M$ operations and roughly $N + M$ transfers,
giving a low arithmetic intensity; larger batches raise the ratio of
operations to transfers.
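The toy calculation below, with assumed dimensions $N = M = 1000$, illustrates how the ratio of operations to transfers grows with batch size under the counting used above.

```python
# Illustrative layer dimensions; real networks vary widely
N, M = 1000, 1000

for B in (1, 32, 256):
    macs = N * M * B              # multiply-accumulate operations
    transfers = N * M + M * B     # matrix elements moved, as counted above
    print(B, macs / transfers)    # arithmetic intensity grows with batch size
```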
Modern hardware like CPUs and GPUs is highly optimized for com-
putational throughput rather than memory bandwidth. For example,
high-end H100 Tensor Core GPUs can deliver over 60 TFLOPS of
double-precision performance but only provide up to 3 TB/s of mem-
ory bandwidth. This means there is almost a 20x imbalance between
arithmetic units and memory access; consequently, on hardware like
this, workloads with low arithmetic intensity quickly become memory-
bound rather than compute-bound.
7.11.3 Comparison
To summarize, Table 7.5 demonstrates some of the key characteristics
for comparing data parallelism and model parallelism.
7.12 Conclusion
In this chapter, we have covered the core foundations that enable
effective training of artificial intelligence models. We explored the
mathematical concepts like loss functions, backpropagation, and
gradient descent that make neural network optimization possible.
We also discussed practical techniques around leveraging training
data, regularization, hyperparameter tuning, weight initialization,
and distributed parallelization strategies that improve convergence,
generalization, and scalability.
These methodologies form the bedrock through which the success
of deep learning has been attained over the past decade. Mastering
these fundamentals equips practitioners to architect systems and
refine models tailored to their problem context. However, as models
and datasets grow exponentially, training systems must optimize
across metrics like time, cost, and carbon footprint. Hardware scaling
to warehouse scale enables massive computational throughput, but
optimizations around efficiency and specialization will be key.
Software techniques like compression and sparsity exploitation can
amplify hardware gains. We will discuss several of these in the
coming chapters.
Overall, the fundamentals covered in this chapter equip practition-
ers to build, refine, and deploy models. However, interdisciplinary
skills spanning theory, systems, and hardware will differentiate ex-
perts who can lift AI to the next level sustainably and responsibly, as
society requires. Understanding efficiency alongside accuracy consti-
tutes the balanced engineering approach needed to train intelligent sys-
tems that integrate smoothly across many real-world contexts.
7.13 Resources
Here is a curated list of resources to support students and instructors
in their learning and teaching journeys. We are continuously working
on expanding this collection and will be adding new exercises soon.
Slides
Videos
• Video 5
• Video 6
• Video 7
• Video 8
• Video 9
• Video 10
• Video 11
• Video 12
Exercises
• Exercise 12
• Exercise 13
• Exercise 14
• Exercise 15
• Exercise 16
Labs
• Coming soon.
Chapter 8
Efficient AI
Learning Objectives
8.1 Overview
Training models can consume significant energy, sometimes equiva-
lent to the carbon footprint of sizable industrial processes. We will
cover some of these sustainability details in the AI Sustainability chap-
ter. On the deployment side, if these models are not optimized for
efficiency, they can quickly drain device batteries, demand excessive
memory, or fall short of real-time processing needs. Through this chap-
ter, we aim to elucidate the nuances of efficiency, setting the ground-
work for a comprehensive exploration in the subsequent chapters.
Knowledge distillation transfers what a large, high-performing
model has learned to a lightweight one. Hence, the smaller model
attains performance close to its larger counterpart but with significantly
fewer parameters. Figure 8.5 demonstrates the tutor-student framework
for knowledge distillation. We will explore knowledge distillation in
more detail in Section 9.2.2.1.
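As a rough sketch of the distillation objective (explored fully in Section 9.2.2.1), the snippet below computes a softened KL-divergence loss between teacher and student outputs; the random logits and the temperature value are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

# Assumed logits from a teacher and a student model on the same batch
teacher_logits = torch.randn(8, 10)
student_logits = torch.randn(8, 10, requires_grad=True)
T = 4.0  # softening temperature (assumed value)

soft_teacher = F.softmax(teacher_logits / T, dim=-1)
log_soft_student = F.log_softmax(student_logits / T, dim=-1)
# KL divergence between softened distributions, scaled by T^2 as is conventional
distill_loss = F.kl_div(log_soft_student, soft_teacher, reduction="batchmean") * T * T
```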
In a floating point representation, the exponent determines the range of
values that can be represented, and the mantissa determines the pre-
cision of the number. The combination of these components allows
floating point numbers to represent a vast range of values with vary-
ing degrees of precision.
Video 13 provides a comprehensive overview of these three main
components - sign, exponent, and mantissa - and how they work to-
gether to represent floating point numbers.
https://youtu.be/gc1Nl3mmCuY?si=nImcymfbE5H392vu
This structure prioritizes range over precision. BF16 has achieved train-
ing results comparable in accuracy to FP32 while using significantly
less memory and computational resources (Kalamkar et al. 2019). This
makes it suitable not just for inference but also for training deep neural
networks.
By retaining the 8-bit exponent of FP32, BF16 offers a similar range,
which is crucial for deep learning tasks where certain operations can
result in very large or very small numbers. At the same time, by trun-
cating precision, BF16 allows for reduced memory and computational
requirements compared to FP32. BF16 has emerged as a promising
middle ground in the landscape of numerical formats for deep learn-
ing, providing an efficient and effective alternative to the more tradi-
tional FP32 and FP16 formats.
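A quick illustration of the memory saving in PyTorch, assuming a tensor of arbitrary size:

```python
import torch

x = torch.randn(1024, 1024)    # default dtype: float32
x_bf16 = x.to(torch.bfloat16)  # same 8-bit exponent range, truncated mantissa

print(x.element_size(), x_bf16.element_size())  # 4 bytes vs. 2 bytes per element
```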
Integer: These are integer representations using 8, 4, and 2 bits. They
are often used during the inference phase of neural networks, where
the weights and activations of the model are quantized to these lower
precisions. Integer representations are deterministic and offer signif-
icant speed and memory advantages over floating-point representa-
tions. For many inference tasks, especially on edge devices, the slight
loss in accuracy due to quantization is often acceptable, given the efÏ-
ciency gains. An extreme form of integer numerics is binary neural
ciency gains. An extreme form of integer numerics is for binary neural
networks (BNNs), where weights and activations are constrained to
one of two values: +1 or -1.
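As a rough sketch of the idea (not a production quantization scheme), the snippet below applies symmetric uniform 8-bit quantization to a toy weight vector using NumPy; the weights are randomly generated for illustration.

```python
import numpy as np

# Toy weight vector; a real model's weights would come from training
w = np.random.randn(512).astype(np.float32)

# Symmetric uniform quantization: map the observed range onto signed 8-bit ints
scale = np.abs(w).max() / 127.0
w_int8 = np.clip(np.round(w / scale), -128, 127).astype(np.int8)

# Dequantize to inspect the approximation error introduced
w_dequant = w_int8.astype(np.float32) * scale
print(np.abs(w - w_dequant).max())
```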
Variable bit widths: Beyond the standard widths, research is on-
going into extremely low bit-width numerics, even down to binary or
ternary representations. Extremely low bit-width operations can offer
significant speedups and further reduce power consumption. While
challenges remain in maintaining model accuracy with such drastic
quantization, advances continue to be made in this area.
Efficient numerics is not just about reducing the bit-width of num-
bers but understanding the trade-offs between accuracy and efficiency.
As machine learning models become more pervasive, especially in real-
world, resource-constrained environments, the focus on efficient nu-
merics will continue to grow. By thoughtfully selecting and leveraging
the appropriate numeric precision, one can achieve robust model per-
formance while optimizing for speed, memory, and energy. Table 8.1
summarizes these trade-offs.
8.8 Conclusion
Efficient AI is crucial as we push towards broader and more diverse
real-world deployment of machine learning. This chapter provided
an overview, exploring the various methodologies and considerations
behind achieving efficient AI, starting with the fundamental need, sim-
ilarities, and differences across Cloud, Edge, and TinyML systems.
8.9 Resources
Here is a curated list of resources to support students and instructors
in their learning and teaching journeys. We are continuously working
on expanding this collection and will add new exercises soon.
Slides
Videos
• Coming soon.
Exercises
• Coming soon.
Chapter 9
Model Optimizations
Learning Objectives
9.1 Overview
The optimization of machine learning models for practical deployment
is a critical aspect of AI systems. This chapter focuses on exploring
model optimization techniques as they relate to the development of ML
systems, ranging from high-level model architecture considerations to
low-level hardware adaptations. Figure 9.2 illustrates the three layers
of the optimization stack we cover.
At the highest level, we examine methodologies for reducing the
complexity of model parameters without compromising inferential ca-
pabilities. Techniques such as pruning and knowledge distillation offer
powerful approaches to compress and refine models while maintain-
ing or even improving their performance, not only in terms of model
quality but also in actual system runtime performance. These meth-
ods are crucial for creating efficient models that can be deployed in
resource-constrained environments.
Furthermore, we explore the role of numerical precision in model
computations. Understanding how different levels of numerical preci-