
AWS Enhanced Prep plan

Contents
AWS Enhanced Prep plan ....................................................................................................... 1
AWS ML Engineer Associate Curriculum Overview ................................................................... 3
Domain 1: Data Processing .................................................................................................... 4
1.1 Collect, Ingest, and Store Data .............................................................................................. 4
1.1.1 COLLECT DATA ................................................................................................................ 4
1.1.2 STORE DATA .................................................................................................................... 9
1.1.3 Data Ingestion ............................................................................................................... 14
1.1.4 Summary ...................................................................................................................... 20
1.2 Transform Data (Data Cleaning, Categorical encoding, Feature Engineering) ........................... 23
1.2.1 Data Cleaning ............................................................................................................... 23
1.2.2 Categorical encoding .................................................................................................... 26
1.2.3 Feature Engineering ...................................................................................................... 27
X. AWS Tools for Data Transformation ....................................................................................... 31
X.1. Data Labeling with AWS .................................................................................................. 31
X.2. Data Ingestion with AWS ................................................................................................. 32
X.3. Data Transformation with AWS ....................................................................................... 33
1.3 Validate Data and Prepare for Modeling .............................................................................. 35
1.3.1 VALIDATE DATA.............................................................................................................. 35
1.3.2 PREPARE FOR MODELLING ........................................................................................... 37
Domain 2: Model Development ............................................................................................. 40
2.1 Choose a modelling approach............................................................................................. 41
2.1.1 AWS Model Approaches ................................................................................................ 41
2.1.1 SageMaker Offerings ..................................................................................................... 41
2.1.1 SageMaker Model types ................................................................................................ 42
2.1.3 SageMaker AutoML......................................................................................... 44
2.1.3 SageMaker JumpStart .................................................................................... 45
2.1.5 Bedrock......................................................................................................................... 47
2.2 Train Models ....................................................................................................................... 48
2.2.1 Model Training Concepts .............................................................................................. 48
2.2.2 Compute Environment .................................................................................................. 50
2.2.3 Train a model ................................................................................................................ 51
2.3 Refine Models...................................................................................................................... 56
2.3.1 Evaluating Model Performance ..................................................................................... 56
2.3.2 Model Fit (Overfitting and Underfitting) ......................................................................... 57
2.3.3 Hyperparameter Tuning ................................................................................................ 61
2.3.4 Managing Model Size .................................................................................................... 64
2.3.5 Refining Pre-trained models ......................................................................................... 65
2.3.6 Model Versioning .......................................................................................................... 67
2.4 Analyze Model Performance................................................................................................ 68
2.4.1 Model Evaluation .......................................................................................................... 68
Domain 3: Select a Deployment Infrastructure ......................................................................72
3.1 Select a Deployment Infrastructure .................................................................................... 73
3.1.1 Model building & Deployment Infra ............................................................................... 73
3.1.2 Inference Infrastructure................................................................................................ 75
3.2 Create and Script Infrastructure ......................................................................................... 79
3.2.1 Methods for Provisioning Resources............................................................................. 80
3.2.2 Deploying and Hosting Models ..................................................................................... 85
3.3 Automate Deployment ........................................................................................................ 91
3.3.1 Introduction to DevOps ................................................................................................. 91
3.3.2 CI/CD: Applying DevOps to MLOps ............................................................................... 92
3.3.3 AWS Software Release Processes ................................................................................ 95
3.3.4 Retraining models ......................................................................................................... 99
Domain 4: Monitor Model .................................................................................................... 101
4.1 Monitor Model Performance and Data Quality .................................................................. 102
4.1.1 Monitoring Machine Learning Solutions ...................................................................... 102
4.1.2 Remediating Problems Identified by Monitoring ......................................................... 111
4.2 Monitor and Optimize Infrastructure and Costs ................................................................ 112
4.2.1 Monitor Infrastructure................................................................................................. 112
4.2.2 Optimize Infrastructure ............................................................................................... 114
4.2.3 Optimize Costs ............................................................................................ 115
4.3 Secure AWS ML Resources ................................................................................................ 116
4.3.1 Securing ML Resources............................................................................................... 116
4.3.3 SageMaker Compliance & Governance ...................................................................... 120
4.3.3 Security Best Practices for CI/CD Pipelines ............................................................... 122
4.3.4 Implement Security & Compliance w/ Monitoring, Logging and Auditing ................... 123
Domain X: Misc .................................................................................................................. 124
X.1 SageMaker Deep Dive ......................................................................................................... 125
X.1.1 Fully Managed Notebook Instances with Amazon SageMaker .................................... 125
X.1.2 SageMaker Built-in Algorithms ................................................................................... 126
X.1.3 SageMaker Training types ........................................................................................... 127
X.1.4 Train Your ML Models with Amazon SageMaker .......................................................... 128
X.1.5 Tuning Your ML Models with Amazon SageMaker........................................................ 129
X.1.6 Add Debugger to Training Jobs in Amazon SageMaker ................................................ 130
X.1.7 Deployment using SageMaker .................................................................................... 131

AWS ML Engineer Associate Curriculum Overview


▪ Define quantifiable success criteria of an ML model

▪ Learn components of SageMaker Studio

• Jupyter notebook, schedule job to run Jupyter notebook

• Canvas

• Data (Prepare with Wrangler, Feature Store, EMR Cluster)

• Jobs (training, model evaluation)

• Pipeline

• Model/Model Registry

• Jumpstart

• Deployment or CI/CD (Inference recommender, Endpoints or Projects)

▪ Additional Things to know: Algorithms - Supervised vs Unsupervised, LLM, Responsible AI/ML, SageMaker Documentation

SageMaker Documentation:

• Feature Store

• AutoML

• Studio

• Jupyter notebook
Domain 1: Data Processing

1.1 Collect, Ingest, and Store Data


1.1.1 COLLECT DATA
High-Performing data
• REPRESENTATIVE

o Best practice: When building an ML model, it's important to feed it high-quality data that accurately reflects the real world. For example, if 20 percent of customers typically cancel memberships after a year, the data should represent that churn rate. Otherwise, the model could falsely predict that significantly more or fewer customers will cancel.

o Watch for: If your data doesn't actually reflect the real-world scenarios that you want your model to handle, it will be difficult to identify meaningful patterns and make accurate predictions.

• RELEVANT

o Best practice: Data should contain relevant attributes that expose patterns related to what you want to predict, such as membership duration for predicting cancellation rate.

o Watch for: If irrelevant information is mixed in with useful data, it can impact the model's ability to focus on what really matters. For example, a list of customer emails in a dataset that's supposed to predict membership cancellation can negatively impact the model's performance.

• Feature Rich

o Best practice: Data should include a complete set of features that can help the model learn underlying patterns. You can identify additional trends or patterns to increase accuracy by including as much relevant data as possible.

o Watch for: Data that has limited features reduces the ability of the ML algorithm to accurately predict customer churn. For example, if the data consists of a small set of customer details and omits important data, like demographic information, it will lose accuracy and miss opportunities for detecting patterns in cancellation rate.

• Consistent

o Best practice: Data must be consistent when it comes to attributes, such as features and formatting. Consistent data provides more accurate and reliable results.

o Watch for: If the datasets come from various data sources that contain different formatting methods or metadata, the inconsistencies will impact the algorithm's ability to effectively process the data. The algorithm will be less accurate with the inconsistent data.
Types of Data
Text

Text data, such as documents and website content, is converted to numbers for use in ML models, especially for
natural language processing (NLP) tasks like sentiment analysis. Models use this numerical representation of
text to analyze the data.

Tabular

Tabular data refers to information that is organized into a table structure with rows and columns, such as the data
in spreadsheets and databases. Tabular data is ideal for linear regression models.

Time series

Time-series data is collected over time with an inherent ordering that is associated with data points. It can be
associated with sensor, weather, or financial data, such as stock prices. It is frequently used to detect trends. For
instance, you might analyze and forecast changes using ML models to make predictions based on historical data
patterns.

Image

Image data refers to the actual pixel values that make up a digital image. It is the raw data that represents the
colors and intensities of each pixel in the image. Image data, like data from photos, videos, and medical scans, is
frequently used in machine learning for object recognition, autonomous driving, and image classification.

Formatting data
• Structured
• Unstructured
• Semi-structured
Data formats and file types
1. Row-based data format
o common in relational databases and spreadsheets.
o It shows the relationships between features

Row-based file types

• CSV

Comma-separated values (CSV) files are lightweight, space-efficient text files that represent tabular
data. Each line is a row of data, and the column values are separated by commas. The simple CSV
format can store different data types like text and numbers, which makes it a common choice for ML data. However, the simplicity of CSV comes at the cost of performance and efficiency compared to columnar data formats that are more optimized for analytics.

• Avro RecordIO

Avro RecordIO is a row-based data storage format that stores records sequentially. This sequential
storage benefits ML workloads that need to iterate over the full dataset multiple times during model
training. Additionally, Avro RecordIO defines a schema that structures the data. This schema improves
data processing speeds and provides better data management compared to schema-less formats.

2. Column-based data format


o In this format, queries extract insights from patterns within a column rather than the entire record, which
results in efficient analysis of trends across large datasets.

Column-based file types

o Parquet

Parquet is a columnar storage format typically used in analytics and data warehouse workloads that
involve large data sets. ML workloads benefit from columnar storage because data can be compressed,
which improves both storage space and performance.

o ORC

Optimized row columnar (ORC) is a columnar data format similar to Parquet. ORC is typically used in big
data workloads, such as Apache Hive and Spark. With the columnar format, you can efficiently
compress data and improve performance. These performance benefits make ORC a widely chosen data
format for ML workloads.

3. Object-notation data
o Object notation fits non-tabular, hierarchical data, such as graphs or textual data.
o Object-notation data is structured into hierarchical objects with features and key-value pairs rather
than rows and columns.

Object-based file types

• JSON
JavaScript Object Notation (JSON) is a document-based data format that is both human and machine
readable. ML models can learn from JSON because it has a flexible data structure. The data is compact,
hierarchical, and easy to parse, which makes it suitable for many ML workloads.

JSON is represented in objects and arrays.

An object is data defined by key-value pairs and enclosed in braces {}. The data can be a string, number, Boolean, array, object, or null. An array is a collection of values enclosed in square brackets [ ], with values separated by commas. The following array consists of multiple objects.
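For illustration, a hypothetical object followed by an array of objects:

    {"customer_id": 101, "churned": false, "plan": "basic"}

    [
      {"customer_id": 101, "churned": false},
      {"customer_id": 102, "churned": true}
    ]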

• JSONL

JavaScript Object Notation Lines (JSONL) is also called newline-delimited JSON. It is a format for
encoding JSON objects that are separated by new lines instead of being nested. Each JSON object is
written on its own line, such as in the following example.
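For illustration, a hypothetical JSONL sample with one complete JSON object per line:

    {"customer_id": 101, "tenure_months": 14, "churned": false}
    {"customer_id": 102, "tenure_months": 3, "churned": true}
    {"customer_id": 103, "tenure_months": 27, "churned": false}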

JSONL improves efficiency because individual objects can be processed without loading a larger JSON
array. This improved efficiency when parsing objects results in better handling of large datasets for ML
workloads. Additionally, JSONL structure can map to columnar formats like Parquet, which provides the
additional benefits of those file types.
Graphs for data visualization

Categorical data

• Bar charts: used for comparison analysis; feature the proportion of a dataset for specific attributes.
• Pie charts: used for composition analysis; feature the entirety of the dataset.
• Heat maps: used for relationship analysis; use color to depict patterns and relationships.

Numerical data

• Scatterplots: assist with relationship analysis; useful for identifying distinct regions.
• Histograms: data divided into bins; assist with distribution analysis; useful for seeing the overall behavior of a single feature.
• Density plots: similar to histograms, but smooth the distribution of data points and don't constrain the data to bins; assist with distribution analysis; useful for seeing the distribution of a single feature as a continuous distribution without bins.
• Box plots: display the location of key data points, such as the median, quartiles, and outliers; assist with distribution analysis; useful for quickly comparing distributions and identifying skewness, spread, and outliers.
1.1.2 STORE DATA

Data Storage Options


1. S3

Features: S3 serves as a central data lake for ingesting, extracting, and transforming data to and from other
AWS services used for processing tasks. These tasks are an integral part of most ML workloads. The ability to
store and retrieve data from anywhere makes Amazon S3 a key component in workflows requiring scalable,
durable, and secure data storage and management.

Considerations: S3 provides scalability, durability, and low cost, but it has higher latency compared to local
storage. For latency-sensitive workloads, S3 might not be optimal. When deciding if S3 meets your needs,
weigh its benefits against potential higher latency. With caching and proper architecture, many applications
achieve excellent performance with S3, but you must account for its network-based nature.

Use cases

• Data ingestion and storage


S3 can be used to store large datasets required for ML. Data can be ingested into S3 through streaming
or batch processing. The data in S3 can then be used for ML training and inference. The scalability and
durability of S3 makes it well-suited for storing the large volumes of data for effective machine learning.
• Model training and evaluation
S3 stores ML datasets and models. It provides versioning to manage different model iterations, so you
can store training and validation data in S3. You can also store trained ML models. With versioning, you
can manage and compare models to evaluate performance.

• Integration with other AWS services


S3 serves as a centralized location for other AWS services to access data. For example,

o SageMaker can access Amazon S3 data to train and deploy ML models.


o Kinesis can stream data into S3 buckets for ingestion.
o AWS Glue can connect to data stored in S3 for data processing purposes.
2. EBS

Features: Amazon EBS is well-suited for databases, web applications, analytics, and ML workloads. The
service integrates with Amazon SageMaker as a core component for ML model training and deployment. By
attaching EBS volumes directly to Amazon EC2 instances, you can optimize storage for ML and other data-
intensive workloads.

Considerations: EBS separates storage from EC2 instances, requiring more planning to allocate and scale
volumes across instances. Instance stores simplify storage by tying storage directly to the EC2 instance
lifecycle. This helps to avoid separate volume management. Although EBS offers flexibility, instance stores provide more streamlined, intrinsic storage management than EBS.

Use cases

• High-performance storage
EBS provides high-performance storage for ML applications requiring fast access to large datasets. EBS
offers volumes with high IOPS for quick data reads and writes. The high throughput and IOPS accelerate
ML workflows and applications.
• Host pre-trained models
With EBS, you can upload, store, and access pre-trained ML models to generate real-time predictions
without setting up separate hosting infrastructure.
3. EFS
Features:

• The service is designed to grow and shrink automatically as files are added or removed, so performance
remains high even as file system usage changes.
• EFS uses the NFSv4 networking protocol to allow compute instances access to the file system across a
standard file system interface. You can conveniently migrate existing applications relying upon on-
premises NFS servers to Amazon EFS without any code changes.

Considerations: EFS has higher pricing, but offers streamlined scaling of shared file systems. EBS offers lower costs, but there are potential performance limitations based on workload. Consider whether the higher EFS costs and potential performance variability are acceptable trade-offs compared to potentially lower EBS costs with workload-dependent performance.

Use cases

• Concurrent access
EFS allows multiple EC2 instances to access the same datasets simultaneously. This concurrent access
makes Amazon EFS well-suited for ML workflows that require shared datasets across multiple compute
instances.
• Shared datasets
EFS provides a scalable, shared file system in the cloud that eliminates the need for you to copy large
datasets to each compute instance. Multiple instances can access data, such as ML learning libraries,
frameworks, and models, simultaneously without contention. This feature contributes to faster model
training and deployment of ML applications.

4. Amazon FSx

Features:

• Amazon FSx offers a rich set of features focused on reliability, security, and scalability to support ML,
analytics, and high-performance computing applications.
• The service delivers millions of IOPS with sub-millisecond latency so you can build high-performance
applications that require a scalable and durable file system.

Considerations: When using Amazon FSx for ML workloads, consider potential tradeoffs. Certain file
system types and workloads can increase complexity and management needs. Tightly coupling the ML
workflow to a specific file system also risks vendor lock-in, limiting future flexibility.

Use cases

• Two types of file systems


FSx is a fully managed service that provides two types of file systems: Lustre and Windows File Server.
Lustre allows for high-performance workloads requiring fast storage, such as ML training datasets.

• Distributed architecture
Lustre's distributed architecture provides highly parallel and scalable data access, making it ideal for
hosting large, high-throughput datasets used for ML model training. By managing infrastructure
operations, including backups, scaling, high availability, and security, you can focus on your data and
applications rather than infrastructure management.
Model output Storage Options
1. Training Workloads

Training workloads require high performance and frequent random I/O access to data.

• EBS volumes are well-suited for providing the random IOPS that training workloads need.
Additionally, Amazon EC2 instance store volumes offer extremely low-latency data access. This is
because data is stored directly on the instances themselves rather than on network-attached
volumes.
2. Inference Workloads

Need fast response times for delivering predictions, but usually don't require high I/O performance, except
for real-time inference cases.

• EBS gp3 volumes or EFS storage options are well-suited for meeting these needs.
• For increased low-latency demands, upgrading to EBS io2 volumes can provide improved low-
latency capabilities.
3. Real-time and streaming workloads
• EFS file systems allow low latency and concurrent data access for real-time and streaming
workloads. By sharing the same dataset across multiple EC2 instances, EFS provides high
throughput access that meets the needs of applications requiring real-time data sharing.
4. Dataset storage
• S3 can be used for storing large datasets that do not need quick access, such as pretrained ML
models, or data that is static or meant for archival purposes.
Data Access Patterns
There are three common data access patterns in ML: copy and load, sequential streaming, and
randomized access.
• Copy and load: Data is copied from S3 to a training instance backed by EBS.
• Sequential streaming: Data is streamed to instances as batches or individual records, typically from S3 to instances backed by EBS volumes.
• Randomized access: Data is randomly accessed, such as with a shared file system data store, like FSx and EFS.
Cost
Cost comparison

• S3 has the lowest cost for each gigabyte of storage based on storage classes. Storage classes are priced
for each gigabyte, frequency of access, durability levels, and for each request.

• EBS has network attached storage, which is more expensive per gigabyte than Amazon S3. However, it
provides lower latency, storage snapshots, and additional performance features that might be useful for
ML workloads.

• EFS is a managed file service with increased costs that can link multiple instances to a shared dataset.
Cost structure is designed around read and write access and the number of gigabytes used, with different storage tiers available.

• FSx pricing depends on the file system used. General price structure is around storage type used for
each gigabyte, throughput capacity provisioned, and requests.

AWS Tools for Reporting and Cost Optimization

AWS provides several reporting and cost-optimization tools:

• AWS Cost Explorer – See patterns in AWS spending over time, project future costs, identify areas that
need further inquiry, observe Reserved Instance utilization, observe Reserved Instance coverage, and
receive Reserved Instance recommendations.

• AWS Trusted Advisor – Get real-time identification of potential areas for optimization.

• AWS Budgets – Set custom budgets that trigger alerts when cost or usage exceed (or are forecasted to
exceed) a budgeted amount. Budgets can be set based on tags and accounts as well as resource types.

• CloudWatch – Collect and track metrics, monitor log files, set alarms, and automatically react to
changes in AWS resources.

• AWS CloudTrail – Log, continuously monitor, and retain account activity related to actions across AWS
infrastructure at low cost.

• S3 Analytics – Automated analysis and visualization of S3 storage patterns to help you decide when to
shift data to a different storage class.

• AWS Cost and Usage Report – Granular raw data files detailing your hourly AWS usage across accounts
used for Do-It-Yourself (DIY) analysis (e.g., determining which S3 bucket is driving data transfer spend).
The AWS Cost and Usage Report has dynamic columns that populate depending on the services you
use.
1.1.3 Data Ingestion
Realtime Ingestion - streaming services
Amazon Kinesis vs MSK vs Firehose

• Kinesis Data Streams is primarily used for ingesting and processing data.
• Firehose provides a streamlined method of streaming data to data storage locations.
• Amazon MSK (Managed Streaming for Apache Kafka) provides ingestion and consumption of streaming data using Apache Kafka, and Amazon Managed Service for Apache Flink can process that streaming data in real time for analysis.
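As a minimal sketch (not from the source material), ingesting one record into a Kinesis data stream with boto3; the stream name and payload fields are hypothetical:

    import json
    import boto3

    kinesis = boto3.client("kinesis")

    # Each record needs a byte payload and a partition key that determines
    # which shard receives the record.
    record = {"sensor_id": "s-42", "temperature": 21.7}
    kinesis.put_record(
        StreamName="ml-ingest-stream",              # hypothetical stream name
        Data=json.dumps(record).encode("utf-8"),
        PartitionKey=record["sensor_id"],
    )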

Streaming use cases


Data Extraction
Extraction

• Amazon S3 Transfer Acceleration

Amazon S3 Transfer Acceleration uses CloudFront edge locations to accelerate large data transfers to and from
S3. These transfers can help speed up data collection for ML workloads that require moving large datasets. S3
Transfer Acceleration overcomes bottlenecks like internet bandwidth and distance that can limit transfer speeds
when working with large amounts of data.

• DMS

AWS Database Migration Service (AWS DMS) facilitates database migration between databases or to Amazon S3 by
extracting data in various formats, such as SQL, JSON, CSV, and XML. Migrations can run on schedules or in
response to events for frequent data extraction. With AWS DMS, you can migrate databases between many
sources and targets.

• AWS DataSync

With AWS DataSync, you can efficiently transfer data between on-premises systems or AWS services by
extracting data from sources, such as data file systems or network-attached storage. You can then upload data
to AWS services like Amazon S3, Amazon EFS, Amazon FSx, or Amazon RDS on a scheduled or event-driven
basis. DataSync facilitates moving large datasets to the cloud while reducing network costs and data transfer
times.

• AWS Snowball

AWS Snowball is a physical device service used to transfer large amounts of data into and out of AWS when
network transfers are infeasible. Snowball devices efficiently and cost-effectively move terabytes or petabytes of
data into S3
Storage

• S3

With S3 serving as a highly scalable object storage service, data used for ML projects can be spread out across
storage locations. Data can be extracted and transferred to and from S3 with other AWS services. These other services include Amazon S3 Transfer Acceleration, the AWS CLI, AWS SDKs, AWS Snowball, AWS DataSync, AWS DMS, AWS Glue, and AWS Lambda.

• EBS

EBS volumes provide storage for ML data. This data can be copied to services such as Amazon S3 or Amazon
SageMaker, using tools like the AWS Management Console, AWS CLI, or AWS SDKs to manage volumes. EBS
volumes store the necessary data that is then extracted and moved to other AWS services to meet ML
requirements.

• EFS

Amazon EFS allows creating shared file systems that can be accessed from multiple EC2 instances, so you can
share data across compute resources. You can extract data from EFS using AWS CLI, AWS SDKs, or with services
like AWS Transfer Family and DataSync that facilitate data transfers. Amazon EFS provides the capability to share
data from Amazon EC2 instances while also providing tools to conveniently move the data to other services.

• RDS

Amazon Relational Database Service (Amazon RDS) provides relational databases that can be accessed through
AWS services like AWS DMS, the AWS CLI, and AWS SDKs to extract and transfer data. Amazon RDS is a common
source for extracting relational data because it offers managed database instances that streamline data access.

• DynamoDB

Amazon DynamoDB is a fully managed NoSQL database service provided by AWS. You can extract data using
various AWS services like AWS DMS, AWS Glue, and AWS Lambda. You can use programmatic tools, such as the
AWS CLI and AWS SDK, to process and analyze the data outside of DynamoDB. Data extraction allows
DynamoDB data to be integrated with other platforms for further processing.
Data Merging
1. AWS Glue is a fully managed ETL service that you can use to prepare data for analytics and machine
learning workflows.

Best for: Glue works for ETL workloads from varied data sources into data lakes like Amazon S3.

Steps

a) Identify data sources: AWS Glue can be used to combine or transform large datasets using Apache Spark. It can efficiently process large structured and unstructured datasets in parallel. AWS Glue integrates with services like S3, Redshift, Athena, or other JDBC-compliant data stores.
b) Create an AWS Glue crawler: AWS Glue crawlers scan data and populate the AWS Glue Catalog.
c) Generate ETL scripts and define jobs: Jobs run the ETL scripts to extract, transform, and load the data,
which can start on demand or can be scheduled to run at specific intervals.
d) Clean and transformed data is written back to S3 or to another data store, such as Amazon Redshift.
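A condensed, illustrative sketch of what a Glue PySpark job script for steps (b)-(d) can look like; the catalog database, table, and bucket names are hypothetical and not from the source:

    import sys
    from awsglue.utils import getResolvedOptions
    from awsglue.context import GlueContext
    from awsglue.dynamicframe import DynamicFrame
    from awsglue.job import Job
    from pyspark.context import SparkContext

    args = getResolvedOptions(sys.argv, ["JOB_NAME"])
    glue_context = GlueContext(SparkContext.getOrCreate())
    job = Job(glue_context)
    job.init(args["JOB_NAME"], args)

    # Extract: read the table that a crawler registered in the Glue Data Catalog.
    source = glue_context.create_dynamic_frame.from_catalog(
        database="ml_raw",             # hypothetical catalog database
        table_name="customer_events")  # hypothetical catalog table

    # Transform: drop records with a missing customer_id.
    cleaned = source.toDF().dropna(subset=["customer_id"])
    cleaned_dyf = DynamicFrame.fromDF(cleaned, glue_context, "cleaned")

    # Load: write the cleaned data back to S3 as Parquet.
    glue_context.write_dynamic_frame.from_options(
        frame=cleaned_dyf,
        connection_type="s3",
        connection_options={"path": "s3://my-ml-bucket/clean/"},  # hypothetical bucket
        format="parquet")
    job.commit()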

2. Amazon EMR

Amazon EMR is a service for processing and analyzing large datasets using open-source tools of big data
analysis, such as Apache Spark and Apache Hadoop. It applies ETL methodologies to ensure the product is
flexible and scalable. Amazon EMR integrates data from multiple sources into one refined platform, making
the transformation of data cost-effective and quick.

Best for: Amazon EMR is best suited for processing huge datasets in the petabyte range.

STEPS

o Ingest streaming sources: ETL is done using Apache Spark Streaming APIs. This makes it possible to
source data in real time from places such as Apache Kafka and Amazon Kinesis. Data is received and
combined in real time.
o Distribute across EMR cluster: EMR clusters are made up of various nodes, each of which are
configured specifically to handle parallel tasks and data processing.
o Generate ETL scripts and define jobs: At the end of the data processing lifecycle, you can use the
Python, Scala, or SQL development environments. These environments give you powerful, flexible
methods for scripting data workflows and for making data filtering, transformation, and aggregation
more convenient.
o Output to Amazon S3: After processing and transforming the data, the results are outputted in an
Amazon S3 bucket.

• Amazon SageMaker Data Wrangler

Amazon SageMaker Data Wrangler is a visual, code-free tool for importing, exploring, cleaning, and preparing data for ML. It can combine data from multiple sources and apply built-in transformations before exporting the prepared data for training.

3. Clean and enrich: Cleanse and explore data, perform feature engineering with built-in data transforms, and detect statistical bias with Amazon SageMaker Clarify.
6. Integrate a data preparation workflow: Use Amazon SageMaker Pipelines to integrate a data preparation workflow.
7. Export prepared data: Export data to SageMaker Feature Store or S3.

When to use which (Data Wrangler vs Glue vs EMR)

AWS Glue
• Purpose: serverless ETL service
• Ease of use: visual ETL designer
• Ideal volume: medium to large datasets
• ML integration: can prepare data for ML, but not specialized
• Ideal use cases: batch data transformations, data catalog management, serverless data integration

Amazon EMR
• Purpose: big data processing platform
• Ease of use: requires cluster setup and management
• Ideal volume: very large datasets
• ML integration: can run ML frameworks, but requires setup
• Ideal use cases: batch and real-time processing, complex big data processing

SageMaker Data Wrangler
• Purpose: ML-focused data preparation
• Ease of use: visual interface, no coding required
• Ideal volume: small to medium datasets
• ML integration: tightly integrated with the SageMaker ML workflow
• Ideal use cases: quick data exploration and visualization, data cleaning and transformation for ML models
Troubleshooting
Scalability issues

• Capacity issues with data destinations

With EFS, FSx, and S3, you can seamlessly scale storage up or down in size.

• Latency issues, IOPs, and data transfer times

High latency, insufficient IOPS, or slow data transfer times significantly impact the performance of storage systems and data ingestion. These issues can arise from network bottlenecks, undersized provisioned storage volumes, or inefficient ingestion methods.

Consider optimizing network configurations or use AWS services with improved performance
capabilities, such as provisioned IOPS volumes for EBS. Using techniques, such as compression or
batching, can also lead to improved data-transfer efficiency.

• Uneven distribution of data access

Hotspots or bottlenecks in storage systems can be caused by uneven distribution of data access
resulting in performance degradation or data availability issues.

AWS services, such as S3 and EFS, automatically distribute data and provide load balancing
capabilities. Data partitioning is another strategy that can be implemented, which distributes data
across multiple storage resources, reducing the likelihood of hotspots.

Ingestion modifications

DO THE ASSESSMENT!!
1.1.4 Summary
Keywords/Concepts → AWS Service/Option

• Data lake, central storage, scalable, durable; large static datasets, archival → Amazon S3
• High-performance storage, real-time predictions, pre-trained models → Amazon EBS
• Concurrent access, shared datasets, scalable file system; real-time/streaming workloads → Amazon EFS
• High-performance computing, distributed architecture, Lustre → Amazon FSx
• Training workloads, high IOPS, random access → EBS (especially io2) or Instance Store
• Inference workloads (standard) → EBS gp3 or EFS
• Inference workloads (low-latency) → EBS io2
• Copy and load pattern → S3 to EBS
• Sequential streaming pattern → S3 to EBS
• Randomized access pattern → FSx or EFS
• Columnar storage, data compression → Parquet, ORC
• Row-based storage, sequential records → CSV, Avro RecordIO
• Hierarchical data, flexible structure → JSON, JSONL

Data Types and Processing

• Tabular data: suitable for linear regression and classification; not suitable for complex pattern recognition in unstructured data.
• Columnar data: suitable for data analysis and efficient querying; not suitable for frequent record updates.
• Time series data: suitable for trend analysis and forecasting; not suitable for non-temporal pattern recognition.
• Image data: suitable for object recognition and image classification; not suitable for text-based analysis.
• Text data: suitable for natural language processing and sentiment analysis; not suitable for image or numerical analysis.
Data Formats

• CSV (row-based): suitable for simple tabular data with easy human readability; space-efficient.
• Avro RecordIO (row-based): suitable for ML workloads requiring multiple dataset iterations; schema-defined.
• Parquet (columnar): suitable for large-scale data analysis; efficient compression and fast querying.
• ORC (columnar): suitable for big data analysis (Hive, Spark); optimized for large-scale data processing.
• JSON/JSONL (hierarchical): suitable for hierarchical, non-tabular data; flexible structure and easy parsing.

Data Visualization

See Above

AWS Storage Options

• Amazon S3: scalable, durable, central data lake; limitation: higher latency compared to local storage.
• Amazon EBS: high IOPS, suited for databases and ML training; limitations: limited to a single EC2 instance, requires volume management.
• Amazon EFS: concurrent access, scalable shared file system; limitations: higher costs, performance dependent on network speed.
• Amazon FSx: lowest latency, high-performance computing; limitations: potential vendor lock-in, complex management for some file system types.
Data Access Patterns

• Copy and load: best with S3 to EBS; data copied entirely before processing.
• Sequential streaming: best with S3 to EBS; data streamed in batches or individual records.
• Randomized access: best with EFS or FSx; random data access over a shared file system.

Use Case Recommendations

• Training workloads: EBS (io2) or Instance Store; high IOPS, low-latency random access.
• Inference workloads (standard): EBS (gp3) or EFS; balance of performance and cost.
• Inference workloads (low-latency): EBS (io2); higher IOPS for faster response times.
• Real-time/streaming workloads: EFS; concurrent access, shared datasets.
• Large static datasets: S3; cost-effective for infrequently accessed data.
• Distributed processing: EFS or FSx; concurrent access from multiple instances.
1.2 Transform Data (Data Cleaning, Categorical encoding, Feature Engineering)

Remember

• Data cleaning focuses on handling issues like missing data and outliers.
• Categorical encoding - Used to convert values into numeric representations.
• Feature engineering focuses on modifying or creating new features from the data, rather than
encoding features.

1.2.1 Data Cleaning


Incorrect and Duplicate Data

Deduplication - The process of automating data duplication removal. Deduplication works by scanning datasets for duplicate information, retaining one copy of the information, and replacing the other instances with pointers that direct back to the stored copy. Deduplication can drastically increase storage capacity by keeping only unique information in your data repositories.
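A minimal pandas sketch of deduplication (column names are hypothetical):

    import pandas as pd

    df = pd.DataFrame({"customer_id": [1, 2, 2, 3],
                       "plan": ["basic", "pro", "pro", "basic"]})

    # Keep the first occurrence of each fully duplicated row.
    deduped = df.drop_duplicates(keep="first")

    # Or deduplicate on a subset of columns, such as a natural key.
    deduped_by_id = df.drop_duplicates(subset=["customer_id"], keep="first")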
Data Outliers
Methods

1. Calculating mean and median

Mean: The mean is the average of all the values in the dataset. The mean can be a useful method for understanding your data when the data is symmetrical. For example, a symmetrical set of data that contains ages of respondents might reflect that both the mean and median of the dataset is 50 years old.

Median: The median is the value in the dataset that divides the values into two equal halves. If your data is skewed or contains outliers, the median tends to provide the better metric for understanding your data as it relates to central tendency. For example, a dataset that contains income levels might contain outliers. The mean might skew toward higher or lower values, while the median would provide a more accurate picture of the data's central tendency.

2. Identifying natural and artificial outliers

Natural outliers: Natural outliers are data points that are accurate representations of data, but are extreme variations of the central data points. For example, in a dataset that includes height measurements of individuals, an extremely tall individual would represent a natural outlier.

Artificial outliers: Artificial outliers are anomalous data points in your dataset due to error or improper data collection. For example, a faulty sensor in a thermometer might produce a body temperature that is unrealistically high or low compared to expected body temperatures.

Artificial outlier - This data is in the correct format for the Age column, but an entry of 154 is unrealistic. In this
case, it makes the most sense to delete this entry from your data.
Natural outlier - Although three million dollars a year is drastically more than the rest of the salaries in our
dataset, this number is still plausible. In this case, you can choose to transform the outlier and reduce the outlier's
influence on the overall dataset. You will learn about that type of data transformation later in this course.
Incomplete and Missing Data
There are some key steps you can take to address incomplete and missing values in your dataset.

a) Identify missing values

There are certain Python libraries, such as Pandas, that you can use to check for missing values.

b) Determine why values are missing

Before you can determine how to treat the missing values, it’s important to investigate which mechanisms
caused the missing values. The following are three common types of missing data:

• Missing at Random (MAR): The probability that a data point is missing depends only on the observed data,
not the missing data.

Example: In a dataset of student test scores, scores are missing for some students who were
absent that day. Absence is related to performance.

• Missing Completely at Random (MCAR): The probability that a data point is missing does not depend on
the observed or unobserved data.

Example: In an employee survey, some people forgot to answer the question about their number of
siblings. Their missing sibling data does not depend on any values.

• Missing Not at Random (MNAR): The probability that a data point is missing depends on the missing data
itself.

Example: In a financial audit, companies with accounting irregularities are less likely to provide
complete records. The missing data depends on the sensitive information being withheld.

• Drop missing values

Depending on what is causing your missing values, you will decide to either drop the missing values or impute
data into your dataset.

One of the most straightforward ways to deal with missing values is to remove the rows of data with missing
values. You can accomplish this by using a Pandas function. Dropping rows or columns removes the missing values from the dataset. However, the risk of dropping rows and columns is significant.

Issues

o If you drop hundreds of rows or columns of data, removing that much data might cause bias in your model predictions.
o If you drop too much data, you might not have enough features to feed the model.
• Impute values

Missing values might be related to new features that haven't been included in your dataset yet. After you include more data, those missing values might be highly correlated with the new feature. In this case, you would deal with missing values by adding more new features to the dataset. If you determine the values are missing at random, data imputation, or filling in the missing values, is most likely the best option.

One common way to impute missing values is to replace the value with the mean, median, or most frequent value. You would select the most frequent value for categorical variables and the mean or median for numerical variables. Choosing between the mean, median, and most frequent value depends on your business problem and data collection procedures.
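A short pandas sketch of identifying, dropping, and imputing missing values as described above (the DataFrame and column names are hypothetical):

    import pandas as pd

    df = pd.DataFrame({"age": [34, None, 52, 41],
                       "income": [52000, 61000, None, 45000],
                       "segment": ["a", "b", None, "a"]})

    # Identify missing values per column.
    print(df.isnull().sum())

    # Option 1: drop rows that contain any missing value.
    dropped = df.dropna()

    # Option 2: impute - median for numerical columns, most frequent value for categorical.
    df["age"] = df["age"].fillna(df["age"].median())
    df["income"] = df["income"].fillna(df["income"].median())
    df["segment"] = df["segment"].fillna(df["segment"].mode()[0])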
1.2.2 Categorical encoding
Categorical encoding is the process of manipulating text-based variables into number-based variables.

When to encode
Not all categorical variables need to be encoded. Depending on your use case, different ML algorithms might not
require you to encode your variables.

For instance, a random forest model can handle categorical features directly. You would not need to encode
values, such as teal, green, and blue, as numeric values.

Encoding Types (or types of Categorical values)


• Binary categorical values refer to values that are one of two options. For example, a column of data might be
true or false, such as if someone attended an event.

• Nominal, or multi-categorical, values are category values where order does not matter. A set of data that
contains different geographic locations, such as states or cities, might be considered multi-categorical.

• Ordinal, or ordered, values are category values where the order does matter, like what size of drink you order
at a coffee shop: small, medium, or large.

Encode Techniques

• Label encoding converts each categorical value into a numeric value.

• One-hot encoding: creates a new binary feature for each unique category value. Rather than assigning a different numeric value to each category like label encoding, one-hot encoding sets each feature to 1 or 0 depending on whether the category applies to a given data point.

When to use which: One-hot encoding might not be the best technique if there are a lot of categories. The additional columns might grow your dataset so much that it becomes difficult to analyze efficiently.
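A small sketch of label encoding versus one-hot encoding with scikit-learn and pandas (the category values are hypothetical):

    import pandas as pd
    from sklearn.preprocessing import LabelEncoder

    df = pd.DataFrame({"size": ["small", "large", "medium", "small"],
                       "color": ["teal", "green", "blue", "teal"]})

    # Label encoding: each category becomes a single integer column.
    df["size_label"] = LabelEncoder().fit_transform(df["size"])

    # One-hot encoding: one new 0/1 column per unique category value.
    one_hot = pd.get_dummies(df["color"], prefix="color")
    df = pd.concat([df, one_hot], axis=1)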
1.2.3 Feature Engineering
Feature engineering is a method for transforming raw data into more informative features that help models better
capture underlying relationships in the data.

Feature Engineering by data type (numeric, text, image, and sound data types)
We only cover numeric and text here

Numeric feature engineering involves transforming numeric values for the model and is often accomplished by
grouping different numeric values together.

Text feature engineering involves transforming text for the model and is often accomplished by splitting the text
into smaller pieces.
1. Numerical Feature Engineering
• Purpose: Aims to transform the numeric values so that all values are on the same scale.
• Why: This method helps you to take large numbers and scale them down, so that the ML algorithm can
achieve quicker computations and avoid skewed results.

a) Feature Scaling:

• Normalization: rescales the values (often between 0 and 1).
• Standardization: similar, but rescales to a mean of 0 and a standard deviation of 1. When to use: reduces the negative effect of outliers.

b) Binning:

The data is divided into these bins based on value ranges, thus transforming a numeric feature into a
categorical one.

When: numerical data when the exact difference in numbers is not important, but the general range is a
factor.

c) Log transformation:

The most common logarithmic functions have bases of 10 or e, where e is approximately equal to
2.71828. Logarithmic functions are the inverse of exponential functions and are useful for modeling
phenomena where growth decreases over time, like population growth or decay.

When: with skewed numeric data or multiple outliers. Essentially, the log transformation compresses large values into a smaller range.

For example, the log of $10,000 would be around 4 and the log of $10,000,000 would be around 7. Using
this method, the outliers are brought much closer to the normal values in the remainder of the dataset.
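A brief sketch of feature scaling, binning, and a log transformation with scikit-learn, pandas, and NumPy (the salary values are hypothetical):

    import numpy as np
    import pandas as pd
    from sklearn.preprocessing import MinMaxScaler, StandardScaler

    salaries = pd.DataFrame({"salary": [42000, 55000, 61000, 3000000]})

    # Normalization: rescale values into the 0-1 range.
    salaries["salary_norm"] = MinMaxScaler().fit_transform(salaries[["salary"]])

    # Standardization: rescale to mean 0 and standard deviation 1.
    salaries["salary_std"] = StandardScaler().fit_transform(salaries[["salary"]])

    # Binning: keep only the general range, not the exact value.
    salaries["salary_bin"] = pd.cut(salaries["salary"],
                                    bins=[0, 50000, 100000, np.inf],
                                    labels=["low", "mid", "high"])

    # Log transformation: compress the outlier toward the rest of the data.
    salaries["salary_log10"] = np.log10(salaries["salary"])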
2. Text Engineering

a) Bag of Words: The bag-of-words model does not keep track of the sequence of words, but counts the
number of words in each observation. Bag-of-words uses tokenization to create a statistical
representation of the text.

b) N-gram: builds off of bag-of-words by producing a group of words of n size.

c) Temporal data: Temporal or time series data is data involving time, like a series of dates. Temporal data
can come in many formats, and is often a mix of numeric and text data.

When to use which

• Bag of words: create a statistical representation of the text.
• N-grams: phrases of n size are important (like sentiment analysis).
• Temporal data: capture key trends, cycles, or patterns over time.
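A minimal scikit-learn sketch of bag-of-words counts and bigrams (the sentences are hypothetical):

    from sklearn.feature_extraction.text import CountVectorizer

    docs = ["the service was great", "the service was not great"]

    # Bag of words: token counts; word order is ignored.
    bow = CountVectorizer()
    print(bow.fit_transform(docs).toarray())
    print(bow.get_feature_names_out())

    # N-grams (here bigrams): keeps short phrases such as "not great".
    bigrams = CountVectorizer(ngram_range=(2, 2))
    print(bigrams.fit_transform(docs).toarray())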
Feature Selection Techniques

• Feature splitting & Feature combining

• Feature splitting: breaks down a single feature into multiple features.
• Feature combining: combines multiple features into new derived features.

• Principal component analysis

Statistical technique that you can use for dimensionality reduction

Principal component – Size: The first component accounts for the physical attributes of the home that include square footage, bedrooms, and bathrooms.

Principal component – Cost: The second component captures the financial factors that include price and tax rate.
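A small scikit-learn sketch of PCA on housing-style features like those above (the data and column names are hypothetical):

    import pandas as pd
    from sklearn.decomposition import PCA
    from sklearn.preprocessing import StandardScaler

    homes = pd.DataFrame({"sqft": [1200, 2500, 1800, 3100],
                          "bedrooms": [2, 4, 3, 5],
                          "bathrooms": [1, 3, 2, 3],
                          "price": [210000, 540000, 330000, 700000],
                          "tax_rate": [1.1, 1.4, 1.2, 1.5]})

    # Standardize first so no single feature dominates the components.
    scaled = StandardScaler().fit_transform(homes)

    # Reduce five features to two components (roughly "size" and "cost").
    pca = PCA(n_components=2)
    components = pca.fit_transform(scaled)
    print(pca.explained_variance_ratio_)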
X. AWS Tools for Data Transformation
X.1. Data Labeling with AWS
1. Mechanical Turk

Purpose:
o Image annotation:
o Text annotation:
o Data collection:
o Data cleanup:
o Transcription:

2. SageMaker Ground Truth

Uses Mechanical Turk and other data processing methods to streamline the data preparation process
even further.

Purpose
o Image annotation:
o Text annotation:
o Object Detection
o Named Entity Recognition

3. SageMaker Ground Truth Plus

Fully managed data labeling service using expert labelers.

When to use which

Mechanical Turk
On-demand access to workers, lower costs, fast turnaround times, task flexibility, and
quality control capabilities.
SageMaker Ground Truth
Higher quality, or Object detection or NER, public sourcing
SageMaker Ground Truth Plus
Production labeling workflows, sensitive data, complex tasks, and custom interfaces
X.2. Data Ingestion with AWS
1. Data Wrangler

Purpose: visual, code-free tool for data preprocessing and feature engineering

Steps

o Clean data:
o Feature engineering: Combine columns and apply formulas, etc.
o Fixing formatting issues: Missing headers, encoding problems, etc.
o Reducing data size: For large datasets
o Automating transformations:

2. SageMaker Feature Store

What:

o Managed repository for storing, sharing, and managing features.


o Storing features saves time by eliminating redundant feature engineering efforts.

Steps

• Automated data preprocessing


• Centralized feature repository:
• Management of feature pipelines:
• Standardized features:
• Caching for performance:

When to use which


• Use Data Wrangler:
o Initial data exploration
o one-time transformations
o when working directly in notebooks.
• Use Feature Store
o Moving to production
o Sharing features across models or teams
o when you need low-latency feature serving for online inference.
X.3. Data Transformation with AWS
1. AWS Glue

Purpose:

• AWS Glue auto-generates Python code to handle issues like distributed processing, scheduling, and
integration with data sources.
• AWS Glue DataBrew is a visual data preparation tool for cleaning, shaping, and normalizing datasets.

Use cases
o Automated ETL pipelines
o Data integration and ingestion:
o Data cleansing and standardization
o Feature engineering:
o Final pretraining data preparation
Steps

2. SageMaker Data Wrangler

Purpose:

• We already know SageMaker Data Wrangler can ingest data.


• SageMaker Data Wrangler can also help explore, clean, and preprocess data without writing code.

Use Cases

• Clean Data & Fix formatting issues


• Feature Engineering
• Reducing data size
• Automating transformations

Steps

When to use:

• Glue: Production ETL pipelines, large-scale data processing, scheduled jobs

• Data Wrangler: Exploratory data analysis, quick transformations, ML data prep in SageMaker
3. For Streaming data

• Lambda: data normalization, data filtering, transcoding media.
• Spark on Amazon EMR: real-time analytics, anomaly detection, monitoring dashboards.
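A hedged sketch of a Lambda handler that normalizes records arriving from a Kinesis trigger (field names are hypothetical; Kinesis event payloads arrive base64-encoded):

    import base64
    import json

    def lambda_handler(event, context):
        normalized = []
        for record in event["Records"]:
            payload = json.loads(base64.b64decode(record["kinesis"]["data"]))
            # Example normalization: lowercase the id and convert Fahrenheit to Celsius.
            normalized.append({
                "sensor_id": payload["SensorId"].lower(),
                "temp_c": (payload["TempF"] - 32) * 5 / 9,
            })
        # Delivery to a downstream store (for example S3 or Firehose) would go here.
        return {"normalized_count": len(normalized)}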
1.3 Validate Data and Prepare for Modeling
1.3.1 VALIDATE DATA
Basics
Bias Metrics

• Class imbalance (CI): occurs when the distribution of classes in the training data is skewed (one class significantly less represented than the others). If CI is positive, the advantaged group is relatively overrepresented in the dataset; if CI is negative, the advantaged group is relatively underrepresented.
• Difference in proportion of labels (DPL): compares the distribution of positive labels between classes in the data. If DPL is positive, one class has a significantly higher proportion of positive labels; if DPL is negative, a significantly lower proportion.
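A quick sketch of computing these two metrics by hand, following the definitions used by SageMaker Clarify (facet a is the advantaged group, facet d the disadvantaged group; the data is hypothetical):

    import pandas as pd

    df = pd.DataFrame({"sex": ["m", "m", "m", "f", "f", "f", "f", "f"],
                       "label": [1, 1, 0, 1, 0, 0, 0, 0]})

    # Class imbalance: CI = (n_a - n_d) / (n_a + n_d)
    n_a = (df["sex"] == "m").sum()
    n_d = (df["sex"] == "f").sum()
    ci = (n_a - n_d) / (n_a + n_d)          # negative: facet a is underrepresented

    # Difference in proportion of labels: DPL = q_a - q_d
    q_a = df.loc[df["sex"] == "m", "label"].mean()
    q_d = df.loc[df["sex"] == "f", "label"].mean()
    dpl = q_a - q_d                          # positive: facet a gets more positive labels

    print(ci, dpl)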

Data Validation Strategies


• Resampling: adds data manually by oversampling or undersampling existing points; typically applied to numeric data. Example technique: SMOTE.
• Synthetic data generation: adds data algorithmically by creating new artificial data points; typically applied to text-based data. Example technique: GAN.
• Data augmentation: transforms existing data algorithmically; typically applied to image data.
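A minimal oversampling sketch using SMOTE from the third-party imbalanced-learn package (the toy dataset is hypothetical):

    from collections import Counter
    from imblearn.over_sampling import SMOTE
    from sklearn.datasets import make_classification

    # Imbalanced toy dataset: roughly 10% positive class.
    X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
    print(Counter(y))

    # SMOTE synthesizes new minority-class points between existing neighbors.
    X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
    print(Counter(y_res))   # classes are now balanced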

AWS tools for Data Validation


• Glue Data Quality: managed data quality monitoring. Use cases: data validation, data quality, automated scheduling, data quality dashboards.
• Glue DataBrew: visual data preparation tool. Use cases: data profiling, built-in transformations, custom transformations.
• Comprehend: NLP tool. Use cases: entity recognition, language detection, topic modeling.
SageMaker Clarify
How it works

• Create a bias report using the configuration for pre-training and post-training analysis
• Assess the bias report by considering the class imbalance (CI) and difference in proportion of labels
(DPL).
1. Set up the bias report: To set up
your bias report configurations, use
the BiasConfig to provide
information on which columns
contain the facets with the sensitive
group of sex, what the sensitive
features might be using
facets_values_or_threshold, and
what the desirable outcomes are
using labels_values_or_threshold.
2. Run the bias report: create the bias
report using the configuration for
pretraining and post-training
analysis. This step takes
approximately 15-20 minutes

3. Access the bias report:

4. Assess the Class Imbalance: The


CI shows -0.33 as the bias metric. CI
measures if the advantaged group,
men, is represented in the dataset
at a substantially higher rate than
the disadvantaged group, women, or
the other way around. The negative
value demonstrated indicates that
the advantaged group, men, is
relatively underrepresented in this
dataset example.
5. Assess the Difference in Positive
Proportion of Labels:
Review the DPL. This metric detects
a label imbalance between classes
that might cause unwarranted
biases during training.
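A condensed sketch of steps 1 and 2 using the SageMaker Python SDK's clarify module, assuming a CSV dataset in S3; the bucket, column names, and IAM role are hypothetical, and exact parameter names should be checked against the SDK documentation:

    from sagemaker import clarify

    data_config = clarify.DataConfig(
        s3_data_input_path="s3://my-bucket/train.csv",    # hypothetical input
        s3_output_path="s3://my-bucket/clarify-report/",  # hypothetical output
        label="granted",                                  # outcome column
        headers=["sex", "age", "income", "granted"],
        dataset_type="text/csv")

    bias_config = clarify.BiasConfig(
        label_values_or_threshold=[1],   # which label values count as favorable
        facet_name="sex",                # sensitive column
        facet_values_or_threshold=[0])   # values identifying the sensitive group

    processor = clarify.SageMakerClarifyProcessor(
        role="arn:aws:iam::123456789012:role/SageMakerRole",  # hypothetical role
        instance_count=1,
        instance_type="ml.m5.xlarge")

    # Pre-training analysis reports CI, DPL, and other bias metrics to S3.
    processor.run_pre_training_bias(
        data_config=data_config,
        data_bias_config=bias_config,
        methods=["CI", "DPL"])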
1.3.2 PREPARE FOR MODELLING
Dataset Splitting, Shuffling, and Augmentation
Data Splitting Techniques

Train, test, validate
• Best for: easy implementation; provides a quick estimate of model performance.
• Limitations: the performance estimate might have variance due to dependency on specific examples in the test set; not suitable for small datasets because it might lead to overfitting or underfitting.

Cross-validation
• Best for: uses the entire dataset for training and testing, maximizing data usage; reduces variance in performance estimation by averaging results across multiple iterations.
• Limitations: computationally more expensive, especially for large datasets; might be sensitive to class imbalances if not stratified properly.
• Example: k-fold cross-validation.
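A short scikit-learn sketch of a train/validate/test split and k-fold cross-validation (the toy data is hypothetical):

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import KFold, cross_val_score, train_test_split

    X, y = make_classification(n_samples=500, random_state=0)

    # Train / validate / test split, for example 70 / 15 / 15.
    X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.3, random_state=0)
    X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)

    # K-fold cross-validation: every example is used for both training and testing.
    scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                             cv=KFold(n_splits=5, shuffle=True, random_state=0))
    print(scores.mean())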

Dataset shuffling

Benefits of data shuffling: Dataset shuffling plays a crucial role in mitigating biases that might arise from the
inherent structure of the data. By introducing randomness through shuffling, you can help the model be
exposed to a diverse range of examples during training.

Data shuffling techniques:


Dataset Augmentation

Data augmentation works by creating new, realistic training examples that expand the model's
understanding of the data distribution. Dataset augmentation involves artificially expanding the size and
diversity of a dataset

Data augmentation techniques:

• Image-based Augmentation
o Flipping, rotating, scaling, or shearing images
o Adding noise or applying color jittering
o Mixing or blending images to create new, synthetic examples
• Text-based Augmentation
o Replacing words with synonyms or antonyms
o Randomly inserting, deleting, or swapping words
o Paraphrasing or translating text to different languages
o Using pre-trained language models to generate new, contextually relevant text
• Time series Augmentation
o Warping or scaling the time axis
o Introducing noise or jitter to the signal
o Mixing or concatenating different time-series segments
o Using generative models to synthesize new time-series data
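To make the image-based techniques above concrete, here is a minimal NumPy sketch (the array image stands in for a real picture; real pipelines would typically use a library such as torchvision or Albumentations):

import numpy as np

image = np.random.rand(32, 32, 3)  # placeholder image as an HxWxC array

# Flip: mirror the image horizontally
flipped = np.fliplr(image)

# Rotate: turn the image 90 degrees
rotated = np.rot90(image)

# Noise: add small Gaussian noise, then clip back to the valid [0, 1] range
noisy = np.clip(image + np.random.normal(0, 0.05, image.shape), 0.0, 1.0)

# Mixing: blend two images to create a new, synthetic example (a simple mixup-style blend)
other = np.random.rand(32, 32, 3)
mixed = 0.7 * image + 0.3 * other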

When to use which


✓ Data-splitting (Train, Test, Validate):
o Pros: Clear separation of data, prevents data leakage
o Cons: Reduces amount of data available for training
✓ Cross-validation:
o Pros: Makes efficient use of all data, robust performance estimate
o Cons: Computationally expensive, may not be suitable for very large datasets
✓ Data shuffling:
o Pros: Reduces bias, improves generalization
o Cons: May not be appropriate for time-series data where order matters
✓ Data augmentation:
o Pros: Increases dataset size, improves model robustness
o Cons: May introduce artificial patterns if not done carefully
AWS services for pre-training data configuration
Final formatting process

SageMaker built-in algorithms for formatting

• CSV: Accepted by many SageMaker built-in algorithms, including:
o XGBoost
o Linear Learner
o DeepAR
• RecordIO-protobuf: Commonly used for image data, where each record represents an image and
its associated metadata.

Steps after formatting:

• Upload data to Amazon S3.
• Mount Amazon EFS or Amazon FSx.
• Copy data from Amazon S3 to EFS or FSx using AWS data transfer utilities or a custom script.
• Verify data integrity by checking file sizes and checksums after the transfer is complete.
• Load the data into your ML training resource:
o With Amazon EFS, create an EFS file system and mount it to your SageMaker notebook instance or training job. Copy the dataset files into the EFS file system, then load the data in your training script by accessing the Amazon EFS mount path.
o With Amazon FSx, create a Lustre file system and attach it to your SageMaker resource. Copy the data files to the FSx for Lustre file system, then load the data in your training script by accessing the Amazon FSx mount path.
o Note that both Amazon EFS and Amazon FSx for Lustre provide shared file storage that can be accessed from multiple Amazon Elastic Compute Cloud (Amazon EC2) instances at the same time.
• Monitor, refine, scale, automate, and secure:
o Once your data is loaded into your resource, you continue to monitor, refine, scale, automate, and secure your ML workloads; this is a complex and involved part of the ML lifecycle.
o Implement data lifecycle management strategies, such as archiving or deleting old or unused data.
o Consider AWS services such as AWS Step Functions, AWS Lambda, Amazon Managed Workflows for Apache Airflow (Amazon MWAA), and AWS CodePipeline to automate and orchestrate your data workflows.
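A small sketch of the first step, uploading a formatted training file to Amazon S3 with the SageMaker Python SDK session helper; the file name and key prefix are placeholders:

import sagemaker

session = sagemaker.Session()
bucket = session.default_bucket()          # or a specific bucket you own

# Upload the formatted training file; returns the s3:// URI to pass to a training job
train_s3_uri = session.upload_data(
    path="train.csv",                      # local file produced by the formatting step
    bucket=bucket,
    key_prefix="my-project/train",         # placeholder prefix
)
print(train_s3_uri)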
Domain 2: Data Transformation
2.1 Choose a modelling approach
2.1.1 AWS Model Approaches
AWS AI/ML stack:
• AWS AI services: NO fine-tune option
• AWS ML services: fine-tune option
• Customized ML model solutions (using AWS infrastructure and frameworks)

2.1.2 SageMaker Offerings

1. Studio
2. Roles and Persona
3. Choice of IDEs

SageMaker notebook instances


SageMaker notebook instances initiate Jupyter servers on Amazon Elastic Compute Cloud (Amazon
EC2) and provide preconfigured kernels with the following packages:

o Amazon SageMaker Python SDK, AWS SDK for Python (Boto3)


o AWS Command Line Interface (AWS CLI)
o Conda
o Pandas
o Deep learning framework libraries
o Other libraries for data science and ML
2.1.3 SageMaker Model types

1. Supervised

• Linear Learner: binary classification, multi-class classification, regression
• XGBoost: binary classification, multi-class classification, regression
• K-Nearest Neighbors: multi-class classification, regression
• Factorization Machines: binary classification, regression


2. Unsupervised

3. Text or speech data

4. Images and video (or time series data)


5. Reinforcement learning (RL)
To train RL models in SageMaker RL, use the following components:

• A deep learning (DL) framework. Currently, SageMaker supports RL in TensorFlow and Apache MXNet.
• An RL toolkit. An RL toolkit manages the interaction between the agent and the environment and provides a wide selection of state-of-the-art RL algorithms. SageMaker supports the Intel Coach and Ray RLlib toolkits. For information about Intel Coach, see https://nervanasystems.github.io/coach/. For information about Ray RLlib, see https://ray.readthedocs.io/en/latest/rllib.html.
• An RL environment. You can use custom environments, open-source environments, or commercial environments. For information, see RL Environments in Amazon SageMaker.

2.1.4 SageMaker AutoML

SageMaker Autopilot

• Data analysis and processing: SageMaker Autopilot identifies your specific problem type, handles missing
values, normalizes your data, selects features, and prepares the data for model training.
• Model selection: SageMaker Autopilot explores a variety of algorithms. SageMaker Autopilot uses a cross-
validation resampling technique to generate metrics that evaluate the predictive quality of the algorithms
based on predefined objective metrics.
• Hyperparameter optimization: SageMaker Autopilot automates the search for optimal hyperparameter
configurations.
• Model training and evaluation: SageMaker Autopilot automates the process of training and evaluating
various model candidates.
o It splits the data into training and validation sets, and then it trains the selected model candidates
using the training data.
o Then it evaluates their performance on the unseen data of the validation set.
o Lastly, it ranks the optimized model candidates based on their performance and identifies the best
performing model.
• Model deployment: After SageMaker Autopilot has identified the best performing model, it provides the
option to deploy the model. It accomplishes this by automatically generating the model artifacts and the
endpoint that expose an API. External applications can send data to the endpoint and receive the
corresponding predictions or inferences.
2.1.5 SageMaker JumpStart
SageMaker JumpStart is an ML hub with foundation models, built-in algorithms, and prebuilt ML solutions that you can deploy with a few clicks.

Features

Foundation Models
With JumpStart foundation models, many models are available such as:

• Jurassic models from AI21


• Stable Diffusion from Stability.ai
• Falcon from HuggingFace
• Llama from Meta
• AlexaTM from Amazon

JumpStart industry-specific solutions


• Demand forecasting

Amazon SageMaker JumpStart provides developers and data science teams ready-to-start AI/ML models and
pipelines. SageMaker JumpStart is ready to be deployed and can be used as-is. For demand forecasting,
SageMaker JumpStart comes with a pre-trained, deep learning-based forecasting model, using Long- and Short-
Term Temporal Patterns with Deep Neural Networks (LSTNet).

• Credit rating prediction

Amazon SageMaker JumpStart solution uses Graph-Based Credit Scoring to construct a corporate network from
SEC filings (long-form text data).

• Fraud detection

Detect fraud in financial transactions by training a graph convolutional network with the deep graph library and a
SageMaker XGBoost model.
• Computer vision

Amazon SageMaker JumpStart supports over 20 state-of-the-art, fine-tunable object detection models from
PyTorch hub and MxNet GluonCV. The models include YOLO-v3, FasterRCNN, and SSD, pre-trained on MS-
COCO and PASCAL VOC datasets.

Amazon SageMaker JumpStart also supports image feature vector extraction for over 52 state-of-the-art image
classification models including ResNet, MobileNet, EfficientNet from TensorFlow hub. Use these new models to
generate image feature vectors for their images. The generated feature vectors are representations of the images
in a high-dimensional Euclidean space. They can be used to compare images and identify similarities for image
search applications.

• Extract and analyze data from documents

JumpStart provides solutions for you to uncover valuable insights and connections in business-critical
documents. Use cases include text classification, document summarization, handwriting recognition,
relationship extraction, question and answering, and filling in missing values in tabular records.

• Predictive maintenance

The AWS predictive maintenance solution for automotive fleets applies deep learning techniques to common
areas that drive vehicle failures, unplanned downtime, and repair costs.

• Churn prediction

After training this model using customer profile information, you can take that same profile information for any
arbitrary customer and pass it to the model. You can then have it predict whether that customer is going to churn
or not. Amazon SageMaker JumpStart uses a few algorithms to help with this. LightGBM, CatBoost,
TabTransformer, and AutoGluon-Tabular used on a churn prediction dataset are a few examples.

• Personalized recommendations

Amazon SageMaker JumpStart can perform cross-device entity linking for online advertising by training a graph
convolutional network with a deep graph library.

• Healthcare and life sciences

You could use the model to summarize long documents with LangChain and Python. The Falcon LLM is a large
language model, trained by researchers at the Technology Innovation Institute (TII) on over 1 trillion tokens using
AWS. Falcon has many different variations, with its two main constituents Falcon 40B and Falcon 7B, comprised
of 40 billion and 7 billion parameters, respectively. Falcon has fine-tuned versions trained for specific tasks,
such as following instructions. Falcon performs well on a variety of tasks, including text summarization,
sentiment analysis, question answering, and conversing.

• Financial pricing

Many businesses dynamically adjust pricing on a regular basis to maximize their returns. Amazon SageMaker
JumpStart has solutions for price optimization, dynamic pricing, option pricing, or portfolio optimization use
cases. Estimate price elasticity using Double Machine Learning (ML) for causal inference and the Prophet
forecasting procedure. Use these estimates to optimize daily prices.

• Causal inference

Researchers can use machine learning models such as Bayesian networks to represent causal dependencies
and draw causal conclusions based on data.
2.1.6 Bedrock
Use cases
2.2 Train Models
2.2.1 Model Training Concepts
Minimizing loss:

• The loss function needs to reach its minimum (the global minimum)
• Gradient descent optimization is used to move toward the global minimum of the loss
• The learning rate hyperparameter controls the step size gradient descent takes on each update
o A learning rate that is too high can overshoot the minimum and fail to converge
o A learning rate that is too small undershoots, making convergence very slow

(Measuring) Loss function:

Root mean square error (RMSE)
• The most basic form of a loss function, commonly used in regression tasks that predict continuous values (for example, predicting home prices).

Log-likelihood loss
• A variation of a loss function, also known as cross-entropy loss, used in logistic regression.
• Instead of the raw predicted probabilities of each class, the logarithm of the probabilities is considered.

When to use which

Log-likelihood loss is used for classification tasks, where the goal is to predict whether an input belongs to one of two or more classes. For example, you might use logistic regression to predict whether an email is spam.
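A short sketch computing both losses with scikit-learn; the true values and predictions below are made up for illustration:

import numpy as np
from sklearn.metrics import mean_squared_error, log_loss

# Regression: RMSE penalizes large errors on continuous targets such as home prices
y_true_reg = np.array([250_000, 310_000, 180_000])
y_pred_reg = np.array([245_000, 330_000, 175_000])
rmse = np.sqrt(mean_squared_error(y_true_reg, y_pred_reg))

# Classification: log-likelihood (cross-entropy) loss works on predicted class probabilities
y_true_cls = np.array([1, 0, 1])            # e.g., spam vs. not spam
y_pred_proba = np.array([0.9, 0.2, 0.6])    # predicted probability of the positive class
cross_entropy = log_loss(y_true_cls, y_pred_proba)

print(f"RMSE: {rmse:.2f}, log loss: {cross_entropy:.4f}")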
Optimizing - Reducing the loss function:

Gradient descent
• Weights updated: every epoch
• Speed of each epoch calculation: slowest
• Gradient steps: smooth updates toward the minima

Stochastic gradient descent (SGD)
• Weights updated: every data point
• Speed of each epoch calculation: fast
• Gradient steps: noisy or erratic updates toward the minima

Mini-batch gradient descent
• Weights updated: every batch
• Speed of each epoch calculation: slower
• Gradient steps: less noisy or erratic updates toward the minima

Gradient descent

As mentioned, gradient descent only updates weights after it's gone through all of the data, also
known as an epoch. Of the three variations covered here

• gradient descent has the slowest speed to finding the minima as a result, but
• also has the fewest number of steps to reach the minima.

In stochastic gradient descent or SGD, you update your weights for each record you have in your
dataset.

Stochastic Gradient Descent (SGD)

For example, if you have 1000 data points in your dataset, SGD will update the parameters 1000 times.
With gradient descent, the parameters would be updated only once in every epoch.

• SGD leads to more parameter updates and, therefore, the model will get closer to the minima
more quickly.
• One drawback of SGD, however, is that it oscillates in different directions, unlike gradient descent, and therefore takes many more steps to reach the minima.

Mini-batch gradient descent

Hybrid of gradient descent and SGD, this approach uses a smaller dataset or a batch of records, also
called batch size, to update your parameters.

• Mini-batch gradient descent updates more often than gradient descent while having less erratic or noisy updates compared to SGD. The user-defined batch size helps you fit each batch into memory, so the algorithm can run on almost any average computer that you might be using.
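To make the three variants concrete, here is a minimal NumPy sketch of mini-batch gradient descent for simple linear regression; the data is synthetic, and setting batch_size to the dataset size recovers plain gradient descent while setting it to 1 recovers SGD:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 1))
y = 3.0 * X[:, 0] + 1.0 + rng.normal(scale=0.1, size=1000)   # synthetic data: y = 3x + 1

w, b = 0.0, 0.0
learning_rate, batch_size, epochs = 0.1, 32, 20

for epoch in range(epochs):
    perm = rng.permutation(len(X))                            # shuffle each epoch
    for start in range(0, len(X), batch_size):
        idx = perm[start:start + batch_size]
        xb, yb = X[idx, 0], y[idx]
        error = (w * xb + b) - yb
        # Gradients of mean squared error with respect to w and b on this batch
        grad_w = 2 * np.mean(error * xb)
        grad_b = 2 * np.mean(error)
        w -= learning_rate * grad_w                           # one parameter update per batch
        b -= learning_rate * grad_b

print(f"learned w={w:.2f}, b={b:.2f}")   # should approach w=3, b=1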
2.2.2 Compute Environment

AWS Instances for ML:

AWS offers solutions for a variety of specific ML tasks, and this permits you to optimize on your particular use
case scenarios.

AWS Container Services:

Amazon ECS
• ECS simplifies the process of running and managing containerized applications on AWS, offering various deployment options and seamlessly integrating with other AWS services.
• Keyword: general/custom container orchestration

Amazon EKS
• Amazon EKS provides a fully managed Kubernetes control plane and seamless integration with other AWS services.
• Keyword: Kubernetes

AWS Fargate
• Keyword: fully managed (serverless) containers

Amazon ECR
• ECR makes it easy to store, manage, and deploy container images.
• Keyword: container registry

Containers in SageMaker for ML model generation


1. SageMaker managed container images
• You can use the built-in training algorithms included in these containers, or keep the ML framework, settings, libraries, and dependencies included in the container image but provide your own custom training script. The latter approach is referred to as script mode.

2. Customer-managed container images (BYOC)


• You can build your own container using the Bring Your Own Container (BYOC) approach if
you need more control over the algorithm, framework, dependencies, or settings.
• Some industries might require BYOC or BYOM to meet regulatory and auditing
requirements.
2.2.3 Train a model
Create training job
• Create IAM role
• Choose algorithm source (built-in, etc.)
• Choose algorithm
• Configure compute resource
• Set hyperparameters (+ default)
• Specify data type
• Choose Data source (default S3)

Model created
• Store in S3
• Package and distribute
• Register model (in registry)

Train a model

For built-in algorithms, the only inputs you need to provide are the

• training data
• hyperparameters
• compute resources.
Amazon SageMaker training options
When it comes to training environments, you have several to choose from:

• Create a training job using SageMaker console (see the Creating a Training Job Using the Amazon
SageMaker Console lesson for an example using this method).

• Use AWS SDKs for the following:

o The high-level SageMaker Python SDK

o The low-level SageMaker APIs for the SDK for Python (Boto3) or the AWS CLI

Training data sources


• S3
• Amazon EFS
• Amazon FSx for Lustre

Training ingestion modes

Pipe mode
• What: SageMaker streams data directly from Amazon S3 to the container, without downloading the data to the ML storage volume.
• Pros: Improves training performance by reducing the time spent on data download.

File mode
• What: SageMaker downloads the training data from S3 to the provisioned ML storage volume, then mounts the directory to the Docker volume for the training container.
• Pros: In a distributed training setup, the training data is distributed uniformly across the cluster.
• Cons: You must manually ensure the ML storage volume has sufficient capacity to accommodate the data from Amazon S3.

Fast File mode
• What: SageMaker streams data directly from S3 to the container with no code changes. Users can author their training script to interact with these files as though they were stored on disk.
• Pros: Works best when the data is read sequentially.
• Cons: Augmented manifest files are not supported. Startup time is lower when there are fewer files in the S3 bucket provided.

When to use which

• Large training datasets: Pipe mode, Fast File mode
• Training algorithm reads data sequentially: Pipe mode, Fast File mode
Amazon SageMaker training – Script mode
Amazon SageMaker script mode provides the flexibility to develop custom training and inference code
while using industry-leading machine learning frameworks.

Steps to bring your own script using SageMaker script mode

1. Use your local laptop or desktop with the SageMaker Python SDK. You can get different
instance types, such as CPUs and GPUs, but are not required to use the managed notebook
instances.

2. Write your training script.

3. Create a SageMaker estimator object, specifying the

a) training script

b) instance type

c) other configurations.

4. Call the fit method on the estimator to start the training job, passing in the training and
validation data channels.

5. SageMaker takes care of the rest. It pulls the image from Amazon Elastic Container Registry
(Amazon ECR) and loads it on the managed infrastructure.

6. Monitor the training job and retrieve the trained model artifacts once the job is complete.

Example

In this example, the PyTorch estimator is configured with the training script using the entry_point: train.py,
instance type ml.p3.2xlarge, and other settings. The fit method is called to launch the training job, passing in the
location of the training data.
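The referenced code is not reproduced in these notes; a minimal sketch of what it might look like with the SageMaker Python SDK follows (the role ARN, framework version, hyperparameters, and S3 path are placeholders):

from sagemaker.pytorch import PyTorch

# Estimator: the training script, instance type, and other settings from step 3
estimator = PyTorch(
    entry_point="train.py",                                   # your script-mode training script
    role="arn:aws:iam::123456789012:role/MySageMakerRole",    # placeholder role
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    framework_version="1.13",                                 # placeholder framework version
    py_version="py39",
    hyperparameters={"epochs": 10, "batch-size": 64},
)

# Step 4: launch the training job, passing in the location of the training data
estimator.fit({"training": "s3://my-bucket/train/"})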
Reducing training time

a) Early stopping:
Early stopping is a regularization technique that stops the training process for an ML model when the model's performance on a validation set stops improving.

How early stopping works in Amazon SageMaker


Amazon SageMaker provides a seamless integration of early stopping into its hyperparameter tuning
functionality, so users can use this technique with minimal effort. Here is how early stopping works in
SageMaker:

a) Evaluating objective metric after each epoch: During the training process, SageMaker evaluates the
specified objective metric (for example, accuracy, loss, F1-score) for each epoch or iteration of the
training job.

b) Comparing to running median of previous training jobs

c) Stopping current job if performing worse than median:
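In SageMaker AMT, early stopping is enabled with a single argument on the tuner. A minimal sketch, assuming an estimator, objective metric name, and hyperparameter ranges have already been defined elsewhere:

from sagemaker.tuner import HyperparameterTuner

tuner = HyperparameterTuner(
    estimator=estimator,                            # an existing SageMaker estimator
    objective_metric_name="validation:accuracy",    # placeholder metric name
    hyperparameter_ranges=hyperparameter_ranges,    # ranges defined elsewhere
    max_jobs=20,
    max_parallel_jobs=2,
    early_stopping_type="Auto",   # SageMaker stops jobs performing worse than the running median
)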

b) Distributed training
A. Data parallelism is the process of splitting the training set in mini-batches evenly distributed across
nodes. Thus, each node only trains the model on a fraction of the total dataset.

Done in SageMaker using SageMaker distributed data parallelism (SMDDP) library

B. Model parallelism is the process of splitting a model up between multiple instances or nodes.

SageMaker model parallelism library v2 (SMP v2)

Guidance on choosing data parallelism compared to model parallelism

• If model can fit on a single GPU's memory but your dataset is large, data parallelism is the
recommended approach. It splits the training data across multiple GPUs or instances for faster
processing and larger effective batch sizes.
• If model is too large to fit on a single GPU's memory, model parallelism becomes necessary. It splits the
model itself across multiple devices, enabling the training of models that would otherwise be intractable
on a single GPU.
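A sketch of how data parallelism might be turned on for a PyTorch training job using the SMDDP library; the instance type, count, and role are placeholders (SMDDP requires supported multi-GPU instances):

from sagemaker.pytorch import PyTorch

estimator = PyTorch(
    entry_point="train.py",
    role="arn:aws:iam::123456789012:role/MySageMakerRole",    # placeholder role
    instance_count=2,                        # data is sharded across instances
    instance_type="ml.p4d.24xlarge",         # placeholder SMDDP-supported instance
    framework_version="1.13",
    py_version="py39",
    # Enable the SageMaker distributed data parallelism (SMDDP) library
    distribution={"smdistributed": {"dataparallel": {"enabled": True}}},
)
estimator.fit({"training": "s3://my-bucket/train/"})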
Building a deployable model package

Step 1: Upload your model artifact to Amazon S3.

Step 2: Write a script that will run in the container to load the model artifact. In this example, the script
is named inference.py. This script can include custom code for generating predictions, as well as input
and output processing. It can also override the default implementations provided by the pre-built
containers.

To install additional libraries at container startup, add a requirements.txt file that specifies the libraries
to be installed by using pip.

Step 3: Create a model package that bundles the model artifact and the code. This package should
adhere to a specific folder structure and be packaged as a tar archive, named model.tar.gz, with gzip
compression.
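A small sketch of step 3, building the model.tar.gz package with the expected folder layout; the model file name and the presence of requirements.txt are placeholders:

import tarfile

# Expected layout inside the archive:
#   model.pth                        <- the trained model artifact (placeholder name)
#   code/inference.py                <- custom inference logic
#   code/requirements.txt (optional) <- extra libraries installed at container startup
with tarfile.open("model.tar.gz", "w:gz") as tar:
    tar.add("model.pth", arcname="model.pth")
    tar.add("inference.py", arcname="code/inference.py")
    tar.add("requirements.txt", arcname="code/requirements.txt")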
2.3 Refine Models
2.3.1 Evaluating Model Performance
Bias and Variance
a) What are these

Bias: error from overly simple assumptions; the model misses relevant relationships in the data (underfits).
Variance: error from sensitivity to small fluctuations in the training data; the model fits noise (overfits).

Common causes of high bias vs. high variance

High bias
• The model is too simple
• Incorrect modeling or feature engineering
• Inherited bias from the training dataset

High variance
• The model is too complex
• Too much irrelevant data in the training dataset
• Model trained for too long on the training dataset
2.3.2 Model Fit (Overfitting and Underfitting)
1. Overfit/Underfit
• Overfit

1. Reasons

• Training data too small
• Too much irrelevant data
• Excessive training time: prolonged training on the same data can cause the model to memorize training examples instead of learning underlying patterns.
• Overly complex architecture: a model with too many parameters (weights and biases) can start memorizing the training data and noise.

2. Detecting model overfitting

• Check for model variance: if your model performed well on the training set but poorly on the validation set, it indicates high variance and overfitting.
• Use K-fold cross-validation: split the input data into k subsets, also called folds, and train multiple models on this dataset. For each model, change which fold is set aside as the evaluation dataset. This process is repeated k times.

• Underfit

Reasons

• Insufficient data or insufficient training time: the model might not have had the opportunity to learn the necessary patterns and relationships in the data.
• Overly simple architecture: a model with too few parameters (weights and biases) will likely not be able to accurately capture the nonlinear relationships or intricate patterns within the data.
b) Preventing Overfitting and Underfitting
a) Remediating Overfitting

• Early stopping: pauses the training process before the model learns the noise in the data.
• Pruning: aims to remove weights that don't contribute much to the training process.
• Regularization:
o Dropout: randomly drops out (sets to 0) a number of neurons in each layer of the neural network during each epoch.
o L1 regularization: pushes the weights of less important features to zero.
o L2 regularization: results in smaller overall weight values (and stabilizes the weights) when there is high correlation between the input features.
• Data augmentation: perform data augmentation to increase the diversity of the training data.
• Model architecture simplification
b) Remediating Underfitting

• Train for an appropriate length of time
• Use a larger number of data points
• Increase model flexibility:
o Add new domain-specific features. For example, if you have length, width, and height as separate variables, you can create a new volume feature as the product of these three variables.
o Add Cartesian products: consider generating new features through Cartesian products of existing features.
o Change feature engineering: adjusting the types of feature processing techniques can increase model flexibility. For example, in NLP tasks, you can increase the size of n-grams.
o Decrease regularization, such as reducing the regularization strength or using a different regularization technique.
c) Combining models for improved performance
Ensembling: Process of combining the predictions or outputs of multiple machine learning models to create
a more accurate and robust final prediction.

The idea behind ensembling is that by combining the strengths of different models, the weaknesses of
individual models can be mitigated. This leads to improved overall performance.

The following are commonly used ensembling methods:

Boosting
• Trains different machine learning models sequentially.
• Use when accuracy is the priority.

Bagging
• Combines multiple models trained on different subsets of the data.
• Use when interpretability matters; helps prevent overfitting.

Stacking
• Combines both approaches.
• Helps prevent both overfitting and underfitting.

Boosting algorithms
• Adaptive Boosting (AdaBoost): classification
• Gradient Boosting (GB): classification, regression
• Extreme Gradient Boosting (XGBoost): classification, regression, large datasets and big data applications

Bagging (bootstrap aggregation)
• Random forests

Stacking
• ??
2.3.3 Hyperparameter Tuning
Benefits of Hyperparameter tuning
a) Impact of Hyperparameter tuning on model performance

b) Types of hyperparameters for tuning

Gradient descent algorithm

Learning rate
• Determines the step size taken by the algorithm during each iteration. This controls the rate at which the training job updates model parameters.
• Careful: if the learning rate is too high, the algorithm might overshoot the optimal solution and fail to converge.

Batch size
• Number of examples used in each iteration.
• Careful: a larger batch size can lead to faster convergence but might require more computational resources.

Epochs
• Number of passes through the entire training dataset.
• Careful: too many epochs can result in overfitting.
Neural networks

Number of layers
• More layers -> more complex model.
• Careful: increasing the depth of a network risks overfitting.

Number of neurons in each layer
• More neurons -> more processing power.
• Careful: increasing the number of neurons risks overfitting.

Choice of activation functions
• Introduce non-linearity into the neural network.
• Common activation functions include: Sigmoid, Rectified Linear Unit (ReLU), Hyperbolic Tangent (Tanh), Softmax.

Regularization techniques
• Help prevent overfitting.
• Common regularization techniques: L1/L2 regularization, Dropout, Early stopping.
Decision tree

Maximum depth of tree
• Helps manage the complexity of the model and prevent overfitting.

Minimum number of samples to split a node
• Sets a threshold that the data must meet before splitting a node; this prevents the tree from creating too many branches, which also helps to prevent overfitting.

Split criterion
• Options that select how the algorithm evaluates node splits:
o Gini impurity: measures the purity of data and the likelihood that data could be misclassified.
o Entropy: measures the randomness of data. The child node that reduces entropy the most is the split that should be used.
Hyperparameter tuning techniques

Manual tuning
• Pros: works when you have a good understanding of the problem at hand.
• Cons: time-consuming.
• When: you have domain knowledge and prior experience with similar problems.

Grid search
• Systematic and exhaustive approach to hyperparameter tuning. It involves defining all possible hyperparameter values and training and evaluating the model for every combination of these values.
• Pros: reliable technique, especially for smaller-scale problems.
• Cons: computationally expensive.
• When: small-scale problems where accuracy matters.

Random search
• Pros: more efficient than grid search.
• Cons: the optimum hyperparameter combination could be missed.

Bayesian optimization
• Uses the performance of previous hyperparameter selections to predict which of the subsequent values are likely to yield the best results.
• Pros: can handle composite objectives; can also converge faster than random search.
• Cons: more complex to implement; works sequentially, so it is difficult to scale.
• When: multiple objectives and/or speed.

Hyperband
• Dynamically allocates resources to well-performing configurations and stops underperforming ones early.
• Pros: can train multiple models in parallel; can be a more efficient allocation of compute resources than grid search or random search.
• Cons: not applicable to non-iterative algorithms.
• When: only for iterative algorithms such as neural networks; when compute resources are limited.
Hyperparameter tuning using SageMaker AMT
STEPS
1. Define your environment and resources, such as output buckets, training set, and validation set.

2. Specify the hyperparameters to tune and the range of values to use for each of the
following: alpha, eta, max_depth, min_child_weight, and num_round.

3. Identify the objective metric that SageMaker AMT will use to gauge model performance.

4. Configure and launch the SageMaker AMT tuning job, including completion criteria to stop tuning after
the criteria have been met.

5. Identify the best-performing model and the hyperparameters used in its creation.

6. Analyze the correlation between the hyperparameters and objective metrics.
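A condensed sketch of steps 2-5 for an XGBoost training job; the estimator, S3 channel variables, ranges, and objective metric name are illustrative placeholders assumed to be defined elsewhere:

from sagemaker.tuner import HyperparameterTuner, ContinuousParameter, IntegerParameter

# Step 2: hyperparameters to tune and the range of values for each
hyperparameter_ranges = {
    "alpha": ContinuousParameter(0, 2),
    "eta": ContinuousParameter(0.01, 0.5),
    "max_depth": IntegerParameter(3, 10),
    "min_child_weight": ContinuousParameter(1, 10),
    "num_round": IntegerParameter(100, 500),
}

# Step 3: objective metric used to gauge model performance
tuner = HyperparameterTuner(
    estimator=xgb_estimator,                 # an existing XGBoost estimator
    objective_metric_name="validation:rmse",
    objective_type="Minimize",
    hyperparameter_ranges=hyperparameter_ranges,
    max_jobs=30,
    max_parallel_jobs=3,
)

# Step 4: launch the tuning job with the training and validation channels
tuner.fit({"train": train_s3_uri, "validation": validation_s3_uri})

# Step 5: identify the best-performing training job (and therefore its hyperparameters)
print(tuner.best_training_job())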


2.3.4 Managing Model Size
Model Size Overview
a) Model Size considerations

b) Model size reduction technique: Compression

Pruning
Pruning is a technique that removes the least
important parameters or weights from a model.

Quantization
Quantization changes the representation of weights
to its most space-efficient representation.
E.g., instead of a 32-bit floating-point representation
of weight, quantization has the model use an 8-bit
integer representation.
Knowledge distillation
With distillation, a larger teacher model transfers
knowledge to a smaller student model. The student
model is trained on the same dataset as the teacher.
However, the student model is also trained on the
teacher model's knowledge of the data.
2.3.5 Refining Pre-trained models
Benefits of Fine tuning
a) Where fine-tuning fits in the training process

Reasons for fine-tuning

• To customize your model to your specific business needs

• To work with domain-specific language, such as industry jargon, technical terms, or other specialized
vocabulary

• To have enhanced performance for specific tasks

• To have accurate, relative, and context-aware responses in applications

• To have responses that are more factual, less toxic, and better aligned to specific requirements

b) Fine-tuning approaches

• Domain adaptation: adapting foundation models to specific tasks by using limited domain-specific data.
• Instruction adaptation: uses labeled examples to improve the performance of a pre-trained foundation model on a specific task.
Fine tuning Models with Custom Datasets on AWS
With a custom dataset using Amazon SageMaker JumpStart
1. Navigate to the model detail card of your choice in SageMaker JumpStart.
2. Edit your model artifact location.
3. Enter your custom dataset location.
4. Adjust the hyperparameters of the training job.
5. Specify the training instance type.
6. Start the fine-tuning job.

With a custom dataset using Amazon Bedrock
1. Choose a custom model in Amazon Bedrock.
2. Create a fine-tuning job.
3. Configure the model details.
4. Configure the job.
5. Select your custom dataset.
6. Adjust the hyperparameters.
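The JumpStart fine-tuning flow can also be driven from the SageMaker Python SDK; a minimal sketch, where the model ID, instance type, hyperparameters, and dataset path are placeholders:

from sagemaker.jumpstart.estimator import JumpStartEstimator

estimator = JumpStartEstimator(
    model_id="huggingface-llm-falcon-7b-instruct-bf16",        # placeholder JumpStart model ID
    instance_type="ml.g5.12xlarge",                             # placeholder training instance type
    hyperparameters={"epochs": "3", "learning_rate": "2e-5"},   # adjust as needed
)

# Fine-tune on a custom dataset stored in S3
estimator.fit({"training": "s3://my-bucket/fine-tuning-data/"})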

Catastrophic Forgetting Prevention


Catastrophic forgetting occurs when a model is trained on a new task or data, and it forgets previously learned
knowledge.

a) Detecting
• Plot your model's performance over time. If the model's performance on specific tasks decreases significantly after training on new data, it might be a sign of catastrophic forgetting.
• Make sure your validation sets are representative of historic patterns in the data that are still relevant to the problem.

b) Preventing

To prevent catastrophic forgetting, consider the following techniques:


1. Elastic weight consolidation (EWC): regularization technique that predicts which weights are important to
performing previously learned tasks. It adds a penalty term to the loss function that protects these weights
when the model is fine-tuned or re-trained on new task-specific data. Monitoring the EWC can indicate how
much the model is forgetting older knowledge.
2. Rehearsal: This approach includes samples from the original training set during the fine-tuning or re-training
process. During this process, the model rehearses the previous task to help it retain the learned knowledge.
3. Model design: You can also design your model with the appropriate amount of complexity to learn and
retain patterns in the data. You can also use enough features to make sure your model captures diverse
patterns in the data that differentiate between tasks.
4. Renate: This is an open source Python library for automatic model re-training of neural networks. Instead of
working with a fixed training set, the library provides continual learning algorithms that can incrementally
train a neural network as more data becomes available.
2.3.6 Model Versioning
Benefits of SageMaker Model Registry
a) SageMaker Model Registry

b) Benefits
• Catalog models for production
• Manage model versions
• Control the approval status of models within your ML pipeline

Registering and Deploying models with SageMaker Model Registry


a) SageMaker Model Registry
2.4 Analyze Model Performance

2.4.1 Model Evaluation


Model Metrics and Evaluation Techniques

a) Classical algorithm (classification) problems

• Accuracy: number of matching predictions divided by the total number of instances.
• Precision: proportion of positive predictions that are correct. Use when the cost of false positives is high (for example, classifying emails as spam or not).
• Recall: proportion of actual positives that are identified as positive. Use when the cost of false negatives is high and it is better to have false positives (for example, diagnosing cancer).
• F1 score: combines precision and recall.
• AUC curve: area under the ROC curve, summarizing classification performance across decision thresholds.

New one: Heat maps
• Graphical representations that use color coding to visualize the performance of a model and identify the areas of interest that have the most impact when your model is making predictions.
b) Regression Algorithm problems

Mean Squared Error (MSE)
• Description: average of squared differences between predicted and actual values.
• When to use: when larger errors should be penalized more; for comparing models (lower is better); when the scale of errors is important.

Root Mean Square Error (RMSE)
• Description: square root of MSE, in the same units as the target variable.
• When to use: when you want the error in the same units as the target variable; for easier interpretation of the error magnitude; when comparing models with different scales.

R-Squared (R²)
• Description: proportion of variance in the dependent variable explained by the independent variables.
• When to use: to understand how well the model fits the data; when you want a metric bounded between 0 and 1; for comparing models across different datasets.

Adjusted R-Squared
• Description: modified version of R-Squared that adjusts for the number of predictors in the model.
• When to use: when comparing models with different numbers of predictors; to penalize overly complex models; in feature selection processes.
Model Convergence
Convergence refers to the ability of a model to reach an optimal solution during the training process.
Failure to converge can lead to suboptimal performance, overfitting, or even divergence, where the
model's performance deteriorates over time.

a) Impact of convergence

b) How SageMaker AMT helps with convergence issues

This is where SageMaker AMT can help. It can automatically tune models by finding the optimal
combination of hyperparameters, such as

i. learning rate schedules


ii. initialization techniques
iii. regularization methods.

Improving CNN convergence

How SageMaker Training Compiler helps with local maxima, local minima, and saddle points

When training a deep CNN for image classification, the optimizer can encounter saddle points or local minima. This is because the loss function landscape in high-dimensional spaces can be complex. Having multiple local minima and saddle points can trap the optimization algorithm, leading to suboptimal convergence.

This is where SageMaker Training Compiler can help. It can automatically apply optimization techniques like

• tensor remapping
• operator fusion
• kernel optimization.
Debug Model Convergence with SageMaker Debugger

SageMaker Clarify and Metrics Overview


Bias metrics give visibility into the model evaluation process.

Data bias metrics
• Class Imbalance: measures the imbalance in the distribution of classes/labels in your training data.
• Facet Imbalance: evaluates the imbalance in the distribution of facets or sensitive attributes, such as age, gender, or race, across different classes or labels.
• Facet Correlation: measures the correlation between facets or sensitive attributes and the target variable.

Model bias metrics
• Differential validity: evaluates the difference in model performance such as accuracy, precision, and recall across different facet groups.
• Differential prediction bias: measures the difference in predicted outcomes or probabilities for different facet groups, given the same input features.
• Differential feature importance: analyzes the difference in feature importance across different facet groups, helping to identify potential biases in how the model uses features for different groups.

Model explainability metrics
• SHAP (SHapley Additive exPlanations): provides explanations for individual predictions by quantifying the contribution of each feature to the model's output.
• Feature Attribution: identifies the most important features contributing to a specific prediction, helping to understand the model's decision-making process.
• Partial Dependence Plots (PDPs): visualize the marginal effect of one or more features on the model's predictions, helping to understand the relationship between features and the target variable.

Data quality metrics
• Missing Data: identifies the presence and distribution of missing values in your training data.
• Duplicate Data: detects duplicate instances or rows in your training data.
• Data Drift: measures the statistical difference between the training data and the data used for inference or production, helping to identify potential distribution shifts.
Domain 3: Select a Deployment Infrastructure
3.1 Select a Deployment Infrastructure
3.1.1 Model building & Deployment Infra
Building a Repeatable Framework
a) Example pipeline sequences -Options
Workflow Orchestration Options
a) Comparisons

SageMaker Pipelines
• When working entirely within the AWS SageMaker ecosystem
• For end-to-end ML workflows that need to be automated and managed at scale

AWS Step Functions
• For serverless orchestration of ML pipelines
• When you need to integrate ML workflows with other AWS services
• For complex workflows with branching and parallel execution
• When you want a visual representation of the workflow

Amazon MWAA (Managed Workflows for Apache Airflow)
• When you're familiar with Apache Airflow and prefer DAG-based workflows
• For complex scheduling requirements
• When you need to integrate with both AWS and non-AWS services

MLflow
• When you need an open-source platform for the complete ML lifecycle
• For tracking experiments, packaging code into reproducible runs, and sharing and deploying models
• When working in a multi-cloud or hybrid cloud environment
• When you want to use a tool that integrates well with many ML frameworks and libraries

Kubernetes
• For container orchestration of ML workflows and deploying ML models at scale
• When you need fine-grained control over resource allocation and scheduling
• For multi-cloud or hybrid cloud deployments
• When you want to leverage Kubernetes' extensive ecosystem (e.g., Kubeflow for ML-specific workflows)

b) Comparisons: AWS Controllers for Kubernetes (ACK) and SageMaker Components for Kubeflow
Pipelines.

AWS Controllers for Kubernetes (ACK)
• SageMaker Operators for Kubernetes facilitate the processes for developers and data scientists who use Kubernetes to train, tune, and deploy ML models in SageMaker.
• You can install SageMaker Operators on your Kubernetes cluster in Amazon Elastic Kubernetes Service (Amazon EKS).
• You can create SageMaker jobs by using the Kubernetes API and command-line Kubernetes tools, such as kubectl.

SageMaker Components for Kubeflow Pipelines
• You can move your data processing and training jobs from the Kubernetes cluster to the SageMaker ML-optimized managed service.
• You have an alternative to launching your compute-intensive jobs from SageMaker.
• You can create and monitor your SageMaker resources as part of a Kubeflow Pipelines workflow.
• Each of the jobs in your pipelines runs on SageMaker instead of the local Kubernetes cluster, so you can take advantage of key SageMaker features.
3.1.2 Inference Infrastructure
Deployment Considerations & Deployment Infrastructure
a) Deployment Targets
Best practice: when to use which deployment target

SageMaker endpoints
• Benefits: fully managed service; convenient to deploy and scale; built-in monitoring and logging; supports various ML frameworks.
• Keep in mind: not as customizable as other options; potentially higher cost than other options.
• Choose when: you want a fully managed solution with minimal operational overhead and don't require advanced customization.
• Use case: a bank decides to use SageMaker endpoints to deploy ML models that detect fraud.

EKS
• Benefits: highly scalable and flexible; supports advanced deployment scenarios; supports custom configurations.
• Keep in mind: possible higher operational overhead; steeper learning curve to manage the tool effectively.
• Choose when: you need advanced deployment scenarios and customized configurations, and you have the resources to manage the Kubernetes cluster.
• Use case: a biomedical company uses EKS clusters to process DNA sequencing data.

ECS
• Benefits: managed container orchestration service; convenient to scale; integrates well with other AWS services; can run in batch mode.
• Keep in mind: limited advanced features compared to Kubernetes; vendor lock-in.
• Choose when: you want a managed container orchestration service with good AWS integration and you don't require advanced Kubernetes features.
• Use case: a renewable energy firm uses Amazon ECS to scale solar energy forecasting workloads.

Lambda
• Benefits: serverless; automatically scales; low operational overhead.
• Keep in mind: limited run time; cold starts can impact latency; not suitable for long-running or complex models.
• Choose when: you have lightweight, low-latency models and want a serverless, pay-per-use solution.
• Use case: a telehealth company uses Lambda functions for appointment reminders.
Choosing a model inference strategy
a) Amazon SageMaker inference options
SageMaker provides multiple inference options, including real-time, serverless, batch, and
asynchronous to suit different workloads.

Real-time inference
• Description: for low-latency, high-throughput requests.
• When to choose: when you need immediate responses (e.g., real-time fraud detection); for applications requiring consistent, low-latency predictions; when your model can handle requests within milliseconds; for high-traffic applications with steady request rates.

Serverless inference
• Description: handles intermittent traffic without managing infrastructure.
• When to choose: for unpredictable or sporadic workloads; when you want to avoid managing and scaling infrastructure; for cost optimization in scenarios with variable traffic; for dev/test environments or proof-of-concept deployments.

Asynchronous inference
• Description: queues requests and handles large payloads.
• When to choose: for time-insensitive inference requests; when dealing with large input payloads (e.g., high-resolution images); for long-running inference jobs (up to 15 minutes); when you need to decouple request submission from processing.

Batch Transform
• Description: processes large offline datasets.
• When to choose: for offline predictions on large datasets; when you need to process data in bulk (e.g., nightly batch jobs); for scenarios where real-time predictions are not required; when you want to precompute predictions for faster serving.
Container and Instance Types for Inference
a) Choosing the right container for Inference

SageMaker managed container images
• Description: pre-built containers with inference logic included.
• When to choose: when using standard ML frameworks (e.g., TensorFlow, PyTorch, Scikit-learn); for quick deployment without custom code; when built-in inference logic meets your needs; to leverage SageMaker's optimizations and best practices.

Your own inference code
• Description: custom containers with your specific inference logic.
• When to choose: when you need custom preprocessing or postprocessing; for proprietary algorithms or frameworks not supported by SageMaker; when you require specific dependencies or libraries; for full control over the inference environment.

b) Choosing the right compute resources (AWS instance)

• t family: short jobs or notebooks
• m family: standard CPU-to-memory ratio
• r family: memory-optimized
• c family: compute-optimized
• p family: accelerated computing, training, and inference
• g family: accelerated inference, smaller training jobs
• Amazon Elastic Inference: cost-effective inference accelerators

When to choose CPU, GPU, or Inf2

CPU-based instances
• High serial performance
• Cost efficient for smaller models
• Broad support for models and frameworks

GPU-based instances (such as Amazon EC2 P5)
• High throughput at desired latency
• Cost efficient for high utilization
• Good for deep learning and large models

Inf2 instances
• Accelerator designed for ML inference
• High throughput at lower cost than GPUs
• Ideal for models that the AWS Neuron SDK supports
Optimizing Deployment with Edge Computing
a) Using edge devices - AWS Options

AWS IoT Greengrass

Amazon SageMaker Neo

b) When to use which

AWS IoT Greengrass
• Run at the edge: bring intelligence to edge devices, such as for anomaly detection in precision agriculture or powering autonomous devices.
• Manage applications: deploy new or legacy apps across fleets using any language, packaging technology, or runtime.
• Control fleets: manage and operate device fleets in the field locally or remotely using MQTT or other protocols.
• Process locally: collect, aggregate, filter, and send data locally.

SageMaker Neo
• Optimize models for faster inference: SageMaker Neo can optimize models trained in frameworks like TensorFlow, PyTorch, and MXNet to run faster with no loss in accuracy.
• Deploy models to SageMaker and edge devices: SageMaker Neo can optimize and compile models to run on SageMaker hosted inference platforms, like SageMaker endpoints. It can also help you run models on edge devices, such as phones, cameras, and IoT devices.
• Model portability: SageMaker Neo can convert compiled models between frameworks, such as TensorFlow and PyTorch. Compiled models can also run across different platforms and hardware, helping you deploy models to diverse target environments.
• Compress model size: SageMaker Neo quantizes and prunes models to significantly reduce their size, lowering storage costs and improving load times. This works well for compressing large, complex models for production.
3.2 Create and Script Infrastructure
The AWS Well-Architected Framework pillars provide consistent and scalable designs.

The security pillar

• create ML solutions that anonymize sensitive data, such as personally identifiable information
• guides the configuration of least-privileged access to your data and resources
• suggests configurations for your AWS account structures and Amazon Virtual Private Clouds to
provide isolation boundaries around your workloads.

The reliability pillar

• helps construct ML solutions that are resistant to disruption while recovering quickly
• guides you to design data processing workflows to be resilient to failures by implementing
error handling, retries, and fallback mechanisms
• recommends data backups, and versioning.

The performance efficiency pillar

• focuses on the efficient use of resources to meet requirements.


• help you optimize ML training and tuning jobs by selecting the most suitable EC2 instance
types for a particular task, running model inference using edge computing to minimize latency
and maximize performance.

The cost optimization pillar

• focuses on building and operating systems that minimize costs


• In the data processing stage -> guides storage resource selection and tools for automation
such as Amazon SageMaker Data Wrangler.
• During model development -> rightsizing compute resources
• Finally, during model deployment -> auto scaling

The sustainability pillar

• focuses on environmental impacts (energy consumption, efficient resource usage)

The operational excellence pillar

• focuses on the efficient operation, performance visibility, and continuous improvement


3.2.1 Methods for Provisioning Resources
IaC
a) Tools

CloudFormation
• Description: AWS-native IaC service
• Language support: JSON, YAML
• Multi-cloud support: AWS only
• Typical use cases: AWS-only deployments; teams familiar with the AWS ecosystem; simple to moderate complexity deployments

CDK (Cloud Development Kit)
• Description: IaC framework that compiles to CloudFormation
• Language support: TypeScript, Python, Java, C#, Go
• Multi-cloud support: AWS only (can be extended)
• Typical use cases: teams with strong programming skills; complex AWS infrastructures; reusable ML infrastructure components

Terraform
• Description: open-source IaC tool
• Language support: HCL, JSON
• Multi-cloud support: excellent
• Typical use cases: multi-cloud ML deployments; hybrid cloud scenarios; teams preferring declarative syntax

Pulumi
• Description: modern IaC platform
• Language support: TypeScript, Python, Go, .NET
• Multi-cloud support: excellent
• Typical use cases: infrastructure requiring complex logic; teams preferring familiar programming languages; multi-cloud, complex architectures
Working with CloudFormation
a) Template

"AWSTemplateFormatVersion" : "2010-09-09" Format version


This first section identifies the AWS CloudFormation
template version to which the template conforms.
"Description" : "Write details on the template." Description
This text string describes the template.
"Metadata" : { Metadata
"Instances" : {"Description" : "Info on instances"}, These objects provide additional information about
"Databases" : {"Description" : " Info about dbs"} the template.
}
"Parameters" : { Parameters
"InstanceTypeParameter" : { Values passed to your template when you create or
"Type" : "String", update a stack. You can refer to parameters from the
"Default" : "t2.micro", Resources and Outputs sections of the template.
"AllowedValues" : ["t2.micro", "m1.small"],
"Description" : "Enter t2.micro or m1.small”
}
}
"Rules" : { Rules
"Rule01": { Rules validate parameter values passed to a template
"RuleCondition": { during a stack creation or stack update.
...
},
"Assertions": [
...
]}
}
"Mappings" : { Mappings
"Mapping01" : { These are map keys and associated values that you
"Key01" : { can use to specify conditional parameter values. This
"Name" : "Value01" is similar to a lookup table
}, ...
}}
"Conditions" : { Conditions
"MyLogicalID" : {Intrinsic function} Control whether certain resources are created, or
} whether certain resource properties are assigned a
value during stack creation or an update.
"Transform" : { Transform
set of transforms For serverless applications, transform specifies the
} version of the AWS SAM to use.
"Resources" : { Resources
"Logical ID of resource" : { This section specifies the stack resources, and their
"Type" : "Resource type", properties that you would like to provision. You can
"Properties" : { refer to resources in the Resources and Outputs
Set of properties sections of the template.
}} Note: This is the only required section of the template.
}
"Outputs" : { Outputs
"Logical ID of resource" : { Describe the values that are returned whenever you
"Description" : "Information on the value", view your stack's properties. For example, you can
"Value" : "Value to return", declare an output for an Amazon S3 bucket name and
"Export" : { then call the aws cloudformation describe-
"Name" : "Name of resource to export" stacks AWS CLI command to view the name.
}}}
b) CF Stacks

c) Provisioning stacks using CloudFormation templates

$ aws cloudformation create-stack \


--stack-name myteststack \
--template-body file:///home/testuser/mytemplate.json \
--parameters ParameterKey=Parm1,ParameterValue=test1 \
ParameterKey=Parm2,ParameterValue=test2
Working with CDK
The AWS CDK consists of two primary parts:

• AWS CDK Construct Library: This library contains a collection of pre-written modular and reusable
pieces of code called constructs. These constructs represent infrastructure resources and collections
of infrastructure resources.
• AWS CDK Toolkit: This is a command line tool for interacting with CDK apps. Use the AWS CDK Toolkit
to create, manage, and deploy your AWS CDK projects.

a) CDK Construct level comparisons


Abstract Ease of Typical Use Case
Level Description
ion Use
Direct CF resources full control over CF resources
L1 Low Low
representation
L2 Logical grouping of L1 resources Medium Medium most common

High-level abstractions that Quickly deploy common


L3 High High architectural patterns
represent complete solutions

b) CDK LifeCycle

cdk init

When you begin your CDK project, you create a directory for it, run cdk init, and
specify the programming language used:

• mkdir my-cdk-app
• cd my-cdk-app
• cdk init app --language typescript
cdk bootstrap

You then run cdk bootstrap to prepare the environments into which the stacks
will be deployed. This creates the special dedicated AWS CDK resources for the
environments.
cdk synth

Creates the CloudFormation templates using the cdk synth command.


cdk deploy

Finally, you can run the cdk deploy command to have CloudFormation provision
resources defined in the synthesized templates.
Comparing CF and CDK

Authoring experience
• AWS CloudFormation: uses only JSON or YAML templates to define your infrastructure resources.
• AWS CDK: uses modern programming languages, like Python, TypeScript, Java, C#, and Go.

IaC approach
• AWS CloudFormation: templates are declarative. You define the desired state of your infrastructure and CloudFormation handles the provisioning and updates.
• AWS CDK: provides an imperative approach to generating CloudFormation templates (which are declarative), which means you can introduce logic and conditions that determine the resources to provision in your infrastructure.

Debugging and troubleshooting
• AWS CloudFormation: troubleshooting templates requires learning specific CloudFormation error handling and messages.
• AWS CDK: you can use the debugging capabilities of your chosen programming language, making it more convenient to identify and fix issues in your infrastructure code.

Reusability and modularity
• AWS CloudFormation: you can create nested stacks and cross-stack references, resulting in modular and reusable infrastructure designs. However, this approach can become complex and difficult to manage as your infrastructure grows.
• AWS CDK: supports programming languages that let you apply object-oriented programming principles. This makes it more convenient to create modular and reusable IaC code blocks for your infrastructure.

Community support
• AWS CloudFormation: has been around longer and has a larger community for support. It also has a variety of third-party tools and resources.
• AWS CDK: a newer offering than AWS CloudFormation, but it is rapidly gaining adoption.

Learning curve
• AWS CloudFormation: a steeper learning curve for developers who are used to a more programmatic approach over a template-driven approach.
• AWS CDK: if you're already familiar with programming languages like Python or TypeScript, AWS CDK will have a gentler learning curve.
3.2.2 Deploying and Hosting Models

SageMaker Python SDK


a) Creating pipelines with the SageMaker Python SDK to orchestrate workflows

pipeline = Pipeline(
name=pipeline_name,
parameters=[input_data, processing_instance_type,
processing_instance_count, training_instance_type,
mse_threshold, model_approval_status],
steps = [step_process, step_train, step_evaluate, step_conditional]
)
b) Automating common tasks with the SageMaker Python SDK

Preparing data (the .run() method)


With Amazon SageMaker Processing, you can run processing jobs for data processing steps in your
machine learning pipeline. Processing jobs accept data from Amazon S3 as input and store data into
Amazon S3 as output.

I.Creating the Processor


To define a processing job, you first create a Processor. The following example instantiates
the SKLearnProcessor() class, which streamlines using scikit-learn in your data processing
step:

II.Running the Processor


You then use the .run() method on the processor to run a processing job.

III.Adding the data preprocessing step to a SageMaker Pipeline


Finally, you define a data preprocessing step in your pipeline using ProcessingStep():
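The processing code snippets referenced above are not reproduced in these notes, so the following is a hedged sketch of the three steps using the SageMaker Python SDK; the execution role, S3 paths, script name, framework version, and instance settings are placeholder assumptions:

import sagemaker
from sagemaker.sklearn.processing import SKLearnProcessor
from sagemaker.processing import ProcessingInput, ProcessingOutput
from sagemaker.workflow.steps import ProcessingStep

role = sagemaker.get_execution_role()  # SageMaker execution role

# I. Create the Processor
sklearn_processor = SKLearnProcessor(
    framework_version="1.2-1",
    role=role,
    instance_type="ml.m5.xlarge",
    instance_count=1,
)

# II. Run a standalone processing job (reads from and writes to Amazon S3)
sklearn_processor.run(
    code="preprocessing.py",
    inputs=[ProcessingInput(source="s3://my-bucket/raw/",
                            destination="/opt/ml/processing/input")],
    outputs=[ProcessingOutput(source="/opt/ml/processing/output",
                              destination="s3://my-bucket/processed/")],
)

# III. Or wrap the same processor in a pipeline step
step_process = ProcessingStep(
    name="PreprocessData",
    processor=sklearn_processor,
    code="preprocessing.py",
    inputs=[ProcessingInput(source="s3://my-bucket/raw/",
                            destination="/opt/ml/processing/input")],
    outputs=[ProcessingOutput(output_name="train",
                              source="/opt/ml/processing/output/train")],
)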
Training models (the .fit() method)
You can run model training jobs using the SageMaker Python SDK. The following model training job
example manages the training script, framework, training instance, and training data input.

I.Creating the estimator for the training job


To define a model training job, you instantiate the estimator class. This class encapsulates
training on SageMaker. The following code creates a model training job using the MXNet() class to
train a model using the MXNet framework:

II.Running the training job


After you create the estimator, you can then use the .fit() method to run the training job. This
method takes an argument that identifies the path to the training data. In this example, the training
dataset is stored in Amazon S3:

III.Creating a model training step in SageMaker Pipelines


Finally, you define a model training step in your pipeline using the TrainingStep() method:
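The training snippets are likewise not reproduced here; below is a hedged sketch of the three steps, with placeholder entry point, framework version, and S3 paths (role as defined in the processing sketch above):

from sagemaker.mxnet import MXNet
from sagemaker.inputs import TrainingInput
from sagemaker.workflow.steps import TrainingStep

# I. Create the estimator for the training job
mxnet_estimator = MXNet(
    entry_point="train.py",
    role=role,
    instance_type="ml.m5.xlarge",
    instance_count=1,
    framework_version="1.9.0",
    py_version="py38",
)

# II. Run the training job against data stored in Amazon S3
mxnet_estimator.fit({"train": "s3://my-bucket/processed/train/"})

# III. Or wrap the same estimator in a pipeline training step
step_train = TrainingStep(
    name="TrainModel",
    estimator=mxnet_estimator,
    inputs={"train": TrainingInput(s3_data="s3://my-bucket/processed/train/")},
)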

Deploying Models (deploy())


You can use SageMaker Python SDK to deploy a SageMaker model endpoint using
the deploy() and predict() methods. You start by defining your endpoint configuration. The following code
shows the configuration for a serverless endpoint:

You use this configuration in a deploy() method. If the model artifact has already been created, you use the Model class to
create a SageMaker model from it, specifying the model artifact location in Amazon S3 and the inference code to
use as the entry_point:

After deployment is complete, you can use the predictor’s predict() method to invoke the serverless
endpoint:
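A rough sketch of the serverless deployment flow described above; the memory size, concurrency, container image, model artifact path, and request payload are assumptions, and role is as defined earlier:

from sagemaker.model import Model
from sagemaker.predictor import Predictor
from sagemaker.serverless import ServerlessInferenceConfig

# Endpoint configuration for a serverless endpoint
serverless_config = ServerlessInferenceConfig(
    memory_size_in_mb=2048,
    max_concurrency=5,
)

# Create a SageMaker model from an existing model artifact in S3
model = Model(
    image_uri="<inference-container-image-uri>",
    model_data="s3://my-bucket/model/model.tar.gz",
    role=role,
    entry_point="inference.py",
    predictor_cls=Predictor,
)

# Deploy to a serverless endpoint, then invoke it
predictor = model.deploy(serverless_inference_config=serverless_config)
payload = b'{"features": [1.0, 2.0, 3.0]}'  # example request body
response = predictor.predict(payload)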
c) Building and Maintaining Containers

Training container
Inference container

The example below is for a Python-based inference container. Its key files are:

• serve.py: Runs when the container is started for hosting. It starts the inference server, including the nginx web server and Gunicorn as a Python web server gateway interface.
• predictor.py: Contains the logic to load and perform inference with your model. It uses Flask to provide the /ping and /invocations endpoints.
• wsgi.py: A wrapper for the Gunicorn server.
• nginx.conf: Configures the web server, including listening on port 8080. It forwards requests containing either /ping or /invocations paths to the Gunicorn server.

When creating or adapting a container for performing real-time inference, your container must
meet the following requirements:

• Your container must include the path /opt/ml/model. When the inference container starts, it
will import the model artifact and store it in this directory.

Note: This is the same directory that a training container uses to store the newly trained model
artifact.

• Your container must be configured to run as an executable. Your Dockerfile should include an
ENTRYPOINT instruction that defines an executable to run when the container starts, as
ENTRYPOINT ["<language>", "<executable>"]
e.g. ENTRYPOINT ["python", "serve.py"]

• Your container must have a web server listening on port 8080.

• Your container must accept POST requests to the /invocations and /ping real-time endpoints.

• The requests that you send to these endpoints must be returned within 60 seconds and have a size of less than 6 MB.
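To make the contract above concrete, here is a minimal, hypothetical predictor.py implemented with Flask; the model loading and inference logic are placeholders:

# predictor.py - minimal Flask app meeting the real-time inference container contract
import flask

MODEL_DIR = "/opt/ml/model"  # SageMaker places the model artifact here at startup

app = flask.Flask(__name__)
model = None  # in a real container, load the model from MODEL_DIR here

@app.route("/ping", methods=["GET"])
def ping():
    # Health check: return 200 when the container is ready to serve requests
    return flask.Response(response="\n", status=200, mimetype="application/json")

@app.route("/invocations", methods=["POST"])
def invocations():
    # Parse the request body, run inference, and return the prediction
    data = flask.request.data.decode("utf-8")
    prediction = data  # placeholder: replace with the model's predict logic
    return flask.Response(response=prediction, status=200, mimetype="text/csv")

if __name__ == "__main__":
    # In production this runs behind nginx and Gunicorn; port 8080 matches the requirement above
    app.run(host="0.0.0.0", port=8080)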
Auto scaling strategy
a) SageMaker model auto scaling methods

Target tracking scaling policy
• Description: Adjusts capacity to maintain a specified metric near a target value
• Use case: When you want to maintain a specific metric (e.g., CPU utilization) at a target level
• Key features: Specify a metric and target value; automatically adds or removes capacity; good for maintaining consistent performance

Step scaling policy
• Description: Defines multiple policies for scaling based on specific metric thresholds
• Use case: When you need more granular control over scaling actions at different metric levels
• Key features: Define multiple thresholds and corresponding scaling actions; more aggressive response to demand changes; allows fine-tuning of scaling behavior

Scheduled scaling policy
• Description: Scales resources based on a predetermined schedule
• Use case: When demand follows a predictable pattern (daily, weekly, monthly, yearly)
• Key features: Set one-time or recurring schedules; use cron expressions with start and end times; ideal for known traffic patterns

On-demand scaling
• Description: Manually increase or decrease the number of instances
• Use case: For unpredictable or one-off events that require manual intervention
• Key features: Full manual control over scaling; useful for new product launches, unexpected traffic spikes, or special promotions; flexibility to respond to unforeseen circumstances
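To illustrate the target tracking option, the following boto3 sketch registers an endpoint variant as a scalable target and attaches a target tracking policy; the endpoint name, variant name, capacities, and target value are assumptions:

import boto3

autoscaling = boto3.client("application-autoscaling")
resource_id = "endpoint/my-endpoint/variant/AllTraffic"

# Register the endpoint variant as a scalable target
autoscaling.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=1,
    MaxCapacity=4,
)

# Target tracking policy: keep invocations per instance near the target value
autoscaling.put_scaling_policy(
    PolicyName="InvocationsTargetTracking",
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 100.0,
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
        },
        "ScaleInCooldown": 300,
        "ScaleOutCooldown": 60,
    },
)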
3.3 Automate Deployment
3.3.1 Introduction to DevOps
Code repositories
a) GitHub vs GitLab

Feature | GitHub | GitLab
Hosting Options | Cloud-hosted, GitHub Enterprise (self-hosted) | Cloud-hosted, self-hosted (Community and Enterprise Editions)
CI/CD | GitHub Actions (built-in) | GitLab CI/CD (built-in)
Project Management | Projects, Kanban boards | Issue boards, Epics, Roadmaps
Third-party Integrations | Extensive marketplace | Fewer, but strong built-in tools
Open Source | Many open-source projects | Fully open-source core


3.3.2 CI/CD: Applying DevOps to MLOps
Introduction to MLOps
a) CICD in ML Lifecycle

b) Teams in the ML process


• Data engineers: Data engineers are responsible for data sourcing, data cleaning, and data processing.
They transform data into a consumable format for ML and data scientist analysis.
• Data scientists: Responsible for model development including model training, evaluation, and
monitoring to drive insights from data.
• ML engineers: Responsible for MLOps - model deployment, production integration, and monitoring.
They standardize ML systems development (Dev) and ML systems deployment (Ops) for continuous
delivery of high-performing production ML models.
c) Nonfunctional requirements in ML
• Consistency
• Flexibility: Accommodates a wide range of ML frameworks and technologies to adapt to changing
requirements.
• Reproducibility
• Reusability
• Scalability
• Auditability: Provides comprehensive logs, versioning, and dependency tracking of all ML artifacts for
transparency and compliance.
• Explainability: Incorporates techniques that promote decision transparency and model interpretability.

d) Comparing the ML workflow with DevOps


Automating testing in CI/CD Pipelines
SageMaker projects with CI/CD practices

• Unit tests: validate smaller components, like individual functions or methods.
• Integration tests: check that pipeline stages, including data ingestion, training, and deployment, work together
correctly. Other types of integration tests depend on your system or architecture.
• Regression tests: in practice, regression testing is re-running the same tests to make sure something that used to work
was not broken by a change.

Version Control Systems: Getting started with Git


SageMaker projects with CI/CD practices
Continuous Flow Structures : Automate deployment
a) Key components

1) Model training and versioning:


2) Model packaging and containerization:
3) Continuous integration (CI):
4) Monitoring and observability:
5) Rollback and rollforward strategies:

b) Gitflow and GitHub flows

Feature | Gitflow | GitHub Flow
Complexity | More complex | Simpler
Main Branches | main and develop | Single main branch
Feature Development | Feature branches from develop | Feature branches from main
Release Process | Dedicated release branches | Direct to main via pull requests
Hotfixes | Separate hotfix branches | Treated like features
Suited For | Scheduled releases, larger projects | Continuous delivery, smaller projects
Integration Branch | develop branch | N/A (uses main)
Learning Curve | Steeper | Flatter
Flexibility | More rigid structure | More flexible

c) GitFlow
3.3.3 AWS Software Release Processes
Continuous Delivery Services
a) AWS CI/CD Pipeline

CodePipeline
Provides configurable manual approval gates to control releases, detailed monitoring capabilities, and granular permissions to manage pipeline access.
• Pipelines per AWS account: 1,000
• Actions in a single pipeline: 500
• Size of input artifact for a single action: 1 GB
• Custom actions per Region per account: 50
• Webhooks per Region per account: 300

CodeBuild
CodeBuild sets service quotas and limits on builds and compute fleets. These quotas apply per supported AWS Region for each AWS account.
• Detailed logging, auto-scaling capacity, and high availability for builds
• Integrates with other AWS services, like CodePipeline and ECR, for end-to-end CI/CD workflows
• Artifacts can be stored in S3 or other destinations
• Builds can be monitored through the CodeBuild console, Amazon CloudWatch, and other methods
• Fine-grained access controls for build projects using IAM policies

CodeDeploy
CodeDeploy is a deployment service that provides automated deployments, flexible deployment strategies, rollback capabilities, and integration with other AWS services to help manage the application lifecycle across environments.
• Facilitates automated deployments to multiple environments
• Supports deployment strategies like blue/green, in-place, and canary deployments
• Provides rollback capabilities
• Detailed monitoring and logging, and integration with services like EC2, Lambda, and ECS
Best Practices for Configuring & Troubleshooting
CodeBuild
• Purpose: Compiles source code, runs unit tests, and produces deployment-ready artifacts
• Key configuration steps: 1. Create a CodeBuild project 2. Define the build specification 3. Configure the build environment 4. Set up build artifacts 5. Configure CloudWatch Logs (optional)
• Troubleshooting tips: Validate the buildspec file; verify IAM permissions; review service limits; use CloudTrail
• Unique to CodeBuild: Check for network issues; check CodeBuild logs

CodeDeploy
• Purpose: Automates application deployments to various compute platforms
• Key configuration steps: 1. Set up the IAM role 2. Create the CodeDeploy app 3. Create a deployment group 4. Define the deployment configuration
• Unique to CodeDeploy: Review CodeDeploy logs; verify the CodeDeploy agent; validate the AppSpec file; check instance health; analyze the rollback reason

CodePipeline
• Purpose: Models, visualizes, and automates software release steps
• Key configuration steps: 1. Create the CodePipeline pipeline 2. Add a source stage 3. Add a build stage 4. Add a deploy stage 5. Review and create the pipeline
• Troubleshooting tips: Validate build specifications (buildspec.yml); verify input configuration; use CloudTrail; check the IAM permissions
• Unique to CodePipeline: Use CloudWatch; examine runtime details; check pipeline history

Automating Data Integration in ML Pipeline


Code Pipeline vs Step Functions

Aspect | AWS CodePipeline | AWS Step Functions
Primary Purpose | CI/CD and release automation | Workflow orchestration and coordination
Workflow Type | Linear, predefined stages | Complex, branching workflows with conditional logic
Best For | Standard software deployment pipelines | Complex, multi-step processes and microservices orchestration
MLOps with Code Pipeline and Step Functions
• MLOps Overview:
o Set of practices and tools for streamlining ML model deployment, monitoring, and
management
o Focuses on automating ML workflows in production environments
• AWS Step Functions:
o Fully managed visual workflow service for building distributed applications
o Represents pipeline stages (preprocessing, training, evaluation, deployment) as task
states
o Manages control flow between stages
o Can integrate with Lambda, AWS Batch, and other AWS services
• AWS CodePipeline:
o Fully managed continuous delivery service
o Automates release pipelines for MLOps workflows
o Represents each stage of the pipeline as an action
• Integration of Step Functions and CodePipeline:
a) CodePipeline invokes MLOps pipeline based on events (e.g., new model version
commit)
b) Pipeline stages include source code management, model building, testing, and
deployment
c) CodePipeline starts Step Functions state machine to initiate MLOps workflow
d) Can pass input data or parameters to configure the workflow
• Benefits of Integration:
o Enables efficient movement of models through development lifecycle
o Automates the entire process from training to production deployment
o Provides flexibility in configuring and managing complex ML workflows
Deployment strategies

Comparing deployment strategies

Blue/green deployment
• Maintains two identical production environments, one blue (existing) and one green (new), and gradually shifts traffic between them.
• Use when: you need instant rollback capability; for critical applications requiring zero downtime; when your application can handle sudden traffic shifts.

Canary deployment
• Gradually rolls out a new model version to a small portion of users.
• Use when: you want to test new features with a subset of users; you want to gather user feedback before full release; for applications with high traffic where you want to minimize risk.

Rolling deployment
• Gradually replaces the previous deployment with the new model version by updating the endpoint in configurable batch sizes.
• Use when: you have a stateful application; for large-scale deployments where cost is an issue; when you can tolerate having mixed versions temporarily.

The baking period is a set time for monitoring the green fleet's performance before completing the full transition,
making it possible to rollback if alarms trip. This period builds confidence in the new deployment before permanent
cutover.
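The following boto3 sketch shows one way a blue/green update with a canary traffic shift and a baking period can be expressed when updating a SageMaker endpoint; the endpoint, config, and alarm names, as well as the sizes and wait times, are assumptions:

import boto3

sm = boto3.client("sagemaker")

sm.update_endpoint(
    EndpointName="my-endpoint",
    EndpointConfigName="my-endpoint-config-v2",  # the new (green) configuration
    DeploymentConfig={
        "BlueGreenUpdatePolicy": {
            "TrafficRoutingConfiguration": {
                "Type": "CANARY",
                "CanarySize": {"Type": "CAPACITY_PERCENT", "Value": 10},
                "WaitIntervalInSeconds": 600,  # baking period before shifting the rest of the traffic
            },
            "TerminationWaitInSeconds": 300,  # keep the blue fleet briefly for rollback
        },
        "AutoRollbackConfiguration": {
            "Alarms": [{"AlarmName": "my-endpoint-error-alarm"}]  # roll back if this alarm trips
        },
    },
)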
3.3.4 Retraining models
Retraining models
a) Retraining mechanisms

Automated retraining pipelines
• Invoke model retraining when new data becomes available.

Scheduled retraining
• Periodic jobs retrain the model at regular intervals.

Drift detection and invoked retraining
• Invoke retraining when the model's performance starts to degrade. (SageMaker Model Monitor can detect model drift, and Lambda can be used to initiate the retraining process.)

Incremental learning
• Incremental learning allows the model to be updated with new data without completely retraining the model from scratch. (SageMaker supports several algorithms for this, such as XGBoost and Linear Learner.)

Experimentation and A/B testing
• Retraining can be paired with experimentation and A/B testing to compare various model versions. (SageMaker and Amazon Personalize can be used to deploy and manage these experiments.)
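As a sketch of the drift-detection mechanism above, a Lambda function invoked by a Model Monitor alert could start a retraining pipeline; the pipeline name is hypothetical:

import boto3

sm = boto3.client("sagemaker")

def lambda_handler(event, context):
    # Invoked (for example, by a CloudWatch alarm or EventBridge rule) when drift is detected
    response = sm.start_pipeline_execution(
        PipelineName="model-retraining-pipeline",
        PipelineExecutionDisplayName="drift-triggered-retrain",
    )
    return {"pipelineExecutionArn": response["PipelineExecutionArn"]}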

b) Catastrophic forgetting during retraining and transfer learning


Catastrophic forgetting is a phenomenon that occurs in machine learning models, particularly in the context of
continual or lifelong learning.

Catastrophic forgetting is a type of overfitting: the model learns the new training data so well that it no
longer performs well on the data it was previously trained on.

Why does catastrophic forgetting happen?


a. Retraining and optimized training: The primary reason for catastrophic forgetting is that the parameters of
the model are typically updated to optimize for the current task. These updates effectively overwrite the
knowledge acquired from previous tasks.
b. Transfer Learning: Transfer learning is an ML approach where a pre-trained model, which was trained on
one task, is fine-tuned for a related task. Organizations can make use of transfer learning to retrain existing
models on new, related tasks using a smaller dataset.
Solving catastrophic forgetting

Rehearsal-based
• Main idea: Retrain on a subset of old data along with new data
• Advantages: Directly addresses forgetting; conceptually simple
• Limitations: Requires storing old data; can be computationally expensive

Architectural
• Main idea: Modify the network architecture to accommodate new tasks
• Advantages: Can be very effective; doesn't require old data
• Limitations: May increase model complexity; can be challenging to design

Replay-based
• Main idea: Generate synthetic data to represent old tasks
• Advantages: Doesn't need storing old data; works well with generative AI models
• Limitations: Quality depends on the model; may not capture all aspects of old data

Regularization-based
• Main idea: Add constraints to limit changes to important parameters
• Advantages: Doesn't require old data or architecture changes; often computationally efficient
• Limitations: May limit learning of new tasks; determining importance can be challenging

Configuring Inferencing Jobs


a) Inference types

Differences between training and inferencing

Training | Inferencing
Requires high parallelism with large batch processing for higher throughput | Usually runs on a single input in real time
More compute- and memory-intensive | Less compute- and memory-intensive
Standalone item not integrated into the application stack | Integrated into application stack workflows
Runs in the cloud | Runs on different devices at the edge and in the cloud
Typically runs less frequently and on an as-needed basis | Runs for an indefinite amount of time
Compute capacity requirements are typically predictable, so auto scaling is usually not required | Compute capacity requirements might be dynamic and unpredictable, so auto scaling is required
Domain 4: Monitor Model
4.1 Monitor Model Performance and Data Quality
4.1.1 Monitoring Machine Learning Solutions

Importance of Monitoring in ML
a) Machine Learning Lens: AWS Well-Architected Framework: Best practices and design principles

Optimize resources
• Resource pooling: sharing compute, storage, and networking resources
• Caching
• Data management: data compression, partitioning, and lifecycle management

Scale ML workloads based on demand
• AWS Auto Scaling and SageMaker built-in scaling
• Lambda

Reduce cost
• Monitor usage and costs: resource tagging
• Monitor ROI

Enable continuous improvement
• Establish feedback loops
• Monitor performance: SageMaker Model Monitor (drift), CloudWatch alerts (deviations)
• Automate retraining
Detecting Drift in Monitoring
a) Drift Types

Data Quality Drift
• Description: Production data distribution differs from the training data distribution
• Causes: Real-world data not as curated as training data; changes in data collection processes; shifts in real-world conditions
• Implications: Model accuracy decreases; predictions become less reliable

Model Quality Drift
• Description: Model predictions differ from actual ground truth labels
• Causes: Changes in the underlying relationship between features and target; model decay over time; concept drift
• Implications: Decreased model performance; inaccurate predictions

Bias Drift
• Description: Increase in bias affecting model predictions over time
• Causes: Training data too small or not representative; incorporation of societal assumptions in training data; exclusion of important data points; changes in real-world data distribution
• Implications: Model overgeneralization; unfair or discriminatory predictions; ethical concerns; new groups in production

Feature Attribution Drift
• Description: Changes in the contribution of individual features to model predictions
• Causes: Shifts in feature importance over time; changes in the underlying problem domain; introduction of new, more predictive features
• Implications: Model may rely on less relevant features; decreased interpretability; potentially reduced performance

Note: Bias is the inverse of variance, which is the level of small fluctuations or noise common in complex data sets.
Bias tends to cause model predictions to overgeneralize, and variance tends to cause models to
undergeneralize. Increasing variance is one method for reducing the impact of bias.
b) Monitoring Drift

Data Quality Monitoring
• What it monitors: Missing values, outliers, data types, statistical metrics (mean, std dev, etc.), data distribution
• How it monitors: Implement data validation checks; calculate statistical metrics; compare metrics with baseline values; use data drift detection techniques (e.g., Kolmogorov-Smirnov tests, Maximum Mean Discrepancy)

Model Quality Monitoring
• What it monitors: Evaluation metrics (accuracy, precision, recall, F1, AUC, etc.), prediction confidence, performance across different subpopulations
• How it monitors: Calculate evaluation metrics on a held-out test set or production data sample; implement confidence thresholding or uncertainty estimation; flag low-confidence predictions; monitor performance on different data subsets

Model Bias Drift Monitoring
• What it monitors: Bias metrics (disparate impact, fairness, etc.), performance across sensitive groups
• How it monitors: Calculate bias metrics for different sensitive groups; compare bias metrics with baseline values or thresholds; implement bias mitigation techniques (e.g., adversarial debiasing, calibrated equalized odds)

Feature Attribution Drift Monitoring
• What it monitors: Feature importance scores, statistical metrics of feature attributions
• How it monitors: Use interpretability techniques (e.g., SHAP) to calculate feature attributions; calculate statistical metrics on feature attributions; compare metrics with baseline values; identify features with significantly changed attributions
SageMaker Model Monitor
Integration

SageMaker - Monitoring for Data Quality Drift

STEPS

1. Initiate data capture on the endpoint


2. Create a baseline

Start a baseline processing job with the suggest_baseline method of the DefaultModelMonitor object using the SageMaker Python SDK.
3. Schedule data quality monitoring jobs
4. Integrate data quality monitoring with CloudWatch
5. Interpret results and analyze findings

The report is generated as the constraint_violations.json file. The SageMaker Model


Monitor prebuilt container provides the following violation checks.
• data_type_check
• completeness_check
• baseline_drift_check
• missing_column_check
• extra_column_check
• categorical_values_check
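Putting these steps together, here is a hedged sketch using the SageMaker Python SDK; the bucket paths, endpoint name, instance types, and capture percentage are assumptions, and model is a SageMaker Model object as in the deployment examples earlier:

from sagemaker.model_monitor import (
    CronExpressionGenerator,
    DataCaptureConfig,
    DefaultModelMonitor,
)
from sagemaker.model_monitor.dataset_format import DatasetFormat

endpoint_name = "my-endpoint"

# 1. Initiate data capture when deploying the endpoint
data_capture_config = DataCaptureConfig(
    enable_capture=True,
    sampling_percentage=100,
    destination_s3_uri="s3://my-bucket/data-capture/",
)
model.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.xlarge",
    endpoint_name=endpoint_name,
    data_capture_config=data_capture_config,
)

# 2. Create a baseline from the training dataset
monitor = DefaultModelMonitor(role=role, instance_count=1, instance_type="ml.m5.xlarge")
monitor.suggest_baseline(
    baseline_dataset="s3://my-bucket/train/train.csv",
    dataset_format=DatasetFormat.csv(header=True),
    output_s3_uri="s3://my-bucket/baseline/",
)

# 3. Schedule hourly data quality monitoring against the captured traffic
monitor.create_monitoring_schedule(
    monitor_schedule_name="data-quality-schedule",
    endpoint_input=endpoint_name,
    statistics=monitor.baseline_statistics(),
    constraints=monitor.suggested_constraints(),
    schedule_cron_expression=CronExpressionGenerator.hourly(),
    output_s3_uri="s3://my-bucket/monitoring-reports/",
)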
SageMaker - Monitoring for Model Quality Drift using Model Monitor

To monitor model quality, SageMaker Model Monitor requires the following inputs:

1. Baseline data
2. Inference input and predictions made by the deployed model
3. Amazon SageMaker Ground Truth associated with the inputs to the model

SageMaker - Monitoring for Bias using Clarify


Statistical bias drift occurs when the data used for training differs from the data encountered during
prediction, leading to potentially biased outcomes. This is prominent when training data changes over
time. In this lesson, you will learn about AWS services that help you monitor for statistical bias drift.

Post-training bias metrics in SageMaker Clarify help us answer two key questions:

• Are all facet values represented at a similar rate in positive (favorable) model predictions?

• Does the model have similar predictive performance for all facet values?

SageMaker Model Monitor automatically does the following:

• Merges the prediction data with SageMaker Ground Truth labels


• Computes baseline statistics and constraints
• Inspects the merged data and generates bias metrics and violations
• Emits CloudWatch metrics to set up alerts and triggers
• Reports and alerts on bias drift detection
• Provides reports for visual analysis

How it works: It quantifies the contribution of each input feature (for example, audio characteristics)
to the model's predictions, helping to explain how the model arrives at its decisions.
Options for using SageMaker Clarify

When to use which

SageMaker Clarify Directly
• Description: Configure and run a Clarify processing job using the SageMaker Python SDK API
• When to use: For one-time or ad hoc bias analysis; when you need full control over the analysis configuration; for integrating bias analysis into custom workflows

SageMaker Model Monitor + Clarify
• Description: Integrate Clarify with Model Monitor for continuous bias monitoring
• When to use: When you want to automate bias detection in production; if you need to set up alerts for bias drift

SageMaker Data Wrangler
• Description: Use Clarify within Data Wrangler during data preparation
• When to use: During the data preparation phase; when you want to identify potential bias early in the ML pipeline; if you're already using Data Wrangler for data preprocessing
SageMaker - Monitoring for Feature Attribution Drift (Model Monitor + Clarify)
Feature attribution refers to understanding and quantifying the contribution or influence of each
feature on the model's predictions or outputs. It helps to identify the most relevant features and their
relative importance in the decision-making process of the model.

Uses SHAP

SageMaker Clarify provides feature attributions based on the concept of Shapley value. This is a game-
theoretic approach that assigns an importance value (SHAP value) to each feature for a particular
prediction.

Here's how it works:

1. SageMaker Clarify: This is the core component that performs the actual bias detection
and generate quality metrics and violations
2. SageMaker Model Monitor: This is the framework that can use Clarify's capabilities to
perform continuous monitoring of deployed models.
SageMaker Model Dashboard

Features

1. Alerts :
How it helps: The dashboard provides a record of all activated alerts, allowing the data
scientist to review and analyze past issues.
Alert criteria depend upon two parameters:
• Datapoints to alert: Within the evaluation period, how many runtime failures raise an alert?
• Evaluation period: The # of most recent monitoring executions to consider when evaluating
alert status.
2. Risk rating

A user-specified parameter from the model card with a low, medium, or high value.

3. Endpoint performance

You can select the endpoint column to view performance metrics, such as:

• CpuUtilization: The sum of each individual CPU core's utilization from 0%-100%.
• MemoryUtilization: The % of memory used by the containers on an instance, 0%-100%.
• DiskUtilization: The % of disk space used by the containers on an instance, 0%-100%.

4. Most recent batch transform job

This information helps you determine if a model is actively used for batch inference.

5. Model lineage graphs

When training a model, SageMaker creates a model lineage graph, a visualization of the entire ML
workflow from data preparation to deployment.

6. Links to model details


The dashboard links to a model details page where you can explore an individual model.

Model Monitor vs SageMaker Dashboard vs Clarify: When to use which one


Model Monitor
• Description: Continuous monitoring of ML models in production
• Why to use: Detects data and model quality issues; detects model drift
• When to use: To set up automated alerts for performance degradation; when you need to monitor resource utilization; to monitor real-time endpoints, batch transform jobs, or on-demand monitoring jobs

SageMaker Dashboard
• Description: Centralized view of SageMaker resources and jobs
• When to use: For a high-level overview of all SageMaker activities; to track training jobs, endpoints, and notebook instances

SageMaker Clarify
• Description: Bias detection and model explainability tool
• Why to use: Detects bias; triggers statistics and violations reports
• When to use: To detect bias in training data and model predictions; when you need to explain model decisions; for regulatory compliance requiring model transparency; to improve model fairness and accountability
4.1.2 Remediating Problems Identified by Monitoring
Automated remediations and notifications

• Stakeholder notifications: When monitoring metrics indicate changes that impact business
KPIs or the underlying problem
• Data Scientist notification: You can use automated notifications to data scientists when
your monitoring detects data drift or when expected data is missing.
• Model retraining: Configure your model training pipeline to automatically retrain models
when monitoring detects drift, bias, or performance degradation.
• Autoscaling: You use resource utilization metrics gathered by infrastructure monitoring to
initiate autoscaling actions.

Model retraining strategies

Event-driven
• When to use: When drift is detected above a certain threshold; in response to significant changes in data or performance
• Advantages: Timely response to changes; efficient use of resources
• Considerations: May be frequent if thresholds are too sensitive; retraining can be expensive and time-consuming

On-demand
• When to use: When market conditions change significantly; in response to new competitors or strategies
• Advantages: Allows for human judgment in decision-making; can incorporate business context
• Considerations: Requires constant monitoring by data scientists or stakeholders; may lead to delayed responses

Scheduled
• When to use: When there are known seasonal patterns; for maintaining model accuracy over time
• Advantages: Predictable maintenance schedule; can anticipate and prepare for retraining periods
• Considerations: May retrain unnecessarily if no significant changes occur; might miss sudden, unexpected changes
4.2 Monitor and Optimize Infrastructure and Costs
4.2.1 Monitor Infrastructure
Monitor Performance Metrics - CloudWatch vs Model Monitor
Purpose
• SageMaker Model Monitor: Continuous monitoring of ML models in production
• CloudWatch Logs: Monitoring, storing, and accessing log files

Key Capabilities
• SageMaker Model Monitor (all four ML monitoring types): data quality monitoring; model quality monitoring; bias drift monitoring; feature attribution drift monitoring
• CloudWatch Logs: log collection from various sources; log storage in S3; pattern recognition; log anomaly detection

Monitoring Types
• SageMaker Model Monitor: real-time endpoint monitoring; batch transform job monitoring; on-schedule monitoring for async batch jobs
• CloudWatch Logs: EC2 instances; CloudTrail; Amazon Route 53; other sources

Alert System
• SageMaker Model Monitor: Set alerts for deviations in model quality
• CloudWatch Logs: Notifications based on preset thresholds

Customization
• SageMaker Model Monitor: Pre-built monitoring capabilities (no coding); custom analysis options
• CloudWatch Logs: Customizable log patterns and anomaly detection

Monitoring vs. Observability


Definition
• Monitoring: Continuous collection and analysis of metrics
• Observability: Deep insights into the internal state and behavior of ML systems

Focus
• Monitoring: Detecting anomalies and deviations
• Observability: Understanding complex interactions and dependencies

Key Activities
• Monitoring: Collecting metrics; logging; alerting
• Observability: Analyzing system behavior; identifying root causes; reasoning about system health

Techniques
• Monitoring: Metric collection; threshold-based alerting; basic log analysis
• Observability: Distributed tracing; structured logging; advanced data visualization

Outcome
• Monitoring: Detect issues and invoke alerts or automated actions
• Observability: Provide deeper insights for troubleshooting and optimization

Scope
• Monitoring: Primarily focused on predefined metrics and thresholds
• Observability: Enables asking and answering questions about system behavior
Monitoring Tools (for Performance and Latency)
AWS X-Ray
• Purpose: Trace information about responses and calls in applications
• Key features: Works across AWS and third-party services; generates detailed service graphs; identifies performance bottlenecks
• Compatible services: EC2, ECS, Lambda, Elastic Beanstalk
• ML use cases: Analyze bottlenecks in ML systems; trace requests in ML applications (e.g., chatbot inference)
• Visualization: Service maps, trace views
• Primary benefit: End-to-end request tracing and bottleneck identification

CloudWatch Lambda Insights
• Purpose: In-depth performance monitoring for Lambda functions only
• Key features: Monitors metrics (memory, duration, invocation count); provides detailed logs and traces; helps identify bottlenecks in Lambda functions
• Compatible services: Lambda
• ML use cases: Monitor and optimize ML models deployed as Lambda functions; identify root causes of Lambda function issues
• Visualization: Performance dashboards, trace details
• Primary benefit: Detailed Lambda function performance insights

CloudWatch Logs Insights
• Purpose: Interactive log analytics service
• Key features: Interactive querying and analysis of log data; correlates log data from different sources; visualizes time series data; supports aggregations, filters, and regex
• Compatible services: Any service that generates logs in CloudWatch
• ML use cases: Analyze logs from ML workloads; identify patterns and anomalies in ML system behavior
• Visualization: Time series graphs, log event views
• Primary benefit: Flexible, interactive log analysis and visualization

QuickSight
• Purpose: BI and data visualization service
• Key features: Interactive dashboards; ML-powered insights; supports various data sources
• Compatible services: Various AWS services and external data sources
• ML use cases: Create dashboards for ML experiment results; analyze and present insights from ML predictions
• Visualization: Dashboards, charts, graphs
• Primary benefit: Comprehensive data visualization and business intelligence

SageMaker w/ EventBridge

Actions that can be automatically invoked using EventBridge:

a) Invoking an AWS Lambda function


b) Invoking Amazon EC2 run command (not create or deploy)
c) Relaying event to Kinesis Data Streams
d) Activating an AWS Step Functions state machine.
e) Notifying an SNS topic or an Amazon SQS queue.
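For example, a rule that reacts to SageMaker training job state changes and invokes a Lambda function might be set up roughly as follows; the rule name and function ARN are placeholders, and the Lambda function also needs a resource-based permission allowing events.amazonaws.com to invoke it:

import json
import boto3

events = boto3.client("events")

# Rule matching SageMaker training job state-change events
events.put_rule(
    Name="sagemaker-training-state-change",
    EventPattern=json.dumps({
        "source": ["aws.sagemaker"],
        "detail-type": ["SageMaker Training Job State Change"],
        "detail": {"TrainingJobStatus": ["Completed", "Failed"]},
    }),
    State="ENABLED",
)

# Send matching events to a Lambda function
events.put_targets(
    Rule="sagemaker-training-state-change",
    Targets=[{"Id": "notify-fn",
              "Arn": "arn:aws:lambda:us-east-1:123456789012:function:notify"}],
)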
4.2.2 Optimize Infrastructure
Inference Recommender types
a) Inference Recommendation Types
Default Advanced
Endpoint Recommender Endpoint Recommender + Inference Recommender
45 mins 2 hours

b) Endpoint Recommender vs Inference Recommender

Endpoint Recommender Inference Recommender


Output list (or ranking) of prospective Same
instances
run a set of load tests. based on a custom load test.
What you -N/A - your desired ML instances or a
need to do serverless endpoint, provide a
custom traffic pattern, and
provide requirements for
latency and throughput

c) How to start

SageMaker Inference Recommender
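A rough sketch of starting a default recommendation job with boto3; the job name, role ARN, and model package ARN are placeholders:

import boto3

sm = boto3.client("sagemaker")

# Default job: SageMaker load-tests the model package and ranks candidate instance types
sm.create_inference_recommendations_job(
    JobName="xgboost-recommendation-default",
    JobType="Default",  # use "Advanced" to supply a custom traffic pattern and latency/throughput requirements
    RoleArn="arn:aws:iam::123456789012:role/SageMakerExecutionRole",
    InputConfig={
        "ModelPackageVersionArn": "arn:aws:sagemaker:us-east-1:123456789012:model-package/my-model/1"
    },
)

# Retrieve the ranked recommendations once the job completes
result = sm.describe_inference_recommendations_job(JobName="xgboost-recommendation-default")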

d) Sample Recommender output


4.2.3 Optimize Costs
Instance purchasing options

Spot Instances
• Description: Spare EC2 capacity at lower prices; can be interrupted
• Best for: Interruptible workloads
• Cost savings: Up to 90% vs On-Demand
• Example use case: Data preprocessing or batch processing

On-Demand Instances
• Description: Pay-per-use with no long-term commitment
• Best for: Short-term, unpredictable workloads
• Cost savings: None (baseline)
• Example use case: Real-time inference services

Reserved Instances
• Description: Discounted rates for 1- or 3-year commitments
• Best for: Steady-state, predictable workloads
• Cost savings: Up to 72% vs On-Demand
• Example use case: Long-running ML training jobs

Capacity Blocks
• Description: Reserved capacity for AWS Outposts or Wavelength Zones
• Best for: Ensuring capacity during peak demand
• Cost savings: Varies
• Example use case: ML workloads in on-premises environments

Savings Plans for SageMaker
• Description: Commit to a specific compute usage for 1 or 3 years
• Best for: Flexible, recurring SageMaker usage
• Cost savings: Up to 64% vs On-Demand
• Example use case: Regular model training and deployment
4.3 Secure AWS ML Resources

4.3.1 Securing ML Resources


Access Control using IAM
a) Roles vs Policies

User Roles
• Data Scientist / ML Engineer: Provides access for experimentation. Access to S3, Athena, SageMaker Studio.
• Data Engineer: Provides access for data management. Access to S3, Athena, AWS Glue, EMR.
• MLOps Engineer: Provides access for ML operations. Access to SageMaker, CodePipeline, CodeBuild, CloudFormation, ECR, Lambda, Step Functions.

Service Roles
• SageMaker Execution: Allows SageMaker to perform tasks on behalf of users. General SageMaker operations.
• Processing Job: Specific to SageMaker processing jobs. Data processing tasks.
• Training Job: Specific to SageMaker training jobs. Model training tasks.
• Model: Specific to SageMaker model deployment. Model deployment and hosting.

IAM Policies
• Identity-based: Attached to IAM users, groups, or roles. Define actions allowed on specific resources.
• Resource-based: Attached to resources (e.g., S3 buckets). Control who can access specific resources.
IAM Policy – Examples for ML workflows

1. Least-privilege access for an ML workflow
• Key permissions: sagemaker:CreateTrainingJob, sagemaker:CreateModel; s3:GetObject, s3:PutObject; ecr:BatchGetImage; cloudwatch:PutMetricData
• Resource scope: Specific ARNs for each service
• Notes: Adheres to the principle of least privilege

2. Read metadata of ML resources
• Key permissions: machinelearning:Get*, machinelearning:Describe*
• Resource scope: Specific MLModel ARNs for Get*; * (all) for Describe*
• Notes: Allows reading metadata but not modifying resources

3. Create ML resources
• Key permissions: machinelearning:CreateDataSourceFrom*, machinelearning:CreateMLModel, machinelearning:CreateBatchPrediction, machinelearning:CreateEvaluation
• Resource scope: * (all)
• Notes: Create actions cannot be restricted to specific resources

4. Manage real-time endpoints and predictions
• Key permissions: machinelearning:CreateRealtimeEndpoint, machinelearning:DeleteRealtimeEndpoint, machinelearning:Predict
• Resource scope: Specific MLModel ARN
• Notes: Allows management of endpoints for a specific model
Detailed examples

1. identity-based policy used in a machine learning use case

2. Allow users to read machine learning resources metadata

3. Allow users to create machine learning resources

4. Allow users to create /delete real-time endpoints and perform real-time predictions on an ML model
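As an illustration of example 1, a least-privilege identity-based policy might look roughly like the following; the account ID, Region, bucket name, and ARNs are placeholders:

import json
import boto3

policy_document = {
    "Version": "2012-10-17",
    "Statement": [
        {   # SageMaker actions scoped to the account's SageMaker resources
            "Effect": "Allow",
            "Action": ["sagemaker:CreateTrainingJob", "sagemaker:CreateModel"],
            "Resource": "arn:aws:sagemaker:us-east-1:123456789012:*",
        },
        {   # S3 access limited to a single ML bucket
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:PutObject"],
            "Resource": "arn:aws:s3:::my-ml-bucket/*",
        },
        {   # Pull container images and publish metrics
            "Effect": "Allow",
            "Action": ["ecr:BatchGetImage", "cloudwatch:PutMetricData"],
            "Resource": "*",
        },
    ],
}

iam = boto3.client("iam")
iam.create_policy(
    PolicyName="ml-workflow-least-privilege",
    PolicyDocument=json.dumps(policy_document),
)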

To ensure access only from VPC, use VPC Endpoints for:

• S3
• CloudWatch Logs
• SageMaker runtime
• SageMaker API
4.3.2 SageMaker Compliance & Governance
AWS Services for Compliance and Governance
AWS Artifact
• Purpose: Provide on-demand access to AWS compliance reports and agreements
• Key features: Self-service portal; access to compliance documentation
• ML-related use case: Access HIPAA compliance reports for healthcare ML projects

AWS Config
• Purpose: Monitor and evaluate AWS resource configurations
• Key features: Continuous monitoring; automated configuration evaluation
• ML-related use case: Monitor SageMaker resource configurations for compliance with security policies

AWS Audit Manager
• Purpose: Continuously audit AWS usage for risk and compliance assessment
• Key features: Streamlined auditing process against regulations and standards
• ML-related use case: Assess compliance of ML workflows with industry standards

AWS Security Hub
• Purpose: Centralized view of security alerts and security posture
• ML-related use case: Monitor security posture across ML workflows and resources

Amazon Inspector
• Purpose: Automated vulnerability management
• Key features: Continuous scanning for vulnerabilities
• ML-related use case: Scan container images in ECR for ML model deployments

AWS Service Catalog
• Purpose: Create and manage catalogs of pre-approved resources
• Key features: Governance-compliant resource catalogs
• ML-related use case: Create catalogs of compliant SageMaker resources and ML models

Amazon SageMaker Governance Tools Summary


SageMaker Role Manager
• Purpose: Simplify access control
• Key features: Define minimum permissions for ML activities; quick setup and streamlined access management

SageMaker Model Cards
• Purpose: Document and share model information
• Key features: Record intended uses; document risk ratings

SageMaker Model Dashboard
• Purpose: Provide an overview of models
• Key features: Unified view of all models in the account; monitor model behavior in production

SageMaker Assets
• Purpose: Streamline ML asset management
• Key features: Publish ML and data assets; share assets across teams

Model Governance and Explainability
• Purpose: Ensure compliance and transparency
• Key features: Protect data and workloads; ensure compliance with standards; enhance model interpretability
Compliance certifications and regulatory Frameworks

Governance/Framework | Description | AWS Services to Use
ISO 27001 | Information Security Management System standard | AWS Config, AWS Security Hub
SOC 2 | Service Organization Control for service organizations | AWS Artifact, AWS Config, SageMaker Model Cards
PCI-DSS | Payment Card Industry Data Security Standard | AWS Config, AWS WAF, Amazon Inspector
HIPAA | Health Insurance Portability and Accountability Act | AWS Artifact, AWS Security Hub, AWS Config
FedRAMP | Federal Risk and Authorization Management Program | AWS CloudTrail, AWS Config

Note: AWS Config common to all


4.3.3 Security Best Practices for CI/CD Pipelines
CI/CD pipeline stages

CI/CD Stage Security Tools/Practices


• pre-commit hooks (scripts)
• IDE plugins to
Pre-Commit o analyze code, detect issues
o provide recommendations for improvements.
o handle linting, formatting, beautifying, and securing code.
Commit Static Application Security Testing (SAST),
Build Software Composition Analysis (SCA)
o identifies the open-source packages used in code
o defining vulnerabilities and potential compliance-based issues
o scan infrastructure as code (IaC) manifest files
Test • Dynamic Application Security Testing (DAST)
• Interactive Application Security Testing (IAST)
o Combine the advantages of SAST and DAST tools.
Deploy • Penetration testing
Monitor • Red/Blue/Purple teaming
4.3.4 Implement Security & Compliance w/ Monitoring, Logging and Auditing
CloudTrail for ML Resource Monitoring and Logging
Compliance Auditing
• Description: Generate audit trails using CloudWatch Logs and CloudTrail
• Benefits: Demonstrate compliance with regulations; meet internal policy requirements

Resource Optimization
• Description: Monitor resource utilization metrics
• Benefits: Optimize ML workloads; prevent resource abuse and DoS attacks

Incident Response
• Description: Investigate and respond to security incidents
• Benefits: Identify unauthorized access attempts; detect and respond to data breaches

Anomaly Detection
• Description: Implement ML models to detect unusual patterns
• Benefits: Identify potential security threats; detect deviations in monitoring data

SageMaker Security Troubleshooting and Debugging Summary


CloudTrail Logs
• Purpose: Monitor API calls
• Key information provided: Caller identity; timestamps; API details
• Use case: Identify unauthorized API calls to SageMaker resources

Data Event Logs
• Purpose: Monitor data plane operations
• Key information provided: Input/output data for training and inference
• Use case: Verify whether unauthorized entities accessed model data

IAM Policies
• Purpose: Manage access control
• Key information provided: Permissions granted for SageMaker resources and operations
• Use case: Identify overly permissive policies; ensure least privilege

VPC Flow Logs
• Purpose: Monitor network traffic
• Key information provided: Network traffic to/from SageMaker resources
• Use case: Identify suspicious IP addresses or communication patterns

Encryption Settings
• Purpose: Ensure data protection
• Key information provided: Encryption status (at rest and in transit); AWS KMS key configurations
• Use case: Verify proper data encryption and key management

AWS PrivateLink
• Purpose: Enhance network security
• Key information provided: Private connections between the VPC and SageMaker
• Use case: Ensure traffic remains within the AWS network
Domain X: Misc
X.1 SageMaker Deep Dive
X.1.1 Fully Managed Notebook Instances with Amazon SageMaker

Elastic Inference
Elastic Inference is a service that allows attaching a portion of a GPU to an existing EC2
instance. This approach is particularly useful when running inference locally on a notebook
instance. By selecting an appropriate Elastic Inference configuration based on size, version,
and bandwidth, users can accelerate their inference tasks without needing a full GPU.

Use Cases for Elastic Inference

• You need to run inference tasks locally on your notebook instance.


• Your workload benefits from GPU acceleration but doesn't require a full GPU.
• You want to optimize cost by only paying for the portion of GPU resources used.
X.1.2 SageMaker Built-in Algorithms

Task Category | Algorithms | Supervised/Unsupervised
Classification | Linear Learner (distributed), XGBoost, KNN, Factorization Machines | Supervised
Regression | Linear Learner, XGBoost, KNN | Supervised
Computer Vision | Object Detection (incremental), Semantic Segmentation | Supervised
Working with Text | BlazingText | Supervised / Unsupervised
Sequence Translation | Seq2Seq (distributed) | Supervised
Recommendation | Factorization Machines (distributed), KNN | Supervised
Anomaly Detection | Random Cut Forest (distributed), IP Insights (distributed) | Unsupervised / Semi-supervised
Topic Modeling | LDA, NTM | Unsupervised
Forecasting | DeepAR (distributed) | Supervised
Clustering | K-means (distributed), KNN | Unsupervised
Feature Reduction | PCA, Object2Vec | Unsupervised / Semi-supervised
X.1.3 SageMaker Training types

1. Built-in Algorithms
• Description: Pre-configured algorithms provided by Amazon SageMaker, optimized for performance and ease of use
• When to use: Working on common ML tasks (e.g., classification, regression); when you need a quick start without deep ML expertise

2. Script Mode
• Description: Custom training scripts using popular ML frameworks (e.g., TensorFlow, PyTorch, Scikit-learn)
• When to use: You have existing scripts in popular ML frameworks; for customizing model architecture while leveraging SageMaker's infrastructure

3. Docker Container
• Description: Custom Docker containers with your own algorithms or environments
• When to use: Need complete control over the training environment; custom or proprietary algorithms; for complex, multi-step training pipelines

4. AWS Marketplace
• Description: Pre-built algorithms and models from third-party vendors available through the AWS Marketplace
• When to use: Need industry-specific or specialized models; when you want to explore alternative solutions without building from scratch

5. Notebook Instance
• Description: Interactive development and training using Jupyter notebooks on managed instances
• When to use: During the initial stages of model development; when you need an interactive environment for debugging and visualization

Key Considerations:

• Skill Level: Built-in Algorithms and Marketplace for beginners, Script Mode and Containers
for more advanced users
• Customization Needs: From low (Built-in) to high (Containers)
• Development Speed: Notebooks for rapid prototyping, Built-in for quick deployment,
Containers for complex but reproducible setups
• Scale: Consider moving from Notebooks to other options as your data and model
complexity grow.
X.1.4 Train Your ML Models with Amazon SageMaker
Splitting Data for ML
X.1.5 Tuning Your ML Models with Amazon SageMaker
Maximizing Efficiency across tuning jobs

X.1.6 Tuning Your ML Models with Amazon SageMaker

How to automate

Put a check on model accuracy; if it falls below a threshold (for example, 80%), invoke a human-in-the-loop review.
X.1.6 Add Debugger to Training Jobs in Amazon SageMaker

How it works
1. Add debugging hook:
o An EC2 instance with an attached EBS volume is used to initiate the process.
o The debugging hook is added to the training job configuration.
2. Hook listens to events and records tensors:
o Docker containers running on EC2 instances are used for the training job.
o The hook listens for specific events during the training process and records tensor data.
3. Debugger applies rules to tensors:
o Another EC2 instance with a Docker container is used for debugging.
o The debugger applies predefined rules to the recorded tensor data.

Benefits of debugger

1. Comprehensive Built-in Rules/Algorithms: The debugger offers a wide range of built-in rules to
detect common issues in machine learning models, such as:

o DeadRelu, ExplodingTensor, PoorWeightInitialization


o SaturatedActivation, VanishingGradient
o WeightUpdateRatio, AllZero, ClassImbalance
o Confusion, LossNotDecreasing, Overfit
o Overtraining, SimilarAcrossRuns
o TensorVariance, UnchangedTensor
o CheckInputImages, NLPSequenceRatio, TreeDepth
2. Customizable (BYO - Bring Your Own): Users can create and add their own custom debugging
rules.

3. Easy Integration: The entry point is 'mnist.py' and it works with SageMaker's built-in algorithms
(1P SM algos), suggesting easy integration with existing SageMaker workflows.

4. No Code Changes Required: The "No Change Needed" text implies that adding debugging
capabilities doesn't require modifying the existing model code.

5. Visualization: The debugger provides visualization capabilities, such as plots of tensor or weight distributions.

6. Real-time Monitoring: The variety of rules suggests that the debugger can monitor various
aspects of model training in real-time, helping to identify issues as they occur.
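To illustrate, built-in Debugger rules can be attached to an estimator without changing the training script itself; the following sketch reuses the MXNet estimator pattern from earlier, with placeholder entry point, role, and S3 paths:

from sagemaker.debugger import DebuggerHookConfig, Rule, rule_configs
from sagemaker.mxnet import MXNet

# The hook records tensors to S3; the rules are evaluated against them during training
estimator = MXNet(
    entry_point="mnist.py",
    role=role,
    instance_type="ml.m5.xlarge",
    instance_count=1,
    framework_version="1.9.0",
    py_version="py38",
    rules=[
        Rule.sagemaker(rule_configs.vanishing_gradient()),
        Rule.sagemaker(rule_configs.loss_not_decreasing()),
        Rule.sagemaker(rule_configs.overfit()),
    ],
    debugger_hook_config=DebuggerHookConfig(s3_output_path="s3://my-bucket/debugger/"),
)
estimator.fit({"train": "s3://my-bucket/train/"})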

X.1.7 Deployment using SageMaker

Blue/Green Deployment with Linear Traffic Shifting
• Description: Gradually shift traffic from the old version (blue) to the new version (green) over time
• When to use: When you need fine-grained control over the traffic shift; for critical applications requiring minimal risk; when you have the resources to run two full environments simultaneously

Canary Deployment
• Description: Release a new version to a small subset of users before rolling it out to the entire infrastructure
• When to use: When you want to test in production with real users; for early detection of issues before full deployment; when you have a diverse user base

A/B Testing
• Description: Run two versions simultaneously and compare their performance based on metrics
• When to use: When you want to test specific features or changes; when you need to optimize based on user behavior or business metrics

Rolling Deployment
• Description: Gradually replace instances of the old version with the new version
• When to use: When you have limited resources and can't run two full environments; for applications that can handle mixed versions; when you need to minimize downtime
