AWS Training Notes - Summary
Contents
AWS Enhanced Prep plan
AWS ML Engineer Associate Curriculum Overview
Domain 1: Data Processing
1.1 Collect, Ingest, and Store Data
1.1.1 COLLECT DATA
1.1.2 STORE DATA
1.1.3 Data Ingestion
1.1.4 Summary
1.2 Transform Data (Data Cleaning, Categorical encoding, Feature Engineering)
1.2.1 Data Cleaning
1.2.2 Categorical encoding
1.2.3 Feature Engineering
X. AWS Tools for Data Transformation
X.1. Data Labeling with AWS
X.2. Data Ingestion with AWS
X.3. Data Transformation with AWS
1.3 Validate Data and Prepare for Modeling
1.3.1 VALIDATE DATA
1.3.2 PREPARE FOR MODELLING
Domain 2: Data Transformation
2.1 Choose a modelling approach
2.1.1 AWS Model Approaches
2.1.1 SageMaker Offerings
2.1.1 SageMaker Model types
2.1.3 SageMaker AutoML
2.1.3 SageMaker JumpStart
2.1.5 Bedrock
2.2 Train Models
2.2.1 Model Training Concepts
2.2.2 Compute Environment
2.2.3 Train a model
2.3 Refine Models
2.3.1 Evaluating Model Performance
2.3.2 Model Fit (Overfitting and Underfitting)
2.3.3 Hyperparameter Tuning
2.3.4 Managing Model Size
2.3.5 Refining Pre-trained models
2.3.6 Model Versioning
2.4 Analyze Model Performance
2.4.1 Model Evaluation
Domain 3: Select a deployment infrastructure
3.1 Select a Deployment Infrastructure
3.1.1 Model building & Deployment Infra
3.1.2 Inference Infrastructure
3.2 Create and Script Infrastructure
3.2.1 Methods for Provisioning Resources
3.2.2 Deploying and Hosting Models
3.3 Automate Deployment
3.3.1 Introduction to DevOps
3.3.2 CI/CD: Applying DevOps to MLOps
3.3.3 AWS Software Release Processes
3.3.4 Retraining models
Domain 4: Monitor Model
4.1 Monitor Model Performance and Data Quality
4.1.1 Monitoring Machine Learning Solutions
4.1.2 Remediating Problems Identified by Monitoring
4.2 Monitor and Optimize Infrastructure and Costs
4.2.1 Monitor Infrastructure
4.2.2 Optimize Infrastructure
4.2.3 Optimize Costs
4.3 Secure AWS ML Resources
4.3.1 Securing ML Resources
4.3.2 SageMaker Compliance & Governance
4.3.3 Security Best Practices for CI/CD Pipelines
4.3.4 Implement Security & Compliance w/ Monitoring, Logging and Auditing
Domain X: Misc
X.1 SageMaker Deep Dive
X.1.1 Fully Managed Notebook Instances with Amazon SageMaker
X.1.2 SageMaker Built-in Algorithms
X.1.3 SageMaker Training types
X.1.4 Train Your ML Models with Amazon SageMaker
X.1.5 Tuning Your ML Models with Amazon SageMaker
X.1.6 Add Debugger to Training Jobs in Amazon SageMaker
X.1.7 Deployment using SageMaker
• Canvas
• Pipeline
• Model/Model Registry
• Jumpstart
SageMaker Documentation
• Feature Store
• AutoML
• Studio
• Jupyter notebook
Domain 1: Data Processing
• Accurate
Best practice: When building an ML model, it's important to feed it high-quality data that
accurately reflects the real world. For example, if 20 percent of customers typically cancel
memberships after a year, the data should represent that churn rate. Otherwise, the model
could falsely predict that significantly more or fewer customers will cancel.
Watch for: If your data doesn't actually reflect the real-world scenarios that you want your
model to handle, it will be difficult to identify meaningful patterns and make accurate
predictions.
• Relevant
Best practice: Data should contain relevant attributes that expose patterns related to what
you want to predict, such as membership duration for predicting cancellation rate.
Watch for: If irrelevant information is mixed in with useful data, it can impact the model's
ability to focus on what really matters. For example, a list of customer emails in a dataset
that's supposed to predict membership cancellation can negatively impact the model's
performance.
• Feature Rich
Best practice: Data should include a complete set of features that can help the model
learn underlying patterns. You can identify additional trends or patterns to increase
accuracy by including as much relevant data as possible.
Watch for: Data that has limited features reduces the ability of the ML algorithm to
accurately predict customer churn. For example, if the data consists of a small set of
customer details and omits important data, like demographic information, it will lose
accuracy and miss opportunities for detecting patterns in cancellation rate.
• Consistent
Best practice: Data must be consistent when it comes to attributes, such as features and
formatting. Consistent data provides more accurate and reliable results.
Watch for: If the datasets come from various data sources that contain different formatting
methods or metadata, the inconsistencies will impact the algorithm's ability to effectively
process the data. The algorithm will be less accurate with the inconsistent data.
Types of Data
Text
Text data, such as documents and website content, is converted to numbers for use in ML models, especially for
natural language processing (NLP) tasks like sentiment analysis. Models use this numerical representation of
text to analyze the data.
Tabular
Tabular data refers to information that is organized into a table structure with rows and columns, such as the data
in spreadsheets and databases. Tabular data is ideal for linear regression models.
Time series
Time-series data is collected over time with an inherent ordering that is associated with data points. It can be
associated with sensor, weather, or financial data, such as stock prices. It is frequently used to detect trends. For
instance, you might analyze and forecast changes using ML models to make predictions based on historical data
patterns.
Image
Image data refers to the actual pixel values that make up a digital image. It is the raw data that represents the
colors and intensities of each pixel in the image. Image data, like data from photos, videos, and medical scans, is
frequently used in machine learning for object recognition, autonomous driving, and image classification.
Formatting data
• Structured
• Unstructured
• Semi-structured
Data formats and file types
1. Row-based data format
o common in relational databases and spreadsheets.
o It shows the relationships between features
• CSV
Comma-separated values (CSV) files are lightweight, space-efficient text files that represent tabular
data. Each line is a row of data, and the column values are separated by commas. The simple CSV
format can store different data types like text and numbers, which makes it often used for ML data.
However, the simplicity of CSV comes at the cost of performance and efficiency compared to columnar
data formats that are more optimized for analytics.
• Avro RecordIO
Avro RecordIO is a row-based data storage format that stores records sequentially. This sequential
storage benefits ML workloads that need to iterate over the full dataset multiple times during model
training. Additionally, Avro RecordIO defines a schema that structures the data. This schema improves
data processing speeds and provides better data management compared to schema-less formats.
2. Columnar data format
• Parquet
Parquet is a columnar storage format typically used in analytics and data warehouse workloads that
involve large data sets. ML workloads benefit from columnar storage because data can be compressed,
which improves both storage space and performance.
• ORC
Optimized row columnar (ORC) is a columnar data format similar to Parquet. ORC is typically used in big
data workloads, such as Apache Hive and Spark. With the columnar format, you can efficiently
compress data and improve performance. These performance benefits make ORC a widely chosen data
format for ML workloads.
3. Object-notation data
o Object notation fits non-tabular, hierarchical data, such as graphs or textual data.
o Object-notation data is structured into hierarchical objects with features and key-value pairs rather
than rows and columns.
• JSON
JavaScript Object Notation (JSON) is a document-based data format that is both human and machine
readable. ML models can learn from JSON because it has a flexible data structure. The data is compact,
hierarchical, and easy to parse, which makes it suitable for many ML workloads.
An object is data defined by key-value pairs and enclosed in braces {}. Each value can be a string,
number, Boolean, array, object, or null. An array is a collection of values enclosed in square
brackets [ ] that are separated by commas; an array can contain multiple objects, as in the following
example.
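A minimal illustration of a JSON object and an array of objects (the field names are hypothetical):

    {"customer_id": 101, "churned": false, "tenure_months": 14}

    [
      {"customer_id": 101, "churned": false},
      {"customer_id": 102, "churned": true}
    ]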
• JSONL
JavaScript Object Notation Lines (JSONL) is also called newline-delimited JSON. It is a format for
encoding JSON objects that are separated by new lines instead of being nested. Each JSON object is
written on its own line, such as in the following example.
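For illustration (same hypothetical fields), the equivalent JSONL file places one object per line:

    {"customer_id": 101, "churned": false}
    {"customer_id": 102, "churned": true}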
JSONL improves efficiency because individual objects can be processed without loading a larger JSON
array. This improved efficiency when parsing objects results in better handling of large datasets for ML
workloads. Additionally, JSONL structure can map to columnar formats like Parquet, which provides the
additional benefits of those file types.
1.1.4 Graphs for data visualization
Categorical data
Numerical Data
1. Amazon S3
Features: S3 serves as a central data lake for ingesting, extracting, and transforming data to and from other
AWS services used for processing tasks. These tasks are an integral part of most ML workloads. The ability to
store and retrieve data from anywhere makes Amazon S3 a key component in workflows requiring scalable,
durable, and secure data storage and management.
Considerations: S3 provides scalability, durability, and low cost, but it has higher latency compared to local
storage. For latency-sensitive workloads, S3 might not be optimal. When deciding if S3 meets your needs,
weigh its benefits against potential higher latency. With caching and proper architecture, many applications
achieve excellent performance with S3, but you must account for its network-based nature.
Use cases
2. Amazon EBS
Features: Amazon EBS is well-suited for databases, web applications, analytics, and ML workloads. The
service integrates with Amazon SageMaker as a core component for ML model training and deployment. By
attaching EBS volumes directly to Amazon EC2 instances, you can optimize storage for ML and other data-
intensive workloads.
Considerations: EBS separates storage from EC2 instances, requiring more planning to allocate and scale
volumes across instances. Instance stores simplify storage by tying storage directly to the EC2 instance
lifecycle. This helps to avoid separate volume management. Although EBS offers flexibility, instance stores
provide more streamlined, intrinsic storage management than EBS.
Use cases
• High-performance storage
EBS provides high-performance storage for ML applications requiring fast access to large datasets. EBS
offers volumes with high IOPS for quick data reads and writes. The high throughput and IOPS accelerate
ML workflows and applications.
• Host pre-trained models
With EBS, you can upload, store, and access pre-trained ML models to generate real-time predictions
without setting up separate hosting infrastructure.
3. EFS
Features:
• The service is designed to grow and shrink automatically as files are added or removed, so performance
remains high even as file system usage changes.
• EFS uses the NFSv4 networking protocol to allow compute instances access to the file system across a
standard file system interface. You can conveniently migrate existing applications relying upon on-
premises NFS servers to Amazon EFS without any code changes.
Considerations: EFS has higher pricing, but offers streamlined scaling of shared file systems. EBS offers
lower costs, but there are potential performance limitations based on workload. Consider if the higher EFS
costs and potential performance variability are acceptable trade-offs compared to potentially lower EBS
costs but workload-dependent performance.
Use cases
• Concurrent access
EFS allows multiple EC2 instances to access the same datasets simultaneously. This concurrent access
makes Amazon EFS well-suited for ML workflows that require shared datasets across multiple compute
instances.
• Shared datasets
EFS provides a scalable, shared file system in the cloud that eliminates the need for you to copy large
datasets to each compute instance. Multiple instances can access data, such as ML libraries,
frameworks, and models, simultaneously without contention. This feature contributes to faster model
training and deployment of ML applications.
4. Amazon FSx
Features:
• Amazon FSx offers a rich set of features focused on reliability, security, and scalability to support ML,
analytics, and high-performance computing applications.
• The service delivers millions of IOPS with sub-millisecond latency so you can build high-performance
applications that require a scalable and durable file system.
Considerations: When using Amazon FSx for ML workloads, consider potential tradeoffs. Certain file
system types and workloads can increase complexity and management needs. Tightly coupling the ML
workflow to a specific file system also risks vendor lock-in, limiting future flexibility.
Use cases
• Distributed architecture
Lustre's distributed architecture provides highly parallel and scalable data access, making it ideal for
hosting large, high-throughput datasets used for ML model training. By managing infrastructure
operations, including backups, scaling, high availability, and security, you can focus on your data and
applications rather than infrastructure management.
Model output Storage Options
1. Training Workloads
Training workloads require high performance and frequent random I/O access to data.
• EBS volumes are well-suited for providing the random IOPS that training workloads need.
Additionally, EC2 instance store volumes offer extremely low-latency data access. This is
because data is stored directly on the instances themselves rather than on network-attached
volumes.
2. Inference Workloads
Need fast response times for delivering predictions, but usually don't require high I/O performance, except
for real-time inference cases.
• EBS gp3 volumes or EFS storage options are well-suited for meeting these needs.
• For increased low-latency demands, upgrading to EBS io2 volumes can provide improved low-
latency capabilities.
3. Real-time and streaming workloads
• EFS file systems allow low latency and concurrent data access for real-time and streaming
workloads. By sharing the same dataset across multiple EC2 instances, EFS provides high
throughput access that meets the needs of applications requiring real-time data sharing.
4. Dataset storage
• S3 can be used for storing large datasets that do not need quick access, such as pretrained ML
models, or data that is static or meant for archival purposes.
Data Access Patterns
There are three common data access patterns in ML: copy and load, sequential streaming, and
randomized access.
• Copy and load: Data is copied from S3 to a training instance backed by EBS.
• Sequential streaming: Data is streamed to instances as batches or individual records, typically from
S3 to instances backed by EBS volumes.
• Randomized access: Data is randomly accessed, such as with a shared file system data store, like
FSx and EFS.
Cost
Cost comparison
• S3 has the lowest cost for each gigabyte of storage based on storage classes. Storage classes are priced
for each gigabyte, frequency of access, durability levels, and for each request.
• EBS has network attached storage, which is more expensive per gigabyte than Amazon S3. However, it
provides lower latency, storage snapshots, and additional performance features that might be useful for
ML workloads.
• EFS is a managed file service with increased costs that can link multiple instances to a shared dataset.
Cost structure is designed around read and write access and the number of gigabytes used, with different
storage tiers available.
• FSx pricing depends on the file system used. General price structure is around storage type used for
each gigabyte, throughput capacity provisioned, and requests.
• AWS Cost Explorer – See patterns in AWS spending over time, project future costs, identify areas that
need further inquiry, observe Reserved Instance utilization, observe Reserved Instance coverage, and
receive Reserved Instance recommendations.
• AWS Trusted Advisor – Get real-time identification of potential areas for optimization.
• AWS Budgets – Set custom budgets that trigger alerts when cost or usage exceed (or are forecasted to
exceed) a budgeted amount. Budgets can be set based on tags and accounts as well as resource types.
• CloudWatch – Collect and track metrics, monitor log files, set alarms, and automatically react to
changes in AWS resources.
• AWS CloudTrail – Log, continuously monitor, and retain account activity related to actions across AWS
infrastructure at low cost.
• S3 Analytics – Automated analysis and visualization of S3 storage patterns to help you decide when to
shift data to a different storage class.
• AWS Cost and Usage Report – Granular raw data files detailing your hourly AWS usage across accounts
used for Do-It-Yourself (DIY) analysis (e.g., determining which S3 bucket is driving data transfer spend).
The AWS Cost and Usage Report has dynamic columns that populate depending on the services you
use.
1.1.3 Data Ingestion
Realtime Ingestion - streaming services
Amazon Kinesis vs MSK vs Firehose
• Kinesis Data Streams is primarily used for ingesting and processing data.
• Firehose provides a streamlined method of streaming data to data storage locations.
• Amazon MSK (Managed Streaming for Apache Kafka) provides ingestion and consumption of
streaming data using Apache Kafka in real time for analysis.
• S3 Transfer Acceleration
Amazon S3 Transfer Acceleration uses CloudFront edge locations to accelerate large data transfers to and from
S3. These transfers can help speed up data collection for ML workloads that require moving large datasets. S3
Transfer Acceleration overcomes bottlenecks like internet bandwidth and distance that can limit transfer speeds
when working with large amounts of data.
• DMS
AWS Database Migration Service (AWS DMS) facilitates database migration between databases or to Amazon S3 by
extracting data in various formats, such as SQL, JSON, CSV, and XML. Migrations can run on schedules or in
response to events for frequent data extraction. With AWS DMS, you can migrate databases between many
sources and targets.
• AWS DataSync
With AWS DataSync, you can efficiently transfer data between on-premises systems or AWS services by
extracting data from sources, such as data file systems or network-attached storage. You can then upload data
to AWS services like Amazon S3, Amazon EFS, Amazon FSx, or Amazon RDS on a scheduled or event-driven
basis. DataSync facilitates moving large datasets to the cloud while reducing network costs and data transfer
times.
• AWS Snowball
AWS Snowball is a physical device service used to transfer large amounts of data into and out of AWS when
network transfers are infeasible. Snowball devices efficiently and cost-effectively move terabytes or petabytes of
data into S3
Storage
• S3
With S3 serving as a highly scalable object storage service, data used for ML projects can be spread out across
storage locations. Storage can be extracted and transferred to and from S3 with other AWS services. These other
services include Amazon S3 Transfer Acceleration, AWS CLI, AWS SDK, AWS Snowball, AWS DataSync, AWS DMS,
AWS Glue, and AWS Lambda.
• EBS
EBS volumes provide storage for ML data. This data can be copied to services such as Amazon S3 or Amazon
SageMaker, using tools like the AWS Management Console, AWS CLI, or AWS SDKs to manage volumes. EBS
volumes store the necessary data that is then extracted and moved to other AWS services to meet ML
requirements.
• EFS
Amazon EFS allows creating shared file systems that can be accessed from multiple EC2 instances, so you can
share data across compute resources. You can extract data from EFS using AWS CLI, AWS SDKs, or with services
like AWS Transfer Family and DataSync that facilitate data transfers. Amazon EFS provides the capability to share
data from Amazon EC2 instances while also providing tools to conveniently move the data to other services.
• RDS
Amazon Relational Database Service (Amazon RDS) provides relational databases that can be accessed through
AWS services like AWS DMS, the AWS CLI, and AWS SDKs to extract and transfer data. Amazon RDS is a common
source for extracting relational data because it offers managed database instances that streamline data access.
• DynamoDB
Amazon DynamoDB is a fully managed NoSQL database service provided by AWS. You can extract data using
various AWS services like AWS DMS, AWS Glue, and AWS Lambda. You can use programmatic tools, such as the
AWS CLI and AWS SDK, to process and analyze the data outside of DynamoDB. Data extraction allows
DynamoDB data to be integrated with other platforms for further processing.
Data Merging
1. AWS Glue is a fully managed ETL service that you can use to prepare data for analytics and machine
learning workflows.
Best for: Glue works well for ETL workloads that move data from varied sources into data lakes like Amazon S3.
Steps
a) Identify data sources: AWS Glue can be used to combine or transform large datasets using
Apache Spark. It can efficiently process large structured and unstructured datasets in parallel. AWS
Glue integrates with services like S3, Redshift, Athena, or other JDBC compliant data stores.
b) Create an AWS Glue crawler: AWS Glue crawlers scan data and populate the AWS Glue Data Catalog.
c) Generate ETL scripts and define jobs: Jobs run the ETL scripts to extract, transform, and load the data,
which can start on demand or can be scheduled to run at specific intervals.
d) Clean and transformed data is written back to S3 or to another data store, such as Amazon Redshift.
2. Amazon EMR
Amazon EMR is a service for processing and analyzing large datasets using open-source tools of big data
analysis, such as Apache Spark and Apache Hadoop. It applies ETL methodologies to ensure the product is
flexible and scalable. Amazon EMR integrates data from multiple sources into one refined platform, making
the transformation of data cost-effective and quick.
Best for: Amazon EMR is best suited for processing huge datasets in the petabyte range.
STEPS
o Ingest streaming sources: ETL is done using Apache Spark Streaming APIs. This makes it possible to
source data in real time from places such as Apache Kafka and Amazon Kinesis. Data is received and
combined in real time.
o Distribute across EMR cluster: EMR clusters are made up of various nodes, each of which are
configured specifically to handle parallel tasks and data processing.
o Generate ETL scripts and define jobs: At the end of the data processing lifecycle, you can use the
Python, Scala, or SQL development environments. These environments give you powerful, flexible
methods for scripting data workflows and for making data filtering, transformation, and aggregation
more convenient.
o Output to Amazon S3: After processing and transforming the data, the results are outputted in an
Amazon S3 bucket.
Comparison (AWS Glue vs. Amazon EMR vs. SageMaker Data Wrangler)
• Purpose: Glue is a serverless ETL service; EMR is a big data processing platform; Data Wrangler is
ML-focused data preparation.
• Ideal volume: Glue suits medium to large datasets; EMR is ideal for very large datasets; Data Wrangler
suits small to medium datasets.
• ML integration: Glue can prepare data for ML but is not specialized; EMR can run ML frameworks but
requires setup; Data Wrangler is tightly integrated with the SageMaker ML workflow.
With EFS, FSx, and S3, you can seamlessly scale storage up or down in size.
High latency, insufficient IOPS, or slow data transfer times significantly impact the performance of storage
systems and data ingestion. These issues can arise from network bottlenecks, undersized provisioned
storage volumes, or inefficient ingestion methods.
Consider optimizing network configurations or use AWS services with improved performance
capabilities, such as provisioned IOPS volumes for EBS. Using techniques, such as compression or
batching, can also lead to improved data-transfer efficiency.
Hotspots or bottlenecks in storage systems can be caused by uneven distribution of data access
resulting in performance degradation or data availability issues.
AWS services, such as S3 and EFS, automatically distribute data and provide load balancing
capabilities. Data partitioning is another strategy that can be implemented, which distributes data
across multiple storage resources, reducing the likelihood of hotspots.
Ingestion modifications
DO THE ASSESSMENT!!
1.1.4 Summary
Data Visualization: see above.
Storage services (features vs. considerations)
• Amazon S3: scalable, durable, central data lake. Consideration: higher latency compared to local storage.
• Amazon EBS: high IOPS, suited for databases and ML training. Consideration: limited to a single EC2
instance and requires volume management.
• Amazon FSx: lowest latency, high-performance computing file system. Consideration: potential vendor
lock-in and complex management for some file system types.
Data access patterns and workloads
• Training workloads: EBS (io2), instance store. High IOPS, low-latency random access.
• Inference workloads (standard): EBS (gp3), EFS. Balance of performance and cost.
• Inference workloads (low latency): EBS (io2). Higher IOPS for faster response times.
• Distributed processing: EFS, FSx. Concurrent access from multiple instances.
• Randomized access: EFS, FSx. Random data access, shared file system.
1.2 Transform Data (Data Cleaning, Categorical encoding, Feature Engineering)
Remember
• Data cleaning focuses on handling issues like missing data and outliers.
• Categorical encoding - Used to convert values into numeric representations.
• Feature engineering focuses on modifying or creating new features from the data, rather than
encoding features.
Mean: The mean is the average of all the values in the dataset. Mean can be a useful method for
understanding your data when the data is symmetrical. For example, a symmetrical set of data that
contains ages of respondents might reflect that both the mean and median of the dataset is 50 years old.
Median: The median is the value in the dataset that divides the values into two equal halves. If your data
is skewed or contains outliers, the median tends to provide the better metric for understanding your data
as it relates to central tendency. For example, a dataset that contains income level might contain outliers.
The mean might skew toward higher or lower values, while the median would provide a more accurate
picture of the data's central tendency.
Artificial outlier - This data is in the correct format for the Age column, but an entry of 154 is unrealistic. In this
case, it makes the most sense to delete this entry from your data.
Natural outlier - Although three million dollars a year is drastically more than the rest of the salaries in our
dataset, this number is still plausible. In this case, you can choose to transform the outlier and reduce the outlier's
influence on the overall dataset. You will learn about that type of data transformation later in this course.
Incomplete and Missing Data
There are some key steps you can take to address incomplete and missing values in your dataset.
There are certain Python libraries, such as Pandas, that you can use to check for missing values.
Before you can determine how to treat the missing values, it’s important to investigate which mechanisms
caused the missing values. The following are three common types of missing data:
• Missing at Random (MAR): The probability that a data point is missing depends only on the observed data,
not the missing data.
Example: In a dataset of student test scores, scores are missing for some students who were
absent that day. Absence is related to performance.
• Missing Completely at Random (MCAR): The probability that a data point is missing does not depend on
the observed or unobserved data.
Example: In an employee survey, some people forgot to answer the question about their number of
siblings. Their missing sibling data does not depend on any values.
• Missing Not at Random (MNAR): The probability that a data point is missing depends on the missing data
itself.
Example: In a financial audit, companies with accounting irregularities are less likely to provide
complete records. The missing data depends on the sensitive information being withheld.
Depending on what is causing your missing values, you will decide to either drop the missing values or impute
data into your dataset.
One of the most straightforward ways to deal with missing values is to remove the rows of data with missing
values. You can accomplish this by using a Pandas function. Dropping rows or columns removes the missing
values from the dataset. However, the risk of dropping rows and columns is significant.
Issues
o If you drop hundreds of rows or columns of data, removing that much data might bias your
model predictions.
o If you drop too much data, you might not have enough features to feed the model.
• Impute values
Missing values might be related to new features that haven't been included in your dataset yet. After you include
more data, those missing values might be highly correlated with the new feature. In this case, you would deal with
missing values by adding more new features to the dataset. If you determine the values are missing at random,
data imputation, or filling in substitute values in your dataset, is most likely the best option.
One common way to impute missing values is to replace the value with the mean, median, or most frequent
value. You would select the most frequent value for categorical variables, and the mean or median for numerical
variables. Choosing between the mean, median, or most frequent value depends on your business problem and
data collection procedures.
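A minimal sketch of these options with Pandas (the file and column names are hypothetical):

    import pandas as pd

    df = pd.read_csv("customers.csv")          # hypothetical dataset
    print(df.isnull().sum())                   # count missing values per column

    # Option 1: drop rows that contain any missing values
    df_dropped = df.dropna()

    # Option 2: impute numerical and categorical columns separately
    df["income"] = df["income"].fillna(df["income"].median())   # numerical: median
    df["state"] = df["state"].fillna(df["state"].mode()[0])     # categorical: most frequent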
1.2.2 Categorical encoding
Categorical encoding is the process of manipulating text-based variables into number-based variables.
When to encode
Not all categorical variables need to be encoded. Depending on your use case, different ML algorithms might not
require you to encode your variables.
For instance, a random forest model can handle categorical features directly. You would not need to encode
values, such as teal, green, and blue, as numeric values.
• Nominal, or multi-categorical, values are category values where order does not matter. A set of data that
contains different geographic locations, such as states or cities, might be considered multi-categorical.
• Ordinal, or ordered, values are category values where the order does matter, like what size of drink you order
at a coffee shop: small, medium, or large.
Encode Techniques
• One-hot encoding: creates a new binary feature for each unique category value. Rather than assigning
a different value to each category as label or binary encoding does, one-hot encoding sets the feature to 1 or 0
depending on whether the category applies to a given data point.
When to use which: One-hot encoding might not be the best technique if there are a lot of categories,
because the additional columns can grow your dataset so much that it becomes difficult to analyze
efficiently; in that case, label encoding keeps the dataset compact.
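A minimal sketch of one-hot encoding with Pandas (the column name is hypothetical):

    import pandas as pd

    df = pd.DataFrame({"color": ["teal", "green", "blue", "green"]})

    # One-hot encoding: one binary column per unique category value
    encoded = pd.get_dummies(df, columns=["color"])
    print(encoded)   # columns: color_blue, color_green, color_teal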
1.2.3 Feature Engineering
Feature engineering is a method for transforming raw data into more informative features that help models better
capture underlying relationships in the data.
Feature Engineering by data type (numeric, text, image, and sound data types)
We only cover numeric and text here
Numeric feature engineering involves transforming numeric values for the model and is often accomplished by
grouping different numeric values together.
Text feature engineering involves transforming text for the model and is often accomplished by splitting the text
into smaller pieces.
1. Numerical Feature Engineering
• Purpose: Aims to transform the numeric values so that all values are on the same scale.
• Why: This method helps you to take large numbers and scale them down, so that the ML algorithm can
achieve quicker computations and avoid skewed results.
a) Feature Scaling:
• Normalization: rescales the values, often to between 0 and 1.
• Standardization: similar, but rescales values to a mean of 0 and a standard deviation of 1.
When to use: standardization reduces the negative effect of outliers.
b) Binning:
The data is divided into these bins based on value ranges, thus transforming a numeric feature into a
categorical one.
When: numerical data when the exact difference in numbers is not important, but the general range is a
factor.
c) Log transformation:
The most common logarithmic functions have bases of 10 or e, where e is approximately equal to
2.71828. Logarithmic functions are the inverse of exponential functions and are useful for modeling
phenomena where growth decreases over time, like population growth or decay.
When: for skewed numeric data or data with multiple outliers. Essentially, a log transform compresses
large values into a smaller range.
For example, the log of $10,000 would be around 4 and the log of $10,000,000 would be around 7. Using
this method, the outliers are brought much closer to the normal values in the remainder of the dataset.
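A minimal sketch of these numeric transformations (assuming scikit-learn and NumPy are available; the
column name and values are made up):

    import numpy as np
    import pandas as pd
    from sklearn.preprocessing import MinMaxScaler, StandardScaler

    df = pd.DataFrame({"salary": [45000, 52000, 61000, 3000000]})

    # Normalization (rescale to the 0-1 range) and standardization (mean 0, std 1)
    df["salary_norm"] = MinMaxScaler().fit_transform(df[["salary"]]).ravel()
    df["salary_std"] = StandardScaler().fit_transform(df[["salary"]]).ravel()

    # Log transformation compresses the outlier toward the rest of the values
    df["salary_log10"] = np.log10(df["salary"])   # log10(3,000,000) is about 6.5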
2. Text Engineering
a) Bag of Words: The bag-of-words model does not keep track of the sequence of words, but counts how
many times each word appears in each observation. Bag-of-words uses tokenization to create a statistical
representation of the text (a minimal sketch follows this list).
c) Temporal data: Temporal or time series data is data involving time, like a series of dates. Temporal data
can come in many formats, and is often a mix of numeric and text data.
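As referenced above, a minimal bag-of-words sketch with scikit-learn (the example sentences are made up):

    from sklearn.feature_extraction.text import CountVectorizer

    docs = ["the membership was cancelled", "the customer renewed the membership"]

    vectorizer = CountVectorizer()            # tokenizes and counts word occurrences
    counts = vectorizer.fit_transform(docs)   # sparse matrix: one row per document

    print(vectorizer.get_feature_names_out())
    print(counts.toarray())                   # word counts; word order is ignored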
Principal component – Size: The first component accounts for the physical attributes of the home,
including square footage, bedrooms, and bathrooms.
X.1. Data Labeling with AWS
Purpose:
o Image annotation:
o Text annotation:
o Data collection:
o Data cleanup:
o Transcription:
Uses Mechanical Turk and other data processing methods to streamline the data preparation process
even further.
Purpose
o Image annotation:
o Text annotation:
o Object Detection
o Named Entity Recognition
Mechanical Turk
On-demand access to workers, lower costs, fast turnaround times, task flexibility, and
quality control capabilities.
SageMaker Ground Truth
Higher quality, object detection or NER tasks, public sourcing
SageMaker Ground Truth Plus
Production labeling workflows, sensitive data, complex tasks, and custom interfaces
X.2. Data Ingestion with AWS
1. Data Wrangler
Purpose: visual, code-free tool for data preprocessing and feature engineering
Steps
o Clean data:
o Feature engineering: Combine columns and apply formulas, etc.
o Fixing formatting issues: Missing headers, encoding problems, etc.
o Reducing data size: For large datasets
o Automating transformations:
2. AWS Glue
Purpose:
• AWS Glue auto-generates Python code to handle issues like distributed processing, scheduling, and
integration with data sources.
• AWS Glue DataBrew is a visual data preparation tool for cleaning, shaping, and normalizing datasets.
Use cases
o Automated ETL pipelines
o Data integration and ingestion:
o Data cleansing and standardization
o Feature engineering:
o Final pretraining data preparation
Steps
Purpose:
Use Cases
Steps
When to use:
• Data Wrangler: Exploratory data analysis, quick transformations, ML data prep in SageMaker
3. For Streaming data
Class imbalance (CI): occurs when the distribution of classes in the training data is skewed (one class is
significantly less represented than the others).
• If CI is positive, the advantaged group is relatively overrepresented in this dataset.
• If CI is negative, the advantaged group is relatively underrepresented in this dataset.
Difference in proportion of labels (DPL): compares the distribution of labels in the data.
• If DPL is positive, one class has a significantly higher proportion.
• If DPL is negative, one class has a significantly lower proportion.
• Create a bias report using the configuration for pre-training and post-training analysis
• Assess the bias report by considering the class imbalance (CI) and difference in proportion of labels
(DPL).
1. Set up the bias report: To set up your bias report configuration, use BiasConfig to specify which
column contains the facet (the sensitive group, such as sex), which facet values are sensitive using
facet_values_or_threshold, and which outcomes are desirable using label_values_or_threshold.
2. Run the bias report: Create the bias report using the configuration for pre-training and post-training
analysis. This step takes approximately 15-20 minutes.
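A minimal sketch of this setup with the SageMaker Python SDK (the IAM role, S3 paths, and column names
are placeholders):

    from sagemaker import clarify

    role = "arn:aws:iam::111122223333:role/SageMakerClarifyRole"   # placeholder execution role

    bias_config = clarify.BiasConfig(
        label_values_or_threshold=[1],        # desirable outcome values
        facet_name="sex",                     # column containing the sensitive group
        facet_values_or_threshold=[0],        # sensitive facet value(s)
    )

    data_config = clarify.DataConfig(
        s3_data_input_path="s3://my-bucket/train.csv",
        s3_output_path="s3://my-bucket/bias-report",
        label="churn",
        dataset_type="text/csv",
    )

    processor = clarify.SageMakerClarifyProcessor(
        role=role, instance_count=1, instance_type="ml.m5.xlarge"
    )

    # Pre-training analysis computes metrics such as CI and DPL
    processor.run_pre_training_bias(
        data_config=data_config,
        data_bias_config=bias_config,
        methods=["CI", "DPL"],
    )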
Holdout validation (single train/test split)
• Best for: easy implementation; provides a quick estimate of model performance.
• Limitations: the performance estimate might have variance due to dependency on the specific examples
in the test set; not suitable for small datasets because it might lead to overfitting or underfitting.
Cross-validation
• Best for: uses the entire dataset for training and testing, maximizing data usage; reduces variance in
performance estimation by averaging results across multiple iterations.
• Limitations: computationally more expensive, especially for large datasets; might be sensitive to class
imbalances if not stratified properly.
• Example: K-fold cross-validation.
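A minimal K-fold cross-validation sketch with scikit-learn (the model and dataset are illustrative):

    from sklearn.datasets import load_breast_cancer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    X, y = load_breast_cancer(return_X_y=True)
    model = LogisticRegression(max_iter=5000)

    # 5-fold cross-validation: train on 4 folds, validate on the remaining fold, repeat 5 times
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    print(scores.mean(), scores.std())   # average and spread of the performance estimate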
Dataset shuffling
Benefits of data shuffling: Dataset shuffling plays a crucial role in mitigating biases that might arise from the
inherent structure of the data. By introducing randomness through shuffling, you can help the model be
exposed to a diverse range of examples during training.
Data augmentation works by creating new, realistic training examples that expand the model's
understanding of the data distribution. Dataset augmentation involves artificially expanding the size and
diversity of a dataset
• Image-based Augmentation
o Flipping, rotating, scaling, or shearing images
o Adding noise or applying color jittering
o Mixing or blending images to create new, synthetic examples
• Text-based Augmentation
o Replacing words with synonyms or antonyms
o Randomly inserting, deleting, or swapping words
o Paraphrasing or translating text to different languages
o Using pre-trained language models to generate new, contextually relevant text
• Time series Augmentation
o Warping or scaling the time axis
o Introducing noise or jitter to the signal
o Mixing or concatenating different time-series segments
o Using generative models to synthesize new time-series data
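A minimal sketch of two image augmentations using NumPy (the image array is synthetic):

    import numpy as np

    image = np.random.rand(32, 32, 3)                         # synthetic 32x32 RGB image

    flipped = np.fliplr(image)                                 # horizontal flip
    noisy = image + np.random.normal(0, 0.05, image.shape)     # add Gaussian noise
    noisy = np.clip(noisy, 0.0, 1.0)                           # keep pixel values in range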
Choice of IDEs
1. Supervised
• A deep learning (DL) framework. Currently, SageMaker supports RL in TensorFlow and Apache MXNet.
• An RL toolkit. An RL toolkit manages the interaction between the agent and the environment and
provides a wide selection of state-of-the-art RL algorithms. SageMaker supports the Intel Coach and Ray
RLlib toolkits. For information about Intel Coach, see https://nervanasystems.github.io/coach/. For
information about Ray RLlib, see https://ray.readthedocs.io/en/latest/rllib.html.
• An RL environment. You can use custom environments, open-source environments, or commercial
environments. For information, see RL Environments in Amazon SageMaker.
• Data analysis and processing: SageMaker Autopilot identifies your specific problem type, handles missing
values, normalizes your data, selects features, and prepares the data for model training.
• Model selection: SageMaker Autopilot explores a variety of algorithms. SageMaker Autopilot uses a cross-
validation resampling technique to generate metrics that evaluate the predictive quality of the algorithms
based on predefined objective metrics.
• Hyperparameter optimization: SageMaker Autopilot automates the search for optimal hyperparameter
configurations.
• Model training and evaluation: SageMaker Autopilot automates the process of training and evaluating
various model candidates.
o It splits the data into training and validation sets, and then it trains the selected model candidates
using the training data.
o Then it evaluates their performance on the unseen data of the validation set.
o Lastly, it ranks the optimized model candidates based on their performance and identifies the best
performing model.
• Model deployment: After SageMaker Autopilot has identified the best performing model, it provides the
option to deploy the model. It accomplishes this by automatically generating the model artifacts and the
endpoint that exposes an API. External applications can send data to the endpoint and receive the
corresponding predictions or inferences.
2.1.3 SageMaker JumpStart
SageMaker JumpStart is an ML hub with foundation models, built-in algorithms, and prebuilt ML solutions that you can
deploy with a few clicks.
Features
Foundation Models
With JumpStart, many foundation models are available.
Amazon SageMaker JumpStart provides developers and data science teams ready-to-start AI/ML models and
pipelines. SageMaker JumpStart is ready to be deployed and can be used as-is. For demand forecasting,
SageMaker JumpStart comes with a pre-trained, deep learning-based forecasting model, using Long- and Short-
Term Temporal Patterns with Deep Neural Networks (LSTNet).
The Amazon SageMaker JumpStart Graph-Based Credit Scoring solution constructs a corporate network from
SEC filings (long-form text data).
• Fraud detection
Detect fraud in financial transactions by training a graph convolutional network with the deep graph library and a
SageMaker XGBoost model.
• Computer vision
Amazon SageMaker JumpStart supports over 20 state-of-the-art, fine-tunable object detection models from
PyTorch hub and MxNet GluonCV. The models include YOLO-v3, FasterRCNN, and SSD, pre-trained on MS-
COCO and PASCAL VOC datasets.
Amazon SageMaker JumpStart also supports image feature vector extraction for over 52 state-of-the-art image
classification models, including ResNet, MobileNet, and EfficientNet, from TensorFlow Hub. You can use these
models to generate image feature vectors for your images. The generated feature vectors are representations of the images
in a high-dimensional Euclidean space. They can be used to compare images and identify similarities for image
search applications.
JumpStart provides solutions for you to uncover valuable insights and connections in business-critical
documents. Use cases include text classification, document summarization, handwriting recognition,
relationship extraction, question and answering, and filling in missing values in tabular records.
• Predictive maintenance
The AWS predictive maintenance solution for automotive fleets applies deep learning techniques to common
areas that drive vehicle failures, unplanned downtime, and repair costs.
• Churn prediction
After training this model using customer profile information, you can take that same profile information for any
arbitrary customer and pass it to the model. You can then have it predict whether that customer is going to churn
or not. Amazon SageMaker JumpStart uses a few algorithms to help with this. LightGBM, CatBoost,
TabTransformer, and AutoGluon-Tabular used on a churn prediction dataset are a few examples.
• Personalized recommendations
Amazon SageMaker JumpStart can perform cross-device entity linking for online advertising by training a graph
convolutional network with a deep graph library.
• Text summarization
You could use the model to summarize long documents with LangChain and Python. The Falcon LLM is a large
language model, trained by researchers at the Technology Innovation Institute (TII) on over 1 trillion tokens using
AWS. Falcon has many different variations, with its two main variants, Falcon 40B and Falcon 7B, comprising
40 billion and 7 billion parameters, respectively. Falcon has fine-tuned versions trained for specific tasks,
such as following instructions. Falcon performs well on a variety of tasks, including text summarization,
sentiment analysis, question answering, and conversing.
• Financial pricing
Many businesses dynamically adjust pricing on a regular basis to maximize their returns. Amazon SageMaker
JumpStart has solutions for price optimization, dynamic pricing, option pricing, or portfolio optimization use
cases. Estimate price elasticity using Double Machine Learning (ML) for causal inference and the Prophet
forecasting procedure. Use these estimates to optimize daily prices.
• Causal inference
Researchers can use machine learning models such as Bayesian networks to represent causal dependencies
and draw causal conclusions based on data.
2.1.5 Bedrock
Use cases
2.2 Train Models
2.2.1 Model Training Concepts
Minimizing loss:
Log-likelihood loss (log loss) is a loss function used for classification tasks, where the goal is to predict
whether an input belongs to one of two or more classes. For example, you might use logistic regression
with log loss to predict whether an email is spam.
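A minimal sketch of how binary log loss is computed (the labels and predicted probabilities are made up):

    import numpy as np

    y_true = np.array([1, 0, 1, 1])          # actual classes (e.g., spam = 1)
    y_prob = np.array([0.9, 0.2, 0.6, 0.8])  # model's predicted probability of class 1

    # Average negative log-likelihood: low when confident and correct, high when confident and wrong
    log_loss = -np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))
    print(log_loss)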
Optimizing - Reducing Loss function:
• Gradient descent: weights updated every epoch; slowest epoch calculation; smooth updates toward
the minima.
• Stochastic gradient descent (SGD): weights updated every datapoint; fast epoch calculation; noisy or
erratic updates toward the minima.
• Mini-batch gradient descent: weights updated every batch; slower epoch calculation than SGD; less
noisy or erratic updates toward the minima.
Gradient descent
As mentioned, gradient descent only updates weights after it's gone through all of the data, also
known as an epoch. Of the three variations covered here,
• gradient descent is the slowest to reach the minima as a result, but
• it also takes the fewest number of steps to reach the minima.
In stochastic gradient descent or SGD, you update your weights for each record you have in your
dataset.
For example, if you have 1000 data points in your dataset, SGD will update the parameters 1000 times.
With gradient descent, the parameters would be updated only once in every epoch.
• SGD leads to more parameter updates and, therefore, the model will get closer to the minima
more quickly.
• One drawback of SGD, however, is that it will oscillate in different directions, unlike gradient
descent, and hence takes a lot more steps.
Mini-batch gradient descent
A hybrid of gradient descent and SGD, this approach uses a smaller dataset or a batch of records, also
called batch size, to update your parameters.
• Mini-batch gradient descent updates more than gradient descent while having less erratic or
noisy updates as compared to SGD. The user-defined batch size helps you fit the smaller
dataset into memory. Having a smaller dataset helps the algorithms run on almost any average
computer that you might be using.
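A minimal mini-batch gradient descent sketch for a linear model (synthetic data; the learning rate and
batch size are arbitrary; batch_size equal to the dataset size gives plain gradient descent, 1 gives SGD):

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 3))                 # 1,000 records, 3 features
    y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=1000)

    w, lr, batch_size = np.zeros(3), 0.1, 32

    for epoch in range(10):
        idx = rng.permutation(len(X))              # shuffle each epoch
        for start in range(0, len(X), batch_size):
            b = idx[start:start + batch_size]
            grad = 2 * X[b].T @ (X[b] @ w - y[b]) / len(b)   # gradient of squared loss on the batch
            w -= lr * grad                                    # one parameter update per batch
    print(w)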
2.2.2 Compute Environment
AWS offers solutions for a variety of specific ML tasks, which lets you optimize for your particular use cases.
Model created
• Store in S3
• Package and distribute
• Register model (in registry)
Train a model
For built-in algorithms, the only inputs you need to provide are the
• training data
• hyperparameters
• compute resources.
Amazon SageMaker training options
When it comes to training environments, you have several to choose from:
• Create a training job using SageMaker console (see the Creating a Training Job Using the Amazon
SageMaker Console lesson for an example using this method).
o The low-level SageMaker APIs for the SDK for Python (Boto3) or the AWS CLI
• Pipe mode
What: SageMaker streams data directly from Amazon S3 to the container, without downloading the data
to the ML storage volume.
Pros: Improves training performance by reducing the time spent on data download.
• File mode
What: SageMaker downloads the training data from S3 to the provisioned ML storage volume, then
mounts the directory to the Docker volume for the training container.
Pros: In a distributed training setup, the training data is distributed uniformly across the cluster.
Cons: You must manually ensure the ML storage volume has sufficient capacity to accommodate the data
from Amazon S3.
• Fast File mode
What: SageMaker can stream data directly from S3 to the container with no code changes. Users can
author their training script to interact with these files as though they were stored on disk.
Pros: Fast File mode works best when the data is read sequentially.
Cons: Augmented manifest files are not supported. The startup time is lower when there are fewer files in
the S3 bucket provided.
1. Use your local laptop or desktop with the SageMaker Python SDK. You can get different
instance types, such as CPUs and GPUs, but are not required to use the managed notebook
instances.
3. Create an estimator object, specifying the:
a) training script
b) instance type
c) other configurations.
4. Call the fit method on the estimator to start the training job, passing in the training and
validation data channels.
5. SageMaker takes care of the rest. It pulls the image from Amazon Elastic Container Registry
(Amazon ECR) and loads it on the managed infrastructure.
6. Monitor the training job and retrieve the trained model artifacts once the job is complete.
Example
In this example, the PyTorch estimator is configured with the training script using the entry_point: train.py,
instance type ml.p3.2xlarge, and other settings. The fit method is called to launch the training job, passing in the
location of the training data.
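The code for this example is not included in the notes; a minimal sketch with the SageMaker Python SDK
might look like the following (the IAM role, S3 paths, and framework versions are placeholders):

    import sagemaker
    from sagemaker.pytorch import PyTorch

    session = sagemaker.Session()
    role = "arn:aws:iam::111122223333:role/SageMakerExecutionRole"   # placeholder execution role

    estimator = PyTorch(
        entry_point="train.py",              # custom training script (script mode)
        role=role,
        instance_count=1,
        instance_type="ml.p3.2xlarge",
        framework_version="1.13",
        py_version="py39",
        hyperparameters={"epochs": 10, "batch-size": 64},
        sagemaker_session=session,
    )

    # Launch the managed training job, pointing at the training data channel in S3
    estimator.fit({"training": "s3://my-bucket/ml-data/train/"})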
Reducing training time
Amazon SageMaker script mode provides the flexibility to develop custom training and inference code
while using industry-leading machine learning frameworks.
a) Early stopping:
Early stopping is a regularization technique that shuts down the training process for an ML model
when the model's performance on a validation set stops improving.
o Evaluating the objective metric after each epoch: During the training process, SageMaker evaluates the
specified objective metric (for example, accuracy, loss, F1-score) for each epoch or iteration of the
training job.
b) Distributed training
A. Data parallelism is the process of splitting the training set in mini-batches evenly distributed across
nodes. Thus, each node only trains the model on a fraction of the total dataset.
B. Model parallelism is the process of splitting a model up between multiple instances or nodes.
• If the model can fit in a single GPU's memory but your dataset is large, data parallelism is the recommended approach. It splits the training data across multiple GPUs or instances for faster processing and larger effective batch sizes.
• If the model is too large to fit in a single GPU's memory, model parallelism becomes necessary. It splits the model itself across multiple devices, enabling the training of models that would otherwise be intractable on a single GPU.
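A hedged sketch of the data-parallel case (instance type, count, and script name are illustrative; SageMaker's data parallel library requires supported multi-GPU instance types):

# Sketch: enabling SageMaker distributed data parallelism on an estimator.
from sagemaker.pytorch import PyTorch

estimator = PyTorch(
    entry_point="train.py",
    role=role,                                   # assumed defined
    framework_version="2.1",
    py_version="py310",
    instance_type="ml.p4d.24xlarge",             # multi-GPU instance
    instance_count=2,                            # mini-batches are split across nodes
    distribution={"smdistributed": {"dataparallel": {"enabled": True}}},
)
estimator.fit({"training": "s3://my-bucket/training-data/"})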
Building a deployable model package
Step 2: Write a script that will run in the container to load the model artifact. In this example, the script
is named inference.py. This script can include custom code for generating predictions, as well as input
and output processing. It can also override the default implementations provided by the pre-built
containers.
To install additional libraries at container startup, add a requirements.txt file that specifies the libraries
to be installed by using pip.
Step 3: Create a model package that bundles the model artifact and the code. This package should
adhere to a specific folder structure and be packaged as a tar archive, named model.tar.gz, with gzip
compression.
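A small sketch of what that bundle can look like and how to build it (the file names and layout follow a common convention for the pre-built framework containers; they are illustrative, not prescribed by these notes):

# Expected layout before packaging (illustrative):
#   model/
#   ├── model.pth            <- trained model artifact
#   └── code/
#       ├── inference.py     <- custom load/predict and input/output handling
#       └── requirements.txt <- extra libraries installed with pip at startup
import tarfile

with tarfile.open("model.tar.gz", "w:gz") as tar:
    # arcname keeps archive paths relative to the archive root
    tar.add("model/model.pth", arcname="model.pth")
    tar.add("model/code/inference.py", arcname="code/inference.py")
    tar.add("model/code/requirements.txt", arcname="code/requirements.txt")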
2.3 Refine Models
2.3.1 Evaluating Model Performance
Bias and Variance
a) What are these
Bias
• The model is too simple
• Incorrect modeling or feature engineering
• Inherited bias from the training dataset
Variance
• The model is too complex
• Too much irrelevant data in the training dataset
• The model was trained for too long on the training dataset
2.3.2 Model Fit (Overfitting and Underfitting)
1. Overfit/Underfit
• Overfit
Reasons:
o Training data too small
o Too much irrelevant data
o Excessive training time
o Overly complex architecture
• Underfit
Reasons:
a) Remediating Overfitting
• Early stopping: pauses the training process before the model learns the noise in the data.
• Pruning: aims to remove weights that don't contribute much to the training process.
• Regularization (see the sketch after this list):
o Dropout: randomly drops out (sets to 0) a number of neurons in each layer of the neural network during each epoch.
o L1 regularization: pushes the weights of less important features to zero.
o L2 regularization: results in smaller overall weight values (and stabilizes the weights) when there is high correlation between the input features.
• Data augmentation: perform data augmentation to increase the diversity of the training data.
• Model architecture simplification
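A minimal PyTorch sketch of these ideas (dropout, L2 via weight decay, and a manual L1 penalty); layer sizes and coefficients are illustrative:

# Sketch: dropout and L2/L1 regularization in PyTorch.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(20, 64),
    nn.ReLU(),
    nn.Dropout(p=0.5),   # randomly zeroes 50% of activations on each forward pass
    nn.Linear(64, 1),
)

# weight_decay adds an L2 penalty, keeping overall weight values small
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)

# L1 regularization can be added to the loss manually to push weights toward zero
def l1_penalty(model, lam=1e-5):
    return lam * sum(p.abs().sum() for p in model.parameters())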
b) Remediating Underfitting
The idea behind ensembling is that by combining the strengths of different models, the weaknesses of
individual models can be mitigated. This leads to improved overall performance.
a) Adaptive Boosting (AdaBoost)
b) Gradient Boosting (GB)
c) Extreme Gradient Boosting (XGBoost)
Boosting algorithm use cases:
• AdaBoost: classification
• Gradient Boosting (GB): classification, regression
• XGBoost: classification, regression, large datasets and big data applications
Bagging (bootstrap aggregation)
Random forests
Stacking
Combines the predictions of multiple base models by training a meta-model on their outputs.
2.3.3 Hyperparameter Tuning
Benefits of Hyperparameter tuning
a) Impact of Hyperparameter tuning on model performance
Careful tuning is needed:
• Learning rate: if the learning rate is too high, the algorithm might overshoot the optimal solution and fail to converge.
• Batch size: a larger batch size can lead to faster convergence but might require more computational resources.
• Epochs: too many can result in overfitting.
Neural networks
• Number of layers: more layers -> more complex. Increasing the depth of a network risks overfitting.
• Number of neurons in each layer: more neurons -> more processing power. Increasing the number of neurons risks overfitting.
• Choice of activation functions: introduce non-linearity into the neural network. Common activation functions include:
o Sigmoid function
o Rectified Linear Unit (ReLU)
o Hyperbolic Tangent (Tanh)
o Softmax function
• Regularization techniques: help prevent overfitting. Common regularization techniques include:
o L1/L2 regularization
o Dropout
o Early stopping
Decision Tree
• Maximum depth of tree: helps manage the complexity of the model and prevent overfitting.
• Minimum samples per split: sets a threshold that the data must meet before splitting a node; prevents the tree from creating too many branches, which also helps prevent overfitting.
• Split criterion: options to select how the algorithm evaluates node splits:
o Gini impurity: measures the purity of the data and the likelihood that data could be misclassified.
o Entropy: measures the randomness of the data. The child node that reduces entropy the most is the split that should be used.
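These decision-tree hyperparameters map directly onto scikit-learn's API; a small, self-contained sketch with illustrative values:

# Sketch: max depth, minimum samples per split, and split criterion in scikit-learn.
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

tree = DecisionTreeClassifier(
    max_depth=5,            # caps tree depth to manage complexity and overfitting
    min_samples_split=20,   # threshold the data must meet before a node splits
    criterion="gini",       # or "entropy" to evaluate candidate splits
)
tree.fit(X, y)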
Hyperparameter tuning techniques
Manual tuning
• Pros: works when you have a good understanding of the problem at hand.
• Cons: time-consuming.
• When to use: domain knowledge and prior experience with similar problems.
Grid search
Systematic and exhaustive approach to hyperparameter tuning. It involves defining all possible hyperparameter values and training and evaluating the model for every combination of these values.
• Pros: reliable technique, especially for smaller-scale problems.
• Cons: computationally expensive.
• When to use: small-scale problems where accuracy matters most.
Random search
• Pros: more efficient than grid search.
• Cons: the optimum hyperparameter combination could be missed.
Bayesian optimization
Uses the performance of previous hyperparameter selections to predict which of the subsequent values are likely to yield the best results.
• Pros: can handle composite objectives; can also converge faster than random search.
• Cons: more complex to implement; works sequentially, so difficult to scale.
• When to use: multiple objectives and/or speed.
2. Specify the hyperparameters to tune and the range of values to use for each of the
following: alpha, eta, max_depth, min_child_weight, and num_round.
3. Identify the objective metric that SageMaker AMT will use to gauge model performance.
4. Configure and launch the SageMaker AMT tuning job, including completion criteria to stop tuning after
the criteria have been met.
5. Identify the best-performing model and the hyperparameters used in its creation.
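A hedged sketch of those steps with the SageMaker Python SDK (the estimator, S3 URIs, objective metric, ranges, and job limits are illustrative assumptions):

# Sketch: SageMaker AMT tuning job over the XGBoost hyperparameters listed above.
from sagemaker.tuner import HyperparameterTuner, ContinuousParameter, IntegerParameter

hyperparameter_ranges = {
    "alpha": ContinuousParameter(0, 2),
    "eta": ContinuousParameter(0.01, 0.5),
    "max_depth": IntegerParameter(3, 10),
    "min_child_weight": ContinuousParameter(1, 10),
    "num_round": IntegerParameter(50, 500),
}

tuner = HyperparameterTuner(
    estimator=xgb_estimator,                  # previously configured estimator (assumed)
    objective_metric_name="validation:rmse",  # metric AMT uses to gauge performance
    objective_type="Minimize",
    hyperparameter_ranges=hyperparameter_ranges,
    max_jobs=20,                              # completion criteria
    max_parallel_jobs=3,
)

tuner.fit({"train": train_s3_uri, "validation": validation_s3_uri})

best_job = tuner.best_training_job()          # best-performing model's training job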
Pruning
Pruning is a technique that removes the least
important parameters or weights from a model.
Quantization
Quantization changes the representation of weights
to its most space-efficient representation.
E.g., instead of a 32-bit floating-point representation
of weight, quantization has the model use an 8-bit
integer representation.
Knowledge distillation
With distillation, a larger teacher model transfers
knowledge to a smaller student model. The student
model is trained on the same dataset as the teacher.
However, the student model is also trained on the
teacher model's knowledge of the data.
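The quantization idea can be illustrated in a few lines of NumPy (a toy sketch of the concept, not the mechanism any specific framework uses):

# Sketch: mapping float32 weights to int8 plus a scale factor (~4x smaller).
import numpy as np

weights = np.random.randn(1000).astype(np.float32)      # 32-bit weights
scale = np.abs(weights).max() / 127.0                    # map the range onto int8
quantized = np.round(weights / scale).astype(np.int8)    # 8-bit representation
dequantized = quantized.astype(np.float32) * scale       # approximate originals at inference
print(weights.nbytes, quantized.nbytes)                  # 4000 bytes vs 1000 bytes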
2.3.5 Refining Pre-trained models
Benefits of Fine tuning
a) Where fine-tuning fits in the training process
• To work with domain-specific language, such as industry jargon, technical terms, or other specialized
vocabulary
• To have responses that are more factual, less toxic, and better aligned to specific requirements
b) Fine-tuning approaches
a) Detecting
• Plot your model's performance over time. If the model's performance on specific tasks decreases significantly after training on new data, it might be a sign of catastrophic forgetting.
• Make sure your validation sets are representative of historic patterns in the data that are still relevant to the problem.
b) Preventing
b) Benefits
• Catalog models for production
• Manage model versions
• Control the approval status of models within your ML pipeline
Root Mean Square Error (RMSE)
Square root of MSE, in the same units as the target variable.
• When you want the error in the same units as the target variable
• For easier interpretation of the error magnitude
• When comparing models with different scales
R-Squared (R²)
Proportion of variance in the dependent variable explained by the independent variables.
• To understand how well the model fits the data
• When you want a metric bounded between 0 and 1
• For comparing models across different datasets
a) Impact of convergence
This is where SageMaker AMT can help. It can automatically tune models by finding the optimal
combination of hyperparameters, such as
Improve CNN
How SageMaker AMT improves issues with local maxima and local minima
When training a deep CNN for image classification, the optimization process can encounter saddle points or local minima. This is because the loss function landscape in high-dimensional spaces can be complex. Multiple local minima and saddle points can trap the optimization algorithm, leading to suboptimal convergence.
This is where SageMaker Training Compiler can help. It can automatically apply optimization techniques like
• tensor remapping
• operator fusion
• kernel optimization.
Debug Model Convergence with SageMaker Debugger
Amazon MWAA (Managed Workflows for Apache Airflow)
• When you're familiar with Apache Airflow and prefer DAG-based workflows
• For complex scheduling requirements
• When you need to integrate with both AWS and non-AWS services
b) Comparisons: AWS Controllers for Kubernetes (ACK) and SageMaker Components for Kubeflow
Pipelines.
Inference Option / Description / When to Choose
• create ML solutions that anonymize sensitive data, such as personally identifiable information
• guides the configuration of least-privileged access to your data and resources
• suggests configurations for your AWS account structures and Amazon Virtual Private Clouds to
provide isolation boundaries around your workloads.
• helps construct ML solutions that are resistant to disruption while recovering quickly
• guides you to design data processing workflows to be resilient to failures by implementing
error handling, retries, and fallback mechanisms
• recommends data backups, and versioning.
IaC tool comparison
CloudFormation
• Description: AWS-native IaC service
• Language support: JSON, YAML
• Multi-cloud support: AWS only
• Typical use cases: AWS-only deployments; teams familiar with the AWS ecosystem; simple to moderate complexity deployments
CDK (Cloud Development Kit)
• Description: IaC framework that compiles to CloudFormation
• Language support: TypeScript, Python, Java, C#, Go
• Multi-cloud support: AWS only (can be extended)
• Typical use cases: teams with strong programming skills; complex AWS infrastructures; reusable ML infrastructure components
Terraform
• Description: open-source IaC tool
• Language support: HCL, JSON
• Multi-cloud support: excellent
• Typical use cases: multi-cloud ML deployments; hybrid cloud scenarios; teams preferring declarative syntax
Pulumi
• Description: modern IaC platform
• Language support: TypeScript, Python, Go, .NET
• Multi-cloud support: excellent
• Typical use cases: infrastructure requiring complex logic; teams preferring familiar programming languages; multi-cloud, complex architectures
Working with CloudFormation
a) Template
• AWS CDK Construct Library: This library contains a collection of pre-written modular and reusable
pieces of code called constructs. These constructs represent infrastructure resources and collections
of infrastructure resources.
• AWS CDK Toolkit: This is a command line tool for interacting with CDK apps. Use the AWS CDK Toolkit
to create, manage, and deploy your AWS CDK projects.
b) CDK LifeCycle
cdk init
When you begin your CDK project, you create a directory for it, run cdk init, and
specify the programming language used:
• mkdir my-cdk-app
• cd my-cdk-app
• cdk init app --language typescript
cdk bootstrap
You then run cdk bootstrap to prepare the environments into which the stacks
will be deployed. This creates the special dedicated AWS CDK resources for the
environments.
cdk synth
Next, you run cdk synth to synthesize the stacks defined in your app into CloudFormation templates.
cdk deploy
Finally, you run the cdk deploy command to have CloudFormation provision the resources defined in the synthesized templates.
Comparing CF and CDK
The AWS CDK consists of the AWS CDK Construct Library and the AWS CDK Toolkit described above.
from sagemaker.workflow.pipeline import Pipeline

# Assemble the pipeline from previously defined parameters and steps
pipeline = Pipeline(
    name=pipeline_name,
    parameters=[input_data, processing_instance_type,
                processing_instance_count, training_instance_type,
                mse_threshold, model_approval_status],
    steps=[step_process, step_train, step_evaluate, step_conditional],
)
b) Automating common tasks with the SageMaker Python SDK
You use this configuration in a deploy() method. If the model artifact has already been created, you use the Model class to create a SageMaker model from it, specifying the artifact's location in Amazon S3 and the inference code as the entry_point (see the sketch below):
After deployment is complete, you can use the predictor’s predict() method to invoke the serverless
endpoint:
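A hedged sketch of both calls (the image URI, role, S3 path, and serverless settings are placeholders):

# Sketch: create a SageMaker model from an existing artifact, deploy it to a
# serverless endpoint, and invoke it with predict().
from sagemaker.model import Model
from sagemaker.predictor import Predictor
from sagemaker.serverless import ServerlessInferenceConfig

model = Model(
    image_uri=image_uri,                                  # inference container image (assumed defined)
    model_data="s3://my-bucket/model/model.tar.gz",       # placeholder artifact location
    role=role,                                            # execution role ARN (assumed defined)
    entry_point="inference.py",                           # custom inference code
    predictor_cls=Predictor,                              # so deploy() returns a Predictor
)

serverless_config = ServerlessInferenceConfig(memory_size_in_mb=2048, max_concurrency=5)
predictor = model.deploy(serverless_inference_config=serverless_config)

result = predictor.predict(payload)                       # invoke the serverless endpoint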
c) Building and Maintaining Containers
Training Container
Inference container
Entry point files
• serve.py: runs when the container is started for hosting. It starts the inference server, including the nginx web server and Gunicorn as a Python web server gateway interface.
• predictor.py: this Python script contains the logic to load and perform inference with your model. It uses Flask to provide the /ping and /invocations endpoints.
• wsgi.py: a wrapper for the Gunicorn server.
• nginx.conf: a script to configure the web server, including listening on port 8080. It forwards requests containing either /ping or /invocations paths to the Gunicorn server.
When creating or adapting a container for performing real-time inference, your container must
meet the following requirements:
• Your container must include the path /opt/ml/model. When the inference container starts, it
will import the model artifact and store it in this directory.
Note: This is the same directory that a training container uses to store the newly trained model
artifact.
• Your container must be configured to run as an executable. Your Dockerfile should include an
ENTRYPOINT instruction that defines an executable to run when the container starts, as
ENTRYPOINT ["<language>", "<executable>"]
e.g. ENTRYPOINT ["python", "serve.py"]
• Your container must accept POST requests to the /invocations and /ping real-time endpoints.
• Requests sent to these endpoints must be returned within 60 seconds and have a size of less than 6 MB.
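A minimal sketch of a Flask app meeting the /ping and /invocations requirements (model loading and prediction logic are placeholders):

# Sketch: minimal predictor-style Flask app for a custom inference container.
import json
import flask

app = flask.Flask(__name__)
MODEL_DIR = "/opt/ml/model"     # SageMaker places the model artifact here
model = None                    # load your model from MODEL_DIR at startup (placeholder)

@app.route("/ping", methods=["GET"])
def ping():
    # Health check: return 200 when the container is ready to serve requests
    return flask.Response(response="\n", status=200, mimetype="application/json")

@app.route("/invocations", methods=["POST"])
def invocations():
    payload = json.loads(flask.request.data)
    prediction = {"result": "placeholder"}   # replace with real model inference
    return flask.Response(response=json.dumps(prediction), status=200,
                          mimetype="application/json")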
Auto scaling strategy
a) SageMaker model auto scaling methods
(GitHub vs GitLab)
• Project Management: GitHub offers Projects and Kanban boards; GitLab offers issue boards, epics, and roadmaps
• Third-party Integrations: GitHub has an extensive marketplace; GitLab has fewer, but strong built-in tools
• Unit tests
validate smaller components like individual functions or methods.
• Integration tests
can check that pipeline stages, including data ingestion, training, and deployment, work together
correctly. Other types of integration tests depend on your system or architecture.
• Regression tests
In practice, regression testing is re-running the same tests to make sure something that used to work
was not broken by a change.
(GitFlow vs GitHub Flow)
• Feature Development: GitFlow uses feature branches from develop; GitHub Flow uses feature branches from main
• Release Process: GitFlow uses dedicated release branches; GitHub Flow merges directly to main via pull requests
• Suited For: GitFlow suits scheduled releases and larger projects; GitHub Flow suits continuous delivery and smaller projects
c) GitFlow
3.3.3 AWS Software Release Processes
Continuous Delivery Services
a) AWS CI/CD Pipeline
CodeDeploy
• Purpose: automates application deployments to various compute platforms.
• Setup: 1. Set up IAM role; 2. Create CodeDeploy app; 3. Create deployment group; 4. Define deployment configuration.
• Troubleshooting: review CodeDeploy logs; verify the CodeDeploy agent; validate the AppSpec file; check instance health; analyze the rollback reason.
• Primary Purpose: CI/CD and release automation vs workflow orchestration and coordination
• Workflow Type: linear, predefined stages vs complex, branching workflows with conditional logic
Deployment strategy comparison
Blue/green
• When you need instant rollback capability
• For critical applications requiring zero downtime
• When your application can handle sudden traffic shifts
Canary
• To test new features with a subset of users
• When you want to gather user feedback before full release
• For applications with high traffic where you want to minimize risk
Rolling
• When you have a stateful application
• For large-scale deployments where cost is an issue
• When you can tolerate having mixed versions temporarily
The baking period is a set time for monitoring the green fleet's performance before completing the full transition, making it possible to roll back if alarms trip. This period builds confidence in the new deployment before the permanent cutover.
3.3.4 Retraining models
Retraining models
a) Retraining mechanisms
Catastrophic forgetting is a type of over-fitting. The model learns the training data too well such that it no
longer performs well on other data.
• Architectural: modify the network architecture to accommodate new tasks.
o Pros: can be very effective; doesn't require old data.
o Cons: may increase model complexity; can be challenging to design.
Training
• Standalone item not integrated into the application stack
• Runs in the cloud
• Typically runs less frequently and on an as-needed basis
• Compute capacity requirements are typically predictable, so auto scaling isn't required
Inferencing
• Integrated into application stack workflows
• Runs on different devices at the edge and in the cloud
• Runs for an indefinite amount of time
• Compute capacity requirements might be dynamic and unpredictable, so auto scaling is required
Domain 4: Monitor Model
4.1 Monitor Model Performance and Data Quality
4.1.1 Monitoring Machine Learning Solutions
Importance of Monitoring in ML
a) Machine Learning Lens: AWS Well-Architected Framework: Best practices and design principles
Automate Retraining
Detecting Drift in Monitoring
a) Drift Types
Note: Bias is the inverse of variance, which is the level of small fluctuations or noise common in complex data sets. Bias tends to cause model predictions to overgeneralize, and variance tends to cause models to undergeneralize. Increasing variance is one method for reducing the impact of bias.
b) Monitoring Drift
STEPS
To monitor model quality, SageMaker Model Monitor requires the following inputs:
1. Baseline data
2. Inference input and predictions made by the deployed model
3. Amazon SageMaker Ground Truth associated with the inputs to the model
Post-training bias metrics in SageMaker Clarify help us answer two key questions:
• Are all facet values represented at a similar rate in positive (favorable) model predictions?
• Does the model have similar predictive performance for all facet values?
How it works: It quantifies the contribution of each input feature (for example, audio characteristics)
to the model's predictions, helping to explain how the model arrives at its decisions.
Options for using SageMaker Clarify
Uses SHAP
SageMaker Clarify provides feature attributions based on the concept of Shapley value. This is a game-
theoretic approach that assigns an importance value (SHAP value) to each feature for a particular
prediction.
1. SageMaker Clarify: this is the core component that performs the actual bias detection and generates quality metrics and violations.
2. SageMaker Model Monitor: This is the framework that can use Clarify's capabilities to
perform continuous monitoring of deployed models.
SageMaker Model Dashboard
Features
1. Alerts :
How it helps: The dashboard provides a record of all activated alerts, allowing the data
scientist to review and analyze past issues.
Alert criteria depend upon two parameters:
• Datapoints to alert: Within the evaluation period, how many runtime failures raise an alert?
• Evaluation period: The # of most recent monitoring executions to consider when evaluating
alert status.
2. Risk rating
A user-specified parameter from the model card with a low, medium, or high value.
3. Endpoint performance
You can select the endpoint column to view performance metrics, such as:
• CpuUtilization: The sum of each individual CPU core's utilization from 0%-100%.
• MemoryUtilization: The % of memory used by the containers on an instance, 0%-100%.
• DiskUtilization: The % of disk space used by the containers on an instance, 0%-100%.
This information helps you determine if a model is actively used for batch inference.
When training a model, SageMaker creates a model lineage graph, a visualization of the entire ML
workflow from data preparation to deployment.
• Stakeholder notifications: When monitoring metrics indicate changes that impact business
KPIs or the underlying problem
• Data Scientist notification: You can use automated notifications to data scientists when
your monitoring detects data drift or when expected data is missing.
• Model retraining: Configure your model training pipeline to automatically retrain models
when monitoring detects drift, bias, or performance degradation.
• Autoscaling: You use resource utilization metrics gathered by infrastructure monitoring to
initiate autoscaling actions.
Scheduled retraining
• When to use: when there are known seasonal patterns; for maintaining model accuracy over time.
• Pros: predictable maintenance schedule; can anticipate and prepare for retraining periods.
• Cons: may retrain unnecessarily if no significant changes occur; might miss sudden, unexpected changes.
4.2 Monitor and Optimize Infrastructure and Costs
4.2.1 Monitor Infrastructure
Monitor Performance Metrics - CloudWatch vs Model Monitor
Feature: SageMaker Model Monitor vs CloudWatch Logs
• Alert System: Model Monitor sets alerts for deviations in model quality; CloudWatch Logs sends notifications based on preset thresholds
• Customization: Model Monitor offers pre-built monitoring capabilities (no coding) plus custom analysis options; CloudWatch Logs offers customizable log patterns and anomaly detection
• Definition: Model Monitor is continuous collection and analysis of metrics; CloudWatch Logs provides deep insights into the internal state and behavior of ML systems
• Outcome: Model Monitor detects issues and invokes alerts or automated actions; CloudWatch Logs provides deeper insights for troubleshooting and optimization
• Scope: Model Monitor is primarily focused on predefined metrics and thresholds; CloudWatch Logs enables asking and answering questions about system behavior
Monitoring Tools (for Performance and Latency)
AWS X-Ray
• Key features: works across AWS and third-party services; generates detailed service graphs; identifies performance bottlenecks.
• Compatible services: EC2, ECS, Lambda, Elastic Beanstalk.
CloudWatch Lambda Insights
• Key features: monitors metrics (memory, duration, invocation count); provides detailed logs and traces; helps identify bottlenecks in Lambda functions.
• Compatible services: Lambda.
CloudWatch Logs Insights
• Key features: interactive querying and analysis of log data; correlates log data from different sources; visualizes time series data; supports aggregations, filters, and regex.
• Compatible services: any service that generates logs in CloudWatch.
QuickSight
• Key features: interactive dashboards; ML-powered insights; supports various data sources.
• Compatible services: various AWS services and external data sources.
SageMaker w/ EventBridge
c) How to start
On-Demand Instances
• Description: pay-per-use with no long-term commitment.
• Best for: short-term, unpredictable workloads.
• Savings: none (baseline).
• Example: real-time inference services.
Reserved Instances
• Description: discounted rates for 1- or 3-year commitments.
• Best for: steady-state, predictable workloads.
• Savings: up to 72% vs On-Demand.
• Example: long-running ML training jobs.
Savings Plans for SageMaker
• Description: commit to a specific compute usage for 1 or 3 years.
• Best for: flexible, recurring SageMaker usage.
• Savings: up to 64% vs On-Demand.
• Example: regular model training and deployment.
4.3 Secure AWS ML Resources
Service Roles
• SageMaker Execution: allows SageMaker to perform tasks on behalf of users; used for general SageMaker operations.
• Processing Job: specific to SageMaker processing jobs; used for data processing tasks.
• Training Job: specific to SageMaker training jobs; used for model training tasks.
• Model: specific to SageMaker model deployment; used for model deployment and hosting.
Example policies (ID / Purpose / Key Permissions / Resource Scope / Notes)
1. Least privilege access for ML workflow
• Key permissions: SageMaker: CreateTrainingJob, CreateModel; S3: GetObject, PutObject; ECR: BatchGetImage; CloudWatch: PutMetricData.
• Resource scope: specific ARNs for each service.
• Notes: adheres to the principle of least privilege.
2. Read metadata of ML resources
• Key permissions: machinelearning:Get*, machinelearning:Describe*.
• Resource scope: specific MLModel ARNs for Get*; * (all) for Describe*.
• Notes: allows reading metadata but not modifying resources.
3. Create ML resources
• Key permissions: machinelearning:CreateDataSourceFrom*, machinelearning:CreateMLModel, machinelearning:CreateBatchPrediction, machinelearning:CreateEvaluation.
• Resource scope: * (all).
• Notes: cannot be restricted to specific resources.
4. Manage real-time endpoints and predictions
• Key permissions: machinelearning:CreateRealtimeEndpoint, machinelearning:DeleteRealtimeEndpoint, machinelearning:Predict.
• Resource scope: specific MLModel ARN.
• Notes: allows management of endpoints for a specific model.
Detailed examples
4. Allow users to create /delete real-time endpoints and perform real-time predictions on an ML model
Detailed examples
• S3
• CloudWatch Logs
• SageMaker runtime
• SageMaker API
4.3.3 SageMaker Compliance & Governance
AWS Services for Compliance and Governance
Service / Purpose / Key Features / ML-Related Use Case
• Amazon Inspector: automated vulnerability management; continuously scans for vulnerabilities; ML-related use case: scan container images in ECR for ML model deployments.
Governance/Framework / Description / AWS Services to Use
• ISO 27001 (Information Security Management System standard): AWS Config, AWS Security Hub
• SOC 2 (Service Organization Control for service organizations): AWS Artifact, AWS Config, SageMaker Model Cards
• PCI-DSS (Payment Card Industry Data Security Standard): AWS Config, AWS WAF, Amazon Inspector
• HIPAA (Health Insurance Portability and Accountability Act): AWS Artifact, AWS Security Hub, AWS Config, AWS CloudTrail
• FedRAMP (Federal Risk and Authorization Management Program): AWS Config
• CloudTrail Logs: monitor API calls (caller identity, timestamps, API details); use to identify unauthorized API calls to SageMaker resources.
• Data Event Logs: monitor data plane operations (input/output data for training and inference); use to verify whether unauthorized entities accessed model data.
• AWS PrivateLink: enhance network security with private connections between your VPC and SageMaker; use to ensure traffic remains within the AWS network.
Domain X: Misc
X.1 SageMaker Deep Dive
X.1.1 Fully Managed Notebook Instances with Amazon SageMaker
Elastic Inference
Elastic Inference is a service that allows attaching a portion of a GPU to an existing EC2 instance. This approach is particularly useful when running inference locally on a notebook instance. By selecting an appropriate Elastic Inference configuration based on size, version, and bandwidth, users can accelerate their inference tasks without needing a full GPU.
• LDA: unsupervised, topic modeling
• NTM: unsupervised, topic modeling
5. Notebook Instance: interactive development and training using Jupyter notebooks on managed instances.
• When to choose: during the initial stages of model development; when you need an interactive environment for debugging and visualization.
Key Considerations:
• Skill Level: Built-in Algorithms and Marketplace for beginners, Script Mode and Containers
for more advanced users
• Customization Needs: From low (Built-in) to high (Containers)
• Development Speed: Notebooks for rapid prototyping, Built-in for quick deployment,
Containers for complex but reproducible setups
• Scale: Consider moving from Notebooks to other options as your data and model
complexity grow.
X.1.4 Train Your ML Models with Amazon SageMaker
Splitting Data for ML
X.1.5 Tuning Your ML Models with Amazon SageMaker
Maximizing Efficiency across tuning jobs
How to automate
Add a check: if accuracy falls below a threshold (for example, 80%), invoke a human-in-the-loop review (see the sketch below).
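One way this check might look (the evaluation helper, flow definition ARN, and loop name are hypothetical):

# Sketch: trigger an Amazon A2I human review loop when accuracy drops below 80%.
import json
import boto3

a2i = boto3.client("sagemaker-a2i-runtime")

accuracy = evaluate_model()    # hypothetical helper returning accuracy in [0, 1]

if accuracy < 0.80:
    a2i.start_human_loop(
        HumanLoopName="low-accuracy-review-001",
        FlowDefinitionArn="arn:aws:sagemaker:<region>:<account>:flow-definition/my-review-flow",
        HumanLoopInput={"InputContent": json.dumps({"accuracy": accuracy})},
    )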
X.1.6 Add Debugger to Training Jobs in Amazon SageMaker
How it works
1. Add debugging hook:
o An EC2 instance with an attached EBS volume is used to initiate the process.
o The debugging hook is added to the training job configuration.
2. Hook listens to events and records tensors:
o Docker containers running on EC2 instances are used for the training job.
o The hook listens for specific events during the training process and records tensor data.
3. Debugger applies rules to tensors:
o Another EC2 instance with a Docker container is used for debugging.
o The debugger applies predefined built-in rules to the recorded tensor data.
Benefits of debugger
1. Comprehensive Built-in Rules/Algorithms: The debugger offers a wide range of built-in rules to
detect common issues in machine learning models, such as:
3. Easy Integration: The entry point is 'mnist.py' and it works with SageMaker's built-in algorithms
(1P SM algos), suggesting easy integration with existing SageMaker workflows.
4. No Code Changes Required: The "No Change Needed" text implies that adding debugging
capabilities doesn't require modifying the existing model code.
6. Real-time Monitoring: The variety of rules suggests that the debugger can monitor various
aspects of model training in real-time, helping to identify issues as they occur.
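A hedged sketch of attaching built-in Debugger rules to an estimator without changing the training script (the framework version, instance type, role, and data location are illustrative):

# Sketch: built-in SageMaker Debugger rules attached to a training job.
from sagemaker.debugger import Rule, rule_configs
from sagemaker.pytorch import PyTorch

rules = [
    Rule.sagemaker(rule_configs.vanishing_gradient()),
    Rule.sagemaker(rule_configs.overfit()),
    Rule.sagemaker(rule_configs.loss_not_decreasing()),
]

estimator = PyTorch(
    entry_point="mnist.py",        # entry point mentioned in the notes
    role=role,                      # assumed defined
    framework_version="2.1",
    py_version="py310",
    instance_type="ml.m5.xlarge",
    instance_count=1,
    rules=rules,                    # Debugger evaluates these rules during training
)
estimator.fit({"training": "s3://my-bucket/mnist-data/"})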
Deployment Strategy / Description / When to Use