AWS Machine Learning Specialty Master Cheat Sheet
Distributions
PDF continuous (normal distribution)
PMF (mass) discrete
Poisson – number of events in a fixed interval when the average rate of events is
known. Poisson = discrete
Binomial multiple 0/1 trials
Bernoulli = special case of binomial where we have ONE trial
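A minimal sketch of the three discrete distributions above, using only the standard library. The function names are made up for illustration.

```python
import math

def bernoulli_pmf(k, p):
    """P(X = k) for a single 0/1 trial with success probability p."""
    return p if k == 1 else 1 - p

def binomial_pmf(k, n, p):
    """P(exactly k successes in n independent 0/1 trials)."""
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

def poisson_pmf(k, lam):
    """P(exactly k events in an interval whose average event count is lam)."""
    return math.exp(-lam) * lam**k / math.factorial(k)

# Bernoulli is the binomial with ONE trial (n = 1)
same = math.isclose(binomial_pmf(1, 1, 0.3), bernoulli_pmf(1, 0.3))
```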
Time Series
Know the difference between Seasonality vs. Trends
Noise is present in time series
Additive
Multiplicative (scales with trends)
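A toy illustration of additive vs. multiplicative seasonality. The numbers are invented and noise is left out to keep the pattern visible.

```python
trend = [10, 20, 30, 40]

# Additive: the seasonal swing stays the same size as the trend grows
seasonal_add = [5, -5, 5, -5]
additive = [t + s for t, s in zip(trend, seasonal_add)]

# Multiplicative: the seasonal swing scales with the trend
seasonal_mul = [1.5, 0.5, 1.5, 0.5]
multiplicative = [t * s for t, s in zip(trend, seasonal_mul)]

# Size of the swing away from the trend at each step
add_swing = [abs(v - t) for v, t in zip(additive, trend)]
mul_swing = [abs(v - t) for v, t in zip(multiplicative, trend)]
```

In the additive series the swing is constant; in the multiplicative series it grows with the trend.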
SKILLCERTPRO
L1 regularization vs L2
L1 penalty is the sum of |weights|, L2 penalty is the sum of weights^2
L1 performs feature selection, some features can go to 0, can reduce dimensionality,
more computationally intensive
L2 nothing goes to zero. Computationally efficient. Use L2 when we think all features
are important
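A small sketch of the two penalty terms that get added to the loss. The regularization strength `lam` and the example weights are assumptions for illustration.

```python
def l1_penalty(weights, lam):
    """L1 (lasso) term: lam * sum of absolute weights."""
    return lam * sum(abs(w) for w in weights)

def l2_penalty(weights, lam):
    """L2 (ridge) term: lam * sum of squared weights."""
    return lam * sum(w * w for w in weights)

weights = [0.5, -2.0, 0.0, 3.0]
l1 = l1_penalty(weights, lam=0.1)
l2 = l2_penalty(weights, lam=0.1)
```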
Hyperparameters (we set before training starts) vs parameters (internal to model
that get tuned during training)
Hyperparameter
Learning rate 0-1, step size
Batch size, number of samples to train at a time
Epoch, number of times we will process the training data
Cross Validation
Don’t hold out specific records for validation
Retrain by repartitioning and holding out a % of the data for validation each time
k-fold cross validation
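A rough sketch of how k-fold partitions the data so that every record is held out exactly once. No shuffling is done here, which real pipelines usually add; the function name is hypothetical.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs; each sample is in validation once."""
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most 1
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, val
        start += size

folds = list(k_fold_indices(10, 3))
```

You would train k models, one per (train, val) pair, and average the validation metric.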
Feature Selection and Engineering
Selecting relevant data to be trained on
Remove unneeded data (low correlation, low variance, missing data) we don’t need
to train our model. Makes model training faster and hopefully more accurate
Feature correlation, e.g. Age and Height
Selection requires trial and error, and domain knowledge
Engineering new features based on existing features. E.g. Height/age, or pulling the
weekday from a date
PCA and K-means clustering (both unsupervised) can help reduce the feature set
Principal Component Analysis PCA
Unsupervised learning algorithm
Used in data preparation (pre-processing); looks for relationships between features
and uses them to reduce the number of dimensions
Find central point of all data on n-dimensional graph
Turn that point into the origin on the graph
Draw a minimum bounding box around all of the points
Longest length of the box is PC1, next longest is PC2, etc.
Take out dimensions that don’t affect the data much
Project higher dimensional data into lower dimensional (like a 2d plot)
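A minimal 2-D sketch of the steps above, assuming the usual variance-based definition (PC1 is the direction of greatest variance, which is what the bounding-box picture approximates). It uses the closed-form angle for a 2x2 covariance matrix; real work would use a library PCA.

```python
import math

def pca_2d(points):
    """Return (pc1_direction, 1-D projection of the points onto PC1)."""
    n = len(points)
    # Steps 1-2: find the central point and make it the origin
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    centered = [(x - mean_x, y - mean_y) for x, y in points]
    # Covariance matrix entries [[sxx, sxy], [sxy, syy]]
    sxx = sum(x * x for x, _ in centered) / n
    syy = sum(y * y for _, y in centered) / n
    sxy = sum(x * y for x, y in centered) / n
    # Angle of the major axis of a symmetric 2x2 matrix
    theta = 0.5 * math.atan2(2 * sxy, sxx - syy)
    pc1 = (math.cos(theta), math.sin(theta))
    # Drop PC2: project every point onto PC1 (2-D down to 1-D)
    projected = [x * pc1[0] + y * pc1[1] for x, y in centered]
    return pc1, projected

pc1, projected = pca_2d([(1, 1), (2, 2), (3, 3), (4, 4)])
```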
Missing and Unbalanced Data (Imputation)
Impute a value that is missing, take the mean (Mean replacement of the column).
Median might work as well. But this isn’t very great TBH
Remove the sample altogether
Remove the column or feature altogether
Unbalanced – outliers
Outlier detection – random cut forests (AWS developed algorithm)
Rule of thumb: points more than 1 – 2 std dev from the mean are outlier candidates.
Std dev = sqrt(variance). variance = sum of (each point – mean)^2 / number of samples
Unbalanced – not enough examples for all of our classes
Can create fake data “synthesize data” using expert domain knowledge to create
more examples for your class
Actual good imputation methods
K nearest neighbours (numerical data)
Deep learning
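A short sketch of mean replacement and a standard-deviation outlier check from this section. Missing values are represented as `None` and the threshold of 2 std dev is an assumption.

```python
import statistics

def impute_mean(column):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in column if v is not None]
    mean = statistics.fmean(observed)
    return [mean if v is None else v for v in column]

def std_dev_outliers(values, threshold=2.0):
    """Return values lying more than `threshold` std devs from the mean."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # sqrt of the population variance
    return [v for v in values if abs(v - mean) > threshold * std]

filled = impute_mean([1.0, None, 3.0])
outliers = std_dev_outliers([10, 11, 9, 10, 10, 11, 9, 50])
```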
ML Algorithms
Logistic Regression
Supervised
Binary yes no output
Fit a sigmoid function (S shaped), less likely to be skewed by outliers
Linear Regression
Supervised
Numeric output
SVMs
Supervised
Classification output
Partition into groups with furthest distance
Where’s the best line or hyperplane to separate two classes?
Decision Trees
Supervised
Binary, Numeric and Classification output
Root node is one with most correlation with the label
Random Forests
Supervised
Binary, Numeric and Classification output
A collection of decision trees
DTs have a drawback, inaccuracies
RF makes DTs more accurate
RF picks random features and ignores the other features. Builds a DT. Repeat this
so that we get many DTs
We run a record through all of the trees to get our result. Then we vote based on all
of the results
K-Means
Unsupervised clustering
Output is a cluster assignment for each point (groups, not predefined labels)
K is the number of clusters we want to find
Tries to find centre points for each cluster until we reach equilibrium
What value of K should we use?
Use variation, least variation wins
Plot the reduction in variation vs. number of clusters
This graph looks like an elbow plot. The elbow’s number of clusters is what we want
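A tiny 1-D k-means sketch to show where the elbow comes from. The data, the deterministic initialisation, and the fixed iteration count are all simplifying assumptions; this is not how SageMaker's K-Means is implemented.

```python
def kmeans_1d(values, k, iterations=20):
    """Tiny 1-D k-means; returns (centres, within-cluster sum of squares)."""
    srt = sorted(values)
    # Deterministic start: spread the initial centres across the sorted data
    centres = [srt[(2 * i + 1) * len(srt) // (2 * k)] for i in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda c: abs(v - centres[c]))
            clusters[nearest].append(v)
        # Move each centre to the mean of its cluster (keep it if empty)
        centres = [sum(c) / len(c) if c else centres[i]
                   for i, c in enumerate(clusters)]
    variation = sum(min((v - c) ** 2 for c in centres) for v in values)
    return centres, variation

data = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8, 9.0, 9.2, 8.8]  # three obvious groups
variation = {k: kmeans_1d(data, k)[1] for k in range(1, 5)}
```

Variation drops sharply up to K = 3 and barely improves after that, so the elbow is at 3.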
K-Nearest Neighbour
Supervised
Classification
Often used after K-means
Deep Learning
Activation Functions
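Linear (not useful with backprop: the derivative is a constant, and stacked linear layers
collapse into one). Does ‘nothing’, just outputs the input that was given
Binary step function (looks like _|‾) – derivative is 0, so it doesn’t work with gradient descent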
We want non-linear activation functions:
Sigmoid (Logistic)
TanH (more widely used than sigmoid), centred around 0. Good for RNN
ReLU rectified linear unit (looks like this _/). Very popular, fast computation
Leaky ReLU, other variants
Softmax – often the final output layer of classification problem. Converts outputs to
probabilities of each classification. Only outputs ONE label
Sigmoid can output more than one label, e.g image has X and Y
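The activation functions above written out as plain Python, as a quick reference. The leaky ReLU slope of 0.01 is a common default but an assumption here.

```python
import math

def sigmoid(x):
    """Squashes any input into (0, 1)."""
    return 1 / (1 + math.exp(-x))

def tanh(x):
    """Squashes any input into (-1, 1), centred around 0."""
    return math.tanh(x)

def relu(x):
    """0 for negative inputs, the input itself otherwise."""
    return max(0.0, x)

def leaky_relu(x, slope=0.01):
    """Like ReLU but lets a small negative signal through."""
    return x if x > 0 else slope * x

def softmax(xs):
    """Turns a list of scores into probabilities that sum to 1."""
    largest = max(xs)
    exps = [math.exp(x - largest) for x in xs]  # shift for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([1.0, 2.0, 3.0])
```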
Neural Networks
Input layer, hidden layers, output layers
Activation function inside hidden layers introduces nonlinearity,
sigmoid (0 to 1), ReLU family (most common), Tanh (-1, 1)
ReLU (0 to infinity) piecewise, looks like: _/
A Bias is introduced in hidden layers to prevent a 0 value, as 0* anything is 0, and
keep this neuron “active” in the network
Adjusting bias and weights is how we tune a NN
Forward pass
Back propagation to optimize a loss function using Gradient descent
One forward + backward pass over a batch = 1 iteration; one pass over the whole training set = 1 epoch
Convolutional NN
Supervised, classification
Used when we don’t know where to find our feature, e.g. find something in an image,
find features in a sentence
Used in image classification
Hidden layers are convolutional layers
A convolutional neuron does more than just an activation function
Combine multiple filters and those calculations form the value that is output from the
neuron
Filters can be pre trained
Recurrent NN
Supervised, other output (time series data, voice recognition, translation)
RNN can remember a bit, a small amount of data from past inferences
Deals with sequences in time, or sentences, stock behavior, website logs
Long short term memory (LSTM) and Gated Recurrent Units (GRU) are RNNs.
These solve the issue where the RNN is more biased towards more recent data
compared to earlier data
GRU slightly less performant but trains faster as it is simpler
LSTM more performant but more computationally expensive
The previous input into the neuron at time 1 is an input for the next activation at time
2. It is a “memory cell”
Learning Rate
Is a hyperparameter
Too high, can overshoot the minima and miss the optimal solution (potentially)
Too low means we take a ton of epochs to reach the minima (increase training
time)
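A toy demonstration of the learning-rate trade-off, minimising f(x) = x^2 with plain gradient descent. The function, starting point, and step count are chosen only for illustration.

```python
def gradient_descent(learning_rate, steps=20, start=5.0):
    """Minimise f(x) = x**2 (gradient is 2x) and return the final x."""
    x = start
    for _ in range(steps):
        x -= learning_rate * 2 * x
    return x

good = gradient_descent(0.1)        # steadily approaches the minimum at 0
too_low = gradient_descent(0.001)   # barely moves in 20 steps
too_high = gradient_descent(1.1)    # overshoots further on every step
```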
Batch Size
# of training samples used in each iteration (one weight update)
Smaller batch sizes tend to not get stuck in local minima when compared to large
batch sizes
Large batch sizes can converge on the wrong solution at random
ROC/AUC
How to set threshold, or cutoff point for sensitivity vs specificity
Build lots of confusion matrices and graph Sensitivity vs (1- specificity)
It’s a graph from 0 to 1
That line is receiver operating characteristics. Look for the knee points
It allows us to find the best model for max specificity and max sensitivity
AUC (area under the curve) measures how well the model performs. More area is better, max
area is 1
ROC – balance between sensitivity and specificity
AUC – compare different models in terms of their separation power. 0.5 is useless as
it’s the diagonal line. 1 is perfect
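A sketch of building the ROC curve by sweeping the threshold and then taking the area with the trapezoid rule. The labels and scores are invented; sensitivity is the true positive rate and (1 - specificity) is the false positive rate.

```python
def roc_points(labels, scores):
    """Return (1 - specificity, sensitivity) pairs, one per threshold."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    # Each distinct score is a candidate cutoff, i.e. one confusion matrix
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points

def auc(points):
    """Trapezoidal area under the ROC curve."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))

labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
area = auc(roc_points(labels, scores))
perfect = auc(roc_points([1, 1, 0, 0], [0.9, 0.8, 0.2, 0.1]))
```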
Gini Impurity
Information gain algorithm to see how to create the first node and make the best split
in a decision tree
1 – (probability of class A)^2 – (probability of other)^2
Then calculate a weighted avg using the total numbers and the two gini impurities
Repeat for all of the features that exist
Compare the weighted gini impurity. Lowest is better. It best separates the classes.
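The Gini calculation above as a small sketch. Each side of a candidate split is given as a list of class counts; the example counts are made up.

```python
def gini_impurity(class_counts):
    """1 - sum of squared class probabilities for one side of a split."""
    total = sum(class_counts)
    return 1 - sum((c / total) ** 2 for c in class_counts)

def weighted_gini(left_counts, right_counts):
    """Weighted average of the two sides' impurities, by sample count."""
    n_left, n_right = sum(left_counts), sum(right_counts)
    n = n_left + n_right
    return ((n_left / n) * gini_impurity(left_counts)
            + (n_right / n) * gini_impurity(right_counts))

# Candidate split: left side has 8 of class A and 2 of B, right has 1 and 9
split_score = weighted_gini([8, 2], [1, 9])
```

Compute this for every candidate feature and split on the one with the lowest score.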
F1 Score
Often used as a replacement for accuracy
F1 combines recall and precision, takes into account more than accuracy
2/(1/recall + 1/precision) or 2TP/(2TP+FP+FN)
Higher is better
Use when we care about both precision and recall
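Both forms of the F1 formula give the same number, as this quick check shows. The confusion-matrix counts are invented, and the sketch assumes TP > 0 so nothing divides by zero.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1) from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 / (1 / recall + 1 / precision)  # harmonic mean
    return precision, recall, f1

precision, recall, f1 = precision_recall_f1(tp=8, fp=2, fn=4)
f1_from_counts = 2 * 8 / (2 * 8 + 2 + 4)
```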
TF-IDF
Term frequency and Inverse Document Frequency
What terms are most relevant for a document
Term Frequency – how often a word appears in a doc
Document Frequency – how often a word appears in ALL DOCS (this can let us get
rid of words used everywhere like “and”, “But”)
TF / DF or TF * IDF. IDF = 1/DF (in practice usually log-scaled: IDF = log(number of docs / DF))
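A minimal TF-IDF sketch using the common log-scaled IDF, log(N / DF), rather than the plain 1/DF shorthand. Documents are pre-tokenised lists of words, the corpus is invented, and the term is assumed to appear in at least one document.

```python
import math

def tf_idf(term, doc, corpus):
    """TF-IDF of `term` in `doc`, where corpus is a list of token lists."""
    tf = doc.count(term) / len(doc)              # how often in this doc
    df = sum(1 for d in corpus if term in d)     # how many docs contain it
    idf = math.log(len(corpus) / df)
    return tf * idf

corpus = [
    ["i", "love", "exams", "and", "coffee"],
    ["exams", "and", "stress"],
    ["cats", "and", "dogs"],
]
score_and = tf_idf("and", corpus[0], corpus)    # appears everywhere
score_love = tf_idf("love", corpus[0], corpus)  # appears in one doc only
```

A word used in every document scores 0, which is how words like "and" get filtered out.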
Unigrams, Bigrams, n-grams
I love certification exams
unigrams = every word individually
Bi-grams = every two consecutive words “i love” “love certification” “certification
exams”
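The n-gram idea in a few lines, assuming simple whitespace tokenisation and lowercasing.

```python
def ngrams(text, n):
    """Return every run of n consecutive words in the text."""
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

unigrams = ngrams("I love certification exams", 1)
bigrams = ngrams("I love certification exams", 2)
```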
Ensemble Learning
Bagging – generate new training sets with random sampling with replacement
Boosting – training sets will have weights, and as we retrain the weights will change
Boosting generally has better accuracy, Bagging avoids overfitting
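A sketch of the sampling step behind bagging: each new training set is drawn from the original with replacement, so some records repeat and some are left out. The seed is fixed only to make the sketch repeatable.

```python
import random

def bootstrap_sample(data, rng):
    """Draw len(data) items with replacement (one bagging training set)."""
    return rng.choices(data, k=len(data))

rng = random.Random(0)
data = list(range(10))
bags = [bootstrap_sample(data, rng) for _ in range(3)]
```

One model is trained per bag and their predictions are averaged or voted on.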
ML and DL frameworks
Keras is an easier way to access Tensorflow (Google)
AWS is MXNet and Gluon. Gluon is an abstraction of MXNet (like Keras)
Pytorch and Scikit learn
Tensorflow
Graph object is like an array that you populate with code lines from top to bottom
You store constants, variables with no assigned value yet (placeholder), into the
default graph of tensorflow
You can add operations like multiply, add to the graph sequentially
You can run the graph using Sessions (this is the TensorFlow 1.x style; TensorFlow 2 runs operations eagerly by default)
PyTorch
Create a tensor which is a multi dimensional array with zeros
You can just do simple operations like add, multiply on those matrices directly
You need to add requires_grad=True to add memory to store the order of operations
that were done for back propagation purposes
Graph is created on the fly
MXNet
Graph is created on the fly
Similar to PyTorch, turn on the autograd feature to get backprop
Scikit Learn
Has lots of built in datasets to use
Support for less popular models. The other focus more on neural nets
Pandas
Dataframe = table
Manipulate data
AWS
S3
Use as a “Data lake”
Kinesis->S3->Athena->SageMaker
Athena perform queries against S3 using SQL
AWS Glue service
Security and encryption
Glue
Athena
SQL interface into S3 data lakes
Works on many different formats, json, csv, parquet etc
Save query results back into S3 – preprocess data before ML
After running a query, a csv file is created in an auto-generated bucket which stores
the query result
Can save queries
Serverless
Can create tables or views from queries
Works well with Glue Catalog
Amazon Quicksight
Visualize data sources, end user targeted. Should use federated auth
Feels less like a core AWS service: it has its own sign-up and user management
Create dashboards, reports, graphs
Quicksight can natively connect to many AWS areas for data use
EC2 for ML
Use Compute Optimized or Accelerated Computing (GPU instances)
There’s a class of ml.* instances but those are only available from SageMaker
Lots of AMIs preloaded with machine learning languages and libraries
Conda libraries
Base AMIs with GPU libraries
You must request limit increases to use any ML suitable compute instances
Batch
Docker images, serverless
Polly
Text to speech
Many languages, Female or male voices
Upload a lexicon to customize pronunciation (read full acronyms)
Can pass text in, in SSML format “speech synthesis markup language”, looks like
XML. Like you can put in whisper effects, pauses
Transcribe
Speech to Text (ASR – Automatic speech recognition)
Real time or analyse pre-recorded files
You can create custom vocabularies
Put words in a text file, specify the language and upload it (or put it into S3)
You can create transcription jobs
Speaker identification
Translate
Batch or real time
Supports custom terminology: you can pass in dictionaries in csv or tmx format
Comprehend
Text analysis (NLP)
Can train on our own data
Features
Keyphrase extraction
Sentiment analysis (+ve, -ve, neutral, mixed)
Syntax analysis (separate into pronouns, verbs, adjectives etc)
Entity recognition (names, organizations, dates)
Custom entities
Language detection
Custom classification (provide training data)
Topic modelling
Multi language support
Lex
Powers Alexa
Conversation interface service for chatbots
Tries to understand intent from your speech
Create a “bot”, can output voice or output text
Then create Utterances (training data) and “Intent” (labels)
Forecast
Time series forecasting
Sagemaker
Build train and deploy ML models (3 stages of Sagemaker)
Fully managed service
End to end lifecycle of ML
Lots of managed algorithms, we just choose hyperparameters
Can access/control Sagemaker using the console (web), API (boto3), Python SDK
and Jupyter notebooks
Notebooks
Have Notebook instance types, with ml. prefix
You can still spin up other instances, you’re not tied to your Notebook Instance Type
You can give access to S3 buckets
The notebook is not attached to your VPC by default; you have to configure that yourself
You access the notebook instance through a presigned url
Lifecycle configurations are used to run bash commands that run before your
notebook instance starts
Sagemaker Build
Data Preprocessing
Visualize your data (notebooks)
Explore data
Feature engineering
Synthesize (generate more training data for certain labels if we have less cases)
Convert, e.g. images to recordIO, csv into something else
Split data (validate, test, train)
Structure
Algorithms
3 sources for Sagemaker
Built-in to Sagemaker
AWS Marketplace
Custom
Linear Learner
Can do regression and classification
Input: RecordIO (preferred), CSV. Inputs can be pipe (faster) or file mode
Must normalize data first
BlazingText
Text classification, sentiment analysis, etc
Used by Amazon Comprehend probably
2 modes: Word2vec or text classification
Object2Vec
turn objects into features
Unsupervised, figure out similarity between objects
Image Classification Alg (object detection is bounding boxes)
Conv NN
Image recognition (possibly powers Amazon Rekognition)
Can use “transfer learning” i.e. build on an existing model
K-Means
Web scale k-means clustering algorithm
Find discrete groupings within data (unsupervised algorithm)
Latent Dirichlet Allocation (LDA)
Text analysis, topic discovery
Unsupervised
Amazon Comprehend
Sagemaker Train
Architecture
ECS + docker images
Can create our own images
Docker image structure: /opt/ml/code, /opt/ml/model
S3 (Training data) – or elastic file system, or FSx for Lustre
Has “Channels” that need to be defined e.g. train, validation, model
Channel tells what kind of data this is?
EC2 instances (ML class) – we can’t get into the OS of these
P2 family is GPU
Sometimes can elastically attach GPUs to an instance
There’s “spot instances” for training called “managed spot training”
Can keep state using “checkpoints” in s3 if your instance is destroyed. It stops
gracefully
Hyperparameter Tuning
Sagemaker auto parameter tuning as a service
Choose an algorithm -> Set ranges of hyperparameters -> Choose metric to
measure (e.g. maximize area under curve)
Sagemaker will run a whole bunch of training models in parallel. There is a “tuning
model” looking at the hyperparams
Sagemaker Deploy
Inference Pipelines
Chaining models together
Pass output of one model to be used as input to another model
Deploy
Create a model definition
Choose an IAM role, pass in the “training image” (ECR docker container) from the
“Training Job”, the model S3 location
Create an Endpoint configuration
Point it to the model definition
Create the Endpoint
Name, choose the endpoint configuration
We use the Endpoint to make inferences. Endpoints can’t be accessed
publicly. You can access them from Lambda, or the CLI, with the “aws sagemaker-runtime
invoke-endpoint….” command
After invoking, the output is probably just an array of numbers, labels, whatever
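A hedged sketch of invoking an endpoint from Python with boto3's `sagemaker-runtime` client, the SDK counterpart of the CLI command above. The endpoint name and the CSV payload format are assumptions; the content type your model expects depends on how it was built. boto3 is imported inside the function so the payload helper can be tried without AWS credentials.

```python
def build_csv_payload(features):
    """Serialise one record as a CSV line (assumes the model accepts text/csv)."""
    return ",".join(str(v) for v in features)

def invoke(endpoint_name, features):
    """Send one record to a SageMaker endpoint and return the raw response text."""
    import boto3  # needs boto3 installed and AWS credentials configured

    runtime = boto3.client("sagemaker-runtime")
    response = runtime.invoke_endpoint(
        EndpointName=endpoint_name,   # hypothetical, e.g. "my-endpoint"
        ContentType="text/csv",
        Body=build_csv_payload(features),
    )
    return response["Body"].read().decode("utf-8")

payload = build_csv_payload([5.1, 3.5, 1.4])
```

The same `invoke` function could sit inside a Lambda handler behind API Gateway.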
Accessing SageMaker Endpoints from an App
AWS api/sdk -> SageMaker Endpoint is one way
API Gateway -> Lambda -> SageMaker Endpoint is another
Security
SageMaker Notebooks
IAM policy – CreatePresignedNotebookInstanceUrl –
Give the notebook root access (server access)? Set this during creation; the default is true.
Lifecycle scripts run as root.
SageMaker instance profiles, e.g. to grant permissions to S3
SageMaker doesn’t support resource-based policies e.g. an S3 bucket policy
From Notebooks, we can see S3 buckets, and files, but we can’t copy them by
default
SageMaker VPCs
The default is a public VPC i.e. access to internet
If we are in a private VPC, we need a S3 VPC endpoint to access S3
Other
Horovod or parameter servers – how to do distributed training in tensorflow
Production variants – how to do a/b (% traffic to A, % to B) testing of models using
production data
SageMaker Neo is a compiler that lets you run models on different hardware architectures
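For production variants, each variant's share of traffic is its weight divided by the sum of all variant weights. A quick sketch of that arithmetic, with made-up variant names and weights:

```python
def traffic_share(variant_weights):
    """Map each variant name to its fraction of endpoint traffic."""
    total = sum(variant_weights.values())
    return {name: weight / total for name, weight in variant_weights.items()}

# Hypothetical a/b test: send 90% to the current model, 10% to the candidate
shares = traffic_share({"model-a": 9, "model-b": 1})
```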
Create data repositories for machine learning. Identify and implement a data-
ingestion solution. Identify and implement a data-transformation solution. Opinion:
IMHO this domain should be reduced to 15% or even 10%. I found the questions
pretty repetitive, and they were about Big Data, not about Machine Learning. If
you’ve already passed the Big Data Specialty certification, you’ll be fine. If not, make
sure you’re very familiar with Kinesis and its different flavours, or you’ll have a
miserable time.
Sanitize and prepare data for modeling. Perform feature engineering. Analyze and
visualize data for machine learning. Opinion: typical Data Science stuff, not really tied
to any particular AWS service. Cleaning data, handling missing values, performing
basic feature engineering. If you have hands-on ML experience, this won’t be a
problem at all. Questions don’t go very deep. I was surprised to get a few questions
on data viz, most of them pretty vague and awkward to answer without looking at
any actual data. IMHO they should be dropped and replaced with more questions on
feature engineering.
Query logging using Athena
CloudTrail integration with Athena
Amazon Macie
Glacier Vault Lock
QuickSight
Different e-mail protocols and their secure ports:
https://www.siteground.com/tutorials/email/protocols-pop3-smtp-imap/
White papers
https://d1.awsstatic.com/whitepapers/Security/AWS_Security_Whitepaper.pdf
https://d1.awsstatic.com/whitepapers/aws-kms-best-practices.pdf
https://d0.awsstatic.com/whitepapers/compliance/AWSSecurityatScaleLogginginAWS_Whitepaper.pdf
https://d1.awsstatic.com/whitepapers/architecture/AWS-Security-Pillar.pdf
https://d1.awsstatic.com/whitepapers/Security/DDoS_White_Paper.pdf
https://d1.awsstatic.com/whitepapers/Security/Secure_content_delivery_with_CloudFront_whitepaper.pdf
https://d0.awsstatic.com/whitepapers/compliance/AWS_Security_at_Scale_Logging_in_AWS_Whitepaper.pdf
https://docs.aws.amazon.com/sagemaker/latest/dg/whatis.html
https://aws.amazon.com/training/learning-paths/machine-learning/exam-preparation/
Machine learning
Model parameter :
Parameters are those which would be learned by the machine like Weights and
Biases.
Hyperparameter :
Hyper-parameters are those which we supply to the model, for example: number of
hidden Nodes and Layers, input features, Learning Rate, Activation Function etc. in a
Neural Network
https://machinelearningmastery.com/difference-between-a-parameter-and-a-hyperparameter/
Learning Rate :
Determines the size of the step taken during gradient descent optimization, between
0 and 1
Batch Size :
Epochs :
The number of times that the algorithm will process the entire training data.
Each epoch contains one or more batches
Each epoch should see the model get closer to the desired state
Usually a high number :
10,100,1000 and up
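How epochs and batches fit together, as simple arithmetic. The sample count, batch size, and epoch count are example values only.

```python
import math

def iterations_per_epoch(n_samples, batch_size):
    """Batches needed to see every training sample once (last may be partial)."""
    return math.ceil(n_samples / batch_size)

def total_updates(n_samples, batch_size, epochs):
    """Total weight updates over the whole training run."""
    return iterations_per_epoch(n_samples, batch_size) * epochs

per_epoch = iterations_per_epoch(1000, 32)
updates = total_updates(1000, 32, epochs=10)
```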