ML Notes

ML is the process of training a model to make predictions from data. There are four main types of ML: supervised learning, which learns from labeled examples; unsupervised learning, which finds patterns without labels; reinforcement learning, which learns through rewards and penalties; and generative AI, which creates new content. Supervised learning is used for classification and regression tasks.

What is ML?

ML is the process of training a piece of software, called a model, to make useful predictions or
generate content from data.

Types of ML Systems
ML systems fall into one or more of the following categories based on how they
learn to make predictions or generate content:

• Supervised learning
• Unsupervised learning
• Reinforcement learning
• Generative AI

Supervised learning
Supervised learning models can make predictions after seeing lots of data with the
correct answers and then discovering the connections between the elements in the
data that produce the correct answers. This is like a student learning new material by
studying old exams that contain both questions and answers. Once the student has
trained on enough old exams, the student is well prepared to take a new exam.
These ML systems are "supervised" in the sense that a human gives the ML system
data with the known correct results.

Two of the most common use cases for supervised learning are regression and
classification.

What is (Supervised) Machine Learning?

ML systems learn how to combine input to produce useful predictions on never-before-seen data.

Regression
A regression model predicts a numeric value. For example, a weather model that
predicts the amount of rain, in inches or millimeters, is a regression model.

See the table below for more examples of regression models:

Scenario: Future house price
Possible input data: Square footage, zip code, number of bedrooms and bathrooms, lot size, mortgage interest rate, property tax rate, construction costs, and number of homes for sale in the area.
Numeric prediction: The price of the home.

Scenario: Future ride time
Possible input data: Historical traffic conditions (gathered from smartphones, traffic sensors, ride-hailing and other navigation applications), distance from destination, and weather conditions.
Numeric prediction: The time in minutes and seconds to arrive at a destination.

Classification

Classification models predict the likelihood that something belongs to a category.


Unlike regression models, whose output is a number, classification models output a
value that states whether or not something belongs to a particular category. For
example, classification models are used to predict if an email is spam or if a photo
contains a cat.

Classification models are divided into two groups: binary classification and
multiclass classification. Binary classification models output a value from a class
that contains only two values, for example, a model that outputs either rain or no rain.
Multiclass classification models output a value from a class that contains more than
two values, for example, a model that can output either rain, hail, snow, or sleet.

Unsupervised learning
Unsupervised learning models make predictions by being given data that does not
contain any correct answers. An unsupervised learning model's goal is to identify
meaningful patterns among the data. In other words, the model has no hints on how
to categorize each piece of data, but instead it must infer its own rules.

A commonly used unsupervised learning model employs a technique called clustering. The model finds data points that demarcate natural groupings.

Clustering differs from classification because the categories aren't defined by you. For
example, an unsupervised model might cluster a weather dataset based on temperature,
revealing segmentations that define the seasons. You might then attempt to name those
clusters based on your understanding of the dataset.
What distinguishes a supervised approach from an unsupervised approach?

A supervised approach is given data that contains the correct answer. The model's job is to find connections in the data that produce the correct answer. An unsupervised approach is given data without the correct answer. Its job is to find groupings in the data.

Reinforcement learning
Reinforcement learning models make predictions by getting rewards or penalties
based on actions performed within an environment. A reinforcement learning system
generates a policy that defines the best strategy for getting the most rewards.

Reinforcement learning is used to train robots to perform tasks, like walking around
a room, and software programs like AlphaGo to play the game of Go.

Generative AI
Generative AI is a class of models that creates content from user input. For example,
generative AI can create unique images, music compositions, and jokes; it can
summarize articles, explain how to perform a task, or edit a photo.
Model: Text-to-text
Input: Who invented the Le Mans race?
Example output: The 24 Hours of Le Mans was founded by the Automobile Club de l'Ouest (ACO), a motoring organization founded in 1906. The first race was held on 26 and 27 May 1923 on the Sarthe Circuit, a 13.426-kilometre (8.345-mile) road circuit located near the town of Le Mans in France. The race was originally conceived as a way to promote the ACO and the Sarthe region.
Source: Bard

Model: Text-to-image
Input: An alien octopus floats through a portal reading a newspaper.
Example output: [generated image]
Source: Imagen

Model: Text-to-video
Input: A photorealistic teddy bear is swimming in the ocean at San Francisco. The teddy bear goes under water. The teddy bear keeps swimming under the water with colorful fishes. A panda bear is swimming under water.
Example output: [generated video]
Source: Phenaki

Model: Text-to-code
Input: Write a Python loop that loops over a list of numbers and prints the prime numbers.
Example output:

    # Illustrative list; the original output assumed `numbers` was already defined.
    numbers = [2, 3, 4, 5, 6, 7, 8, 9, 10]
    for number in numbers:
        # Check if the number is prime (primes are greater than 1).
        is_prime = number > 1
        for i in range(2, number):
            if number % i == 0:
                is_prime = False
                break
        # If the number is prime, print it.
        if is_prime:
            print(number)

Source: Bard

Model: Image-to-text
Input: [image of a flamingo]
Example output: This is a flamingo. They are found in the Caribbean.
Source: Google DeepMind

Generative AI can take a variety of inputs and create a variety of outputs, like text,
images, audio, and video. It can also take and create combinations of these. For
example, a model can take an image as input and create an image and text as
output, or take an image and text as input and create a video as output.

We can discuss generative models by their inputs and outputs, typically written as
"type of input"-to-"type of output." For example, the following is a partial list of some
inputs and outputs for generative models:

• Text-to-text
• Text-to-image
• Text-to-video
• Text-to-code
• Text-to-speech
• Image and text-to-image

The preceding table lists examples of generative models, their input, and an example of their possible output.

How does generative AI work? At a high level, generative models learn patterns in data with the goal of producing new but similar data. Generative models are like the following:

• Comedians who learn to imitate others by observing people's behaviors and style of speaking
• Artists who learn to paint in a particular style by studying lots of paintings in that style
• Cover bands that learn to sound like a specific music group by listening to lots of music by that group

To produce unique and creative outputs, generative models are initially trained using
an unsupervised approach, where the model learns to mimic the data it's trained on.
The model is sometimes trained further using supervised or reinforcement learning
on specific data related to tasks the model might be asked to perform, for example,
summarize an article or edit a photo.

Generative AI is a quickly evolving technology with new use cases constantly being
discovered. For example, generative models are helping businesses refine their
ecommerce product images by automatically removing distracting backgrounds or
improving the quality of low-resolution images.

Supervised learning's tasks are well-defined and can be applied to a multitude of scenarios, like identifying spam or predicting precipitation.

Foundational supervised learning concepts


Supervised machine learning is based on the following core concepts:

• Data
• Model
• Training
• Evaluating
• Inference

Data

Data is the driving force of ML. Data comes in the form of words and numbers
stored in tables, or as the values of pixels and waveforms captured in images and
audio files. We store related data in datasets. For example, we might have a dataset
of the following:

• Images of cats
• Housing prices
• Weather information

Datasets are made up of individual examples that contain features and a label. You
could think of an example as analogous to a single row in a spreadsheet. Features
are the values that a supervised model uses to predict the label. The label is the
"answer," or the value we want the model to predict. In a weather model that predicts
rainfall, the features could be latitude, longitude, temperature, humidity, cloud
coverage, wind direction, and atmospheric pressure. The label would be rainfall
amount.

Examples that contain both features and a label are called labeled examples.

Two labeled examples


In contrast, unlabeled examples contain features, but no label. After you create a
model, the model predicts the label from the features.

Two unlabeled examples

Dataset characteristics

A dataset is characterized by its size and diversity. Size indicates the number of
examples. Diversity indicates the range those examples cover. Good datasets are
both large and highly diverse.

Some datasets are both large and diverse. However, some datasets are large but
have low diversity, and some are small but highly diverse. In other words, a large
dataset doesn’t guarantee sufficient diversity, and a dataset that is highly diverse
doesn't guarantee sufficient examples.

For instance, a dataset might contain 100 years' worth of data, but only for the month
of July. Using this dataset to predict rainfall in January would produce poor
predictions. Conversely, a dataset might cover only a few years but contain every
month. This dataset might produce poor predictions because it doesn't contain
enough years to account for variability.

A dataset can also be characterized by the number of its features. For example,
some weather datasets might contain hundreds of features, ranging from satellite
imagery to cloud coverage values. Other datasets might contain only three or four
features, like humidity, atmospheric pressure, and temperature. Datasets with more
features can help a model discover additional patterns and make better predictions.
However, datasets with more features don't always produce models that make better
predictions because some features might have no causal relationship to the label.

Model

In supervised learning, a model is the complex collection of numbers that define the
mathematical relationship from specific input feature patterns to specific output
label values. The model discovers these patterns through training.

Training

Before a supervised model can make predictions, it must be trained. To train a model, we give the model a dataset with labeled examples. The model's goal is to
work out the best solution for predicting the labels from the features. The model
finds the best solution by comparing its predicted value to the label's actual value.
Based on the difference between the predicted and actual values—defined as
the loss—the model gradually updates its solution. In other words, the model learns
the mathematical relationship between the features and the label so that it can make
the best predictions on unseen data.

For example, if the model predicted 1.15 inches of rain, but the actual value was 0.75 inches, the model modifies its solution so its prediction is closer to 0.75 inches. After
the model has looked at each example in the dataset—in some cases, multiple
times—it arrives at a solution that makes the best predictions, on average, for each
of the examples.

The following demonstrates training a model:

1. The model takes in a single labeled example and provides a prediction.

Figure 1. An ML model making a prediction from a labeled example.

2. The model compares its predicted value with the actual value and updates its
solution.
Figure 2. An ML model updating its predicted value.

3. The model repeats this process for each labeled example in the dataset.

Figure 3. An ML model updating its predictions for each labeled example in the training dataset.

In this way, the model gradually learns the correct relationship between the features
and the label. This gradual understanding is also why large and diverse datasets
produce a better model. The model has seen more data with a wider range of values
and has refined its understanding of the relationship between the features and the
label.
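To make this predict-compare-update cycle concrete, here is a minimal sketch of training a one-feature linear model with plain gradient descent. It is not from the course; the data, learning rate, and iteration count are illustrative assumptions.

    # Minimal sketch of the predict/compare/update loop for a one-feature
    # linear model y' = w*x + b, trained with squared loss.
    features = [1.0, 2.0, 3.0, 4.0]   # e.g., humidity readings (assumed data)
    labels = [0.9, 2.1, 2.9, 4.2]     # e.g., rainfall amounts (assumed data)

    w, b = 0.0, 0.0        # the model's internal parameters, learned below
    learning_rate = 0.05   # illustrative value

    for epoch in range(200):
        for x, y in zip(features, labels):
            y_pred = w * x + b                    # 1. make a prediction
            error = y_pred - y                    # 2. compare to the label
            w -= learning_rate * 2 * error * x    # 3. update the solution
            b -= learning_rate * 2 * error        #    (gradient of squared loss)

    print(f"learned w = {w:.2f}, b = {b:.2f}")    # approaches w ≈ 1, b ≈ 0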

During training, ML practitioners can make subtle adjustments to the configurations and features the model uses to make predictions. For example, certain features have more predictive power than others. Therefore, ML practitioners can select which features the model uses during training. For example, suppose a weather dataset contains time_of_day as a feature. In this case, an ML practitioner can add or remove time_of_day during training to see whether the model makes better predictions with or without it.

Evaluating

We evaluate a trained model to determine how well it learned. When we evaluate a model, we use a labeled dataset, but we only give the model the dataset's features. We then compare the model's predictions to the label's true values.
Figure 4. Evaluating an ML model by comparing its predictions to the actual values.
Depending on the model's predictions, we might do more training and evaluating
before deploying the model in a real-world application.

The following questions help you solidify your understanding of core ML concepts.

Predictive power
Supervised ML models are trained using datasets with labeled examples. The model
learns how to predict the label from the features. However, not every feature in a
dataset has predictive power. In some instances, only a few features act as
predictors of the label. In the dataset below, use price as the label and the remaining
columns as the features.

Which three features do you think are likely the greatest predictors for a car's price?

• Tire_size, wheel_base, year.
• Make_model, year, miles. (Correct answer. A car's make/model, year, and miles are likely to be among the strongest predictors for its price.)
• Color, height, make_model.
• Miles, gearbox, make_model.

Supervised and unsupervised learning


Based on the problem, you'll use either a supervised or unsupervised approach. For
example, if you know beforehand the value or category you want to predict, you'd use
supervised learning. However, if you wanted to learn if your dataset contains any
segmentations or groupings of related examples, you'd use unsupervised learning.

Suppose you had a dataset of users for an online shopping website, and it contained
the following columns:

If you wanted to understand the types of users that visit the site, would you use supervised or unsupervised learning?

• Supervised learning, because I'm trying to predict which class a user belongs to.
• Unsupervised learning. (Correct answer. Because we want the model to cluster groups of related customers, we'd use unsupervised learning. After the model clustered the users, we'd create our own names for each cluster, for example, "discount seekers," "deal hunters," "surfers," "loyal," and "wanderers.")

Suppose you had an energy usage dataset for homes with the following columns:

What type of ML would you use to predict the kilowatt hours used per year for a newly constructed house?

• Unsupervised learning. (Try again. Unsupervised learning uses unlabeled examples. In this example, "kilowatt hours used per year" would be the label because this is the value you want the model to predict.)
• Supervised learning. (Correct answer. Supervised learning trains on labeled examples. In this dataset, "kilowatt hours used per year" would be the label because this is the value you want the model to predict. The features would be "square footage," "location," and "year built.")

Suppose you had a flight dataset with the following columns:

If you wanted to predict the cost of a coach ticket, would you use regression or classification?

• Regression. (Correct answer. A regression model's output is a numeric value.)
• Classification.

Based on the dataset, could you train a classification model to classify the cost of a coach ticket as "high," "average," or "low"?

• No. It's not possible to create a classification model. The coach_ticket_cost values are numeric, not categorical.
• Yes, but we'd first need to convert the numeric values in the coach_ticket_cost column to categorical values. (Correct answer. It's possible to create a classification model from the dataset. You would do something like the following:

1. Find the average cost of a ticket from the departure airport to the destination airport.
2. Determine the thresholds that would constitute "high," "average," and "low."
3. Compare the predicted cost to the thresholds and output the category the value falls within.)

• No. Classification models only predict two categories, like spam or not_spam. This model would need to predict three categories.

Training and evaluating


After we've trained a model, we evaluate it by using a dataset with labeled examples
and compare the model's predicted value to the label's actual value.

If the model's predictions are far off, what might you do to make them better? (Select the two best answers.)

• Retrain the model, but use only the features you believe have the strongest predictive power for the label. (Correct. Retraining the model with fewer features, but ones that have more predictive power, can produce a model that makes better predictions.)
• Try a different training approach. For example, if you used a supervised approach, try an unsupervised approach.
• Retrain the model using a larger and more diverse dataset. (Correct. Models trained on datasets with more examples and a wider range of values can produce better predictions because the model has a better generalized solution for the relationship between the features and the label.)
• You can't fix a model whose predictions are far off.

You're now ready to take the next step in your ML journey:

• People + AI Guidebook. If you're looking for a set of methods, best practices, and examples presented by Googlers, industry experts, and academic research for using ML.
• Problem Framing. If you're looking for a field-tested approach for creating ML models and avoiding common pitfalls along the way.
• Machine Learning Crash Course. If you're ready for an in-depth and hands-on approach to learning more about ML.

Terminology: Labels and Features

• Label is the variable we're predicting
  • Typically represented by the variable y
• Example is a particular instance of data, x
  • Labeled example has {features, label}: (x, y)
    • Used to train the model
  • Unlabeled example has {features, ?}: (x, ?)
    • Used for making predictions on new data
• Model maps examples to predicted labels: y'
  • Defined by internal parameters, which are learned

What is (supervised) machine learning? Concisely put, it is the following:

• ML systems learn how to combine input to produce useful predictions on never-before-seen data.

Let's explore fundamental machine learning terminology.

Labels
A label is the thing we're predicting—the y variable in simple linear regression. The
label could be the future price of wheat, the kind of animal shown in a picture, the
meaning of an audio clip, or just about anything.

Features
A feature is an input variable—the x variable in simple linear regression. A simple
machine learning project might use a single feature, while a more sophisticated
machine learning project could use millions of features, specified as:

$x_1, x_2, \ldots, x_N$

In the spam detector example, the features could include the following:

• words in the email text
• sender's address
• time of day the email was sent
• email contains the phrase "one weird trick."
Examples
An example is a particular instance of data, x. (We put x in boldface to indicate that it
is a vector.) We break examples into two categories:

• labeled examples
• unlabeled examples

A labeled example includes both feature(s) and the label. That is:

labeled examples: {features, label}: (x, y)

Use labeled examples to train the model. In our spam detector example, the labeled
examples would be individual emails that users have explicitly marked as "spam" or
"not spam."

For example, the following table shows 5 labeled examples from a data set containing information about housing prices in California:

housingMedianAge (feature) | totalRooms (feature) | totalBedrooms (feature) | medianHouseValue (label)
15 | 5612 | 1283 | 66900
19 | 7650 | 1901 | 80100
17 | 720  | 174  | 85700
14 | 1501 | 337  | 73400
20 | 1454 | 326  | 65500

An unlabeled example contains features but not the label. That is:

unlabeled examples: {features, ?}: (x, ?)

Here are 3 unlabeled examples from the same housing dataset, which exclude medianHouseValue:

housingMedianAge (feature) | totalRooms (feature) | totalBedrooms (feature)
42 | 1686 | 361
34 | 1226 | 180
33 | 1077 | 271

Once we've trained our model with labeled examples, we use that model to predict
the label on unlabeled examples. In the spam detector, unlabeled examples are new
emails that humans haven't yet labeled.
Models
A model defines the relationship between features and label. For example, a spam
detection model might associate certain features strongly with "spam". Let's
highlight two phases of a model's life:

• Training means creating or learning the model. That is, you show the model
labeled examples and enable the model to gradually learn the relationships
between features and label.
• Inference means applying the trained model to unlabeled examples. That is,
you use the trained model to make useful predictions (y'). For example, during
inference, you can predict medianHouseValue for new unlabeled examples.

Regression vs. classification


A regression model predicts continuous values. For example, regression models
make predictions that answer questions like the following:

• What is the value of a house in California?


• What is the probability that a user will click on this ad?

A classification model predicts discrete values. For example, classification models make predictions that answer questions like the following:

• Is a given email message spam or not spam?


• Is this an image of a dog, a cat, or a hamster?

How do we reduce loss?

• Hyperparameters are the configuration settings used to tune how the model is trained.
• The derivative of $(y - y')^2$ with respect to the weights and biases tells us how loss changes for a given example
  • Simple to compute and convex
• So we repeatedly take small steps in the direction that minimizes loss
  • We call these Gradient Steps (but they're really negative Gradient Steps)
  • This strategy is called Gradient Descent
SGD & Mini-Batch Gradient Descent

• Could compute gradient over entire data set on each step, but this
turns out to be unnecessary
• Computing gradient on small data samples works well
• On every step, get a new random sample
• Stochastic Gradient Descent: one example at a time
• Mini-Batch Gradient Descent: batches of 10-1000
• Loss & gradients are averaged over the batch

In gradient descent, a batch is the set of examples you use to calculate the gradient
in a single training iteration. So far, we've assumed that the batch has been the entire
data set. When working at Google scale, data sets often contain billions or even
hundreds of billions of examples. Furthermore, Google data sets often contain huge
numbers of features. Consequently, a batch can be enormous. A very large batch
may cause even a single iteration to take a very long time to compute.

A large data set with randomly sampled examples probably contains redundant data.
In fact, redundancy becomes more likely as the batch size grows. Some redundancy
can be useful to smooth out noisy gradients, but enormous batches tend not to carry
much more predictive value than large batches.
What if we could get the right gradient on average for much less computation? By choosing examples at random from our data set, we could estimate (albeit, noisily) a big average from a much smaller one. Stochastic gradient descent (SGD) takes this idea to the extreme: it uses only a single example (a batch size of 1) per iteration. Given enough iterations, SGD works but is very noisy. The term "stochastic" indicates that the one example comprising each batch is chosen at random.

Mini-batch stochastic gradient descent (mini-batch SGD) is a compromise between full-batch iteration and SGD. A mini-batch is typically between 10 and 1,000 examples, chosen at random. Mini-batch SGD reduces the amount of noise in SGD but is still more efficient than full-batch.

To simplify the explanation, we focused on gradient descent for a single feature. Rest assured that gradient descent also works on feature sets that contain multiple features.
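As a rough illustration of these ideas, the following NumPy sketch trains a small linear model with mini-batch SGD. The synthetic data, batch size, and step count are assumptions chosen for demonstration, not recommendations.

    import numpy as np

    rng = np.random.default_rng(0)

    # Synthetic regression data (assumed): 10,000 examples, 3 features.
    X = rng.normal(size=(10_000, 3))
    true_w = np.array([2.0, -1.0, 0.5])
    y = X @ true_w + rng.normal(scale=0.1, size=10_000)

    w = np.zeros(3)
    learning_rate = 0.1
    batch_size = 100   # mini-batch SGD; a batch size of 1 would be plain SGD

    for step in range(500):
        # On every step, draw a new random sample of examples.
        idx = rng.choice(len(X), size=batch_size, replace=False)
        X_batch, y_batch = X[idx], y[idx]

        # The gradient of squared loss is averaged over the batch.
        error = X_batch @ w - y_batch
        gradient = 2 * X_batch.T @ error / batch_size

        # Take a small step in the direction that minimizes loss.
        w -= learning_rate * gradient

    print(w)   # should land close to [2.0, -1.0, 0.5]

Each step touches only 100 examples rather than all 10,000, which is exactly the computational saving the section describes.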

When performing gradient descent on a large data set, which of the following batch
sizes will likely be more efficient?
A small batch or even a batch of one example (SGD).
Amazingly enough, performing gradient descent on a small batch or even a batch of one
example is usually more efficient than the full batch. After all, finding the gradient of
one example is far cheaper than finding the gradient of millions of examples. To ensure
a good representative sample, the algorithm scoops up another random small batch (or
batch of one) on every iteration.

NumPy and pandas


Using tf.keras requires at least a little understanding of the following two open-source Python libraries:

• NumPy, which simplifies representing arrays and performing linear algebra operations.
• pandas, which provides an easy way to represent datasets in memory.

If you are unfamiliar with NumPy or pandas, please begin by doing the following two
Colab exercises:

1. NumPy UltraQuick Tutorial Colab exercise, which provides all the NumPy information you need for this course.
2. pandas UltraQuick Tutorial Colab exercise, which provides all the pandas information you need for this course.

A note on batch size and epochs (see the sketch after this list):

• For example, if the batch size is 6, then the system recalculates the model's loss value and adjusts the model's weights and bias after processing every 6 examples.
• One epoch spans sufficient iterations to process every example in the dataset. For example, if the batch size is 12, then each epoch lasts one iteration. However, if the batch size is 6, then each epoch consumes two iterations.
• It is tempting to simply set the batch size to the number of examples in the dataset (12, in this case). However, the model might actually train faster on smaller batches. Conversely, very small batches might not contain enough information to help the model converge.
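A quick sketch of the iteration arithmetic described above, using the 12-example dataset from the note as an assumption:

    import math

    dataset_size = 12   # examples in the dataset (from the note above)

    for batch_size in (12, 6, 1):
        # One epoch spans enough iterations to process every example once.
        iterations_per_epoch = math.ceil(dataset_size / batch_size)
        print(f"batch size {batch_size:2d} -> {iterations_per_epoch} iteration(s) per epoch")

    # batch size 12 -> 1 iteration(s) per epoch
    # batch size  6 -> 2 iteration(s) per epoch
    # batch size  1 -> 12 iteration(s) per epoch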

Summary of hyperparameter tuning


Most machine learning problems require a lot of hyperparameter tuning.
Unfortunately, we can't provide concrete tuning rules for every model. Lowering the
learning rate can help one model converge efficiently but make another model
converge much too slowly. You must experiment to find the best set of
hyperparameters for your dataset. That said, here are a few rules of thumb:

• Training loss should steadily decrease, steeply at first, and then more slowly
until the slope of the curve reaches or approaches zero.
• If the training loss does not converge, train for more epochs.
• If the training loss decreases too slowly, increase the learning rate. Note that
setting the learning rate too high may also prevent training loss from
converging.
• If the training loss varies wildly (that is, the training loss jumps around),
decrease the learning rate.
• Lowering the learning rate while increasing the number of epochs or the batch
size is often a good combination.
• Setting the batch size to a very small batch number can also cause instability.
First, try large batch size values. Then, decrease the batch size until you see
degradation.
• For real-world datasets consisting of a very large number of examples, the
entire dataset might not fit into memory. In such cases, you'll need to reduce
the batch size to enable a batch to fit into memory.

Remember: the ideal combination of hyperparameters is data dependent, so you must always experiment and verify.

Generalization refers to your model's ability to adapt properly to new, previously unseen data,
drawn from the same distribution as the one used to create the model.

Generalization: Peril of Overfitting



This module focuses on generalization. In order to develop some intuition about this
concept, you're going to look at three figures. Assume that each dot in these figures
represents a tree's position in a forest. The two colors have the following meanings:

• The blue dots represent sick trees.

• The orange dots represent healthy trees.

With that in mind, take a look at Figure 1.

Figure 1. Sick (blue) and healthy (orange) trees.

Can you imagine a good model for predicting subsequent sick or healthy trees? Take
a moment to mentally draw an arc that divides the blues from the oranges, or
mentally lasso a batch of oranges or blues. Then, look at Figure 2, which shows how
a certain machine learning model separated the sick trees from the healthy trees.
Note that this model produced a very low loss.

Figure 2. A complex model that separates the sick trees from the healthy trees with very low training loss.

Low loss, but still a bad model?


Figure 3 shows what happened when we added new data to the model. It turned out
that the model adapted very poorly to the new data. Notice that the model
miscategorized much of the new data.

Figure 3. The model did a bad job predicting new data.

The model shown in Figures 2 and 3 overfits the peculiarities of the data it trained
on. An overfit model gets a low loss during training but does a poor job predicting
new data. If a model fits the current sample well, how can we trust that it will make
good predictions on new data? As you'll see later on, overfitting is caused by making
a model more complex than necessary. The fundamental tension of machine
learning is between fitting our data well, but also fitting the data as simply as
possible.

Machine learning's goal is to predict well on new data drawn from a (hidden) true
probability distribution. Unfortunately, the model can't see the whole truth; the model
can only sample from a training data set. If a model fits the current examples well,
how can you trust the model will also make good predictions on never-before-seen
examples?

William of Ockham, a 14th century friar and philosopher, loved simplicity. He believed that
scientists should prefer simpler formulas or theories over more complex ones. To put
Ockham's razor in machine learning terms:

The less complex an ML model, the more likely that a good empirical result is not
just due to the peculiarities of the sample.
In modern times, we've formalized Ockham's razor into the fields of statistical learning theory and computational learning theory. These fields have developed generalization bounds, a statistical description of a model's ability to generalize to new data based on factors such as:

• the complexity of the model

• the model's performance on training data

While theoretical analysis provides formal guarantees under idealized assumptions, these bounds can be difficult to apply in practice. Machine Learning Crash Course focuses instead on empirical evaluation to judge a model's ability to generalize to new data.

A machine learning model aims to make good predictions on new, previously unseen
data. But if you are building a model from your data set, how would you get the
previously unseen data? Well, one way is to divide your data set into two subsets:

• training set—a subset to train a model.

• test set—a subset to test the model.

Good performance on the test set is a useful indicator of good performance on the
new data in general, assuming that:

• The test set is large enough.

• You don't cheat by using the same test set over and over.

The ML fine print


The following three basic assumptions guide generalization:

• We draw examples independently and identically (i.i.d.) at random from the distribution. In other words, examples don't influence each other. (An alternate explanation: i.i.d. is a way of referring to the randomness of variables.)

• The distribution is stationary; that is, the distribution doesn't change within the data set.

• We draw examples from partitions from the same distribution.

In practice, we sometimes violate these assumptions. For example:

• Consider a model that chooses ads to display. The i.i.d. assumption would be
violated if the model bases its choice of ads, in part, on what ads the user has
previously seen.

• Consider a data set that contains retail sales information for a year. Users' purchases change seasonally, which would violate stationarity.

When we know that any of the preceding three basic assumptions are violated, we
must pay careful attention to metrics.
Training and Test Sets: Splitting Data

The previous module introduced the idea of dividing your data set into two subsets:

• training set—a subset to train a model.

• test set—a subset to test the trained model.

You could imagine slicing the single data set as follows:

Figure 1. Slicing a single data set into a training set and test set.

Make sure that your test set meets the following two conditions:

• Is large enough to yield statistically meaningful results.

• Is representative of the data set as a whole. In other words, don't pick a test set with
different characteristics than the training set.

Assuming that your test set meets the preceding two conditions, your goal is to
create a model that generalizes well to new data. Our test set serves as a proxy for
new data. For example, consider the following figure. Notice that the model learned
for the training data is very simple. This model doesn't do a perfect job—a few
predictions are wrong. However, this model does about as well on the test data as it
does on the training data. In other words, this simple model does not overfit the
training data.

Figure 2. Validating the trained model against test data.

Never train on test data. If you are seeing surprisingly good results on your
evaluation metrics, it might be a sign that you are accidentally training on the test
set. For example, high accuracy might indicate that test data has leaked into the
training set.

For example, consider a model that predicts whether an email is spam, using the
subject line, email body, and sender's email address as features. We apportion the
data into training and test sets, with an 80-20 split. After training, the model achieves
99% precision on both the training set and the test set. We'd expect a lower precision
on the test set, so we take another look at the data and discover that many of the
examples in the test set are duplicates of examples in the training set (we neglected
to scrub duplicate entries for the same spam email from our input database before
splitting the data). We've inadvertently trained on some of our test data, and as a
result, we're no longer accurately measuring how well our model generalizes to new
data.

Validation Set
Partitioning a data set into a training set and test set lets you
judge whether a given model will generalize well to new data.
However, using only two partitions may be insufficient when
doing many rounds of hyperparameter tuning.

Validation Set: Another Partition



The previous module introduced partitioning a data set into a training set and a test
set. This partitioning enabled you to train on one set of examples and then to test the
model against a different set of examples. With two partitions, the workflow could
look as follows:

Figure 1. A possible workflow?

In the figure, "Tweak model" means adjusting anything about the model you can
dream up—from changing the learning rate, to adding or removing features, to
designing a completely new model from scratch. At the end of this workflow, you
pick the model that does best on the test set.

Dividing the data set into two sets is a good idea, but not a panacea. You can greatly
reduce your chances of overfitting by partitioning the data set into the three subsets
shown in the following figure:

Figure 2. Slicing a single data set into three subsets.

Use the validation set to evaluate results from the training set. Then, use the test set
to double-check your evaluation after the model has "passed" the validation set. The
following figure shows this new workflow:
Figure 3. A better workflow.

In this improved workflow:

1. Pick the model that does best on the validation set.

2. Double-check that model against the test set.

This is a better workflow because it creates fewer exposures to the test set.
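As a minimal sketch of the three-way split, here is one way to carve a dataset into roughly 70/15/15 partitions with scikit-learn's train_test_split (the library choice and the split ratios are assumptions, not something the course mandates):

    import numpy as np
    from sklearn.model_selection import train_test_split

    # Illustrative data: 1,000 examples with 4 features (assumed).
    X = np.random.rand(1000, 4)
    y = np.random.rand(1000)

    # First carve off 30% for evaluation, then split that portion
    # half-and-half into validation and test sets.
    X_train, X_eval, y_train, y_eval = train_test_split(
        X, y, test_size=0.30, random_state=42)
    X_val, X_test, y_val, y_test = train_test_split(
        X_eval, y_eval, test_size=0.50, random_state=42)

    print(len(X_train), len(X_val), len(X_test))   # 700 150 150

You tune against the validation set and touch the test set only for the final double-check.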

Regularization for Simplicity: L₂ Regularization



Consider the following generalization curve, which shows the loss for both the
training set and validation set against the number of training iterations.

Figure 1. Loss on training set and validation set.

Figure 1 shows a model in which training loss gradually decreases, but validation
loss eventually goes up. In other words, this generalization curve shows that the
model is overfitting to the data in the training set. Channeling our inner Ockham,
perhaps we could prevent overfitting by penalizing complex models, a principle
called regularization.

In other words, instead of simply aiming to minimize loss (empirical risk minimization):

minimize(Loss(Data|Model))

we'll now minimize loss+complexity, which is called structural risk minimization:

minimize(Loss(Data|Model) + complexity(Model))

Our training optimization algorithm is now a function of two terms: the loss term,
which measures how well the model fits the data, and the regularization term, which
measures model complexity.

Machine Learning Crash Course focuses on two common (and somewhat related)
ways to think of model complexity:

• Model complexity as a function of the weights of all the features in the model.

• Model complexity as a function of the total number of features with nonzero weights.
(A later module covers this approach.)
If model complexity is a function of weights, a feature weight with a high absolute
value is more complex than a feature weight with a low absolute value.

We can quantify complexity using the L2 regularization formula, which defines the
regularization term as the sum of the squares of all the feature weights:

$$L_2 \text{ regularization term} = \lVert \boldsymbol{w} \rVert_2^2 = w_1^2 + w_2^2 + \ldots + w_n^2$$

In this formula, weights close to zero have little effect on model complexity, while
outlier weights can have a huge impact.
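The formula is straightforward to compute. A short sketch with illustrative weights shows how a single outlier weight dominates the term:

    import numpy as np

    # Illustrative feature weights (assumed, not course data).
    w = np.array([0.2, 0.5, 5.0, 1.0, 0.25, 0.3])

    # ||w||_2^2 = w1^2 + w2^2 + ... + wn^2
    l2_term = np.sum(w ** 2)
    print(l2_term)   # ≈ 26.44, of which the outlier weight 5.0 contributes 25.0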

Regularization for Simplicity: Lambda



Model developers tune the overall impact of the regularization term by multiplying its
value by a scalar known as lambda (also called the regularization rate). That is,
model developers aim to do the following:

minimize(Loss(Data|Model) + λ complexity(Model))

Performing L2 regularization has the following effect on a model:

• Encourages weight values toward 0 (but not exactly 0)
• Encourages the mean of the weights toward 0, with a normal (bell-shaped or Gaussian) distribution.

Increasing the lambda value strengthens the regularization effect. For example, the
histogram of weights for a high value of lambda might look as shown in Figure 2.

Figure 2. Histogram of weights.

Lowering the value of lambda tends to yield a flatter histogram, as shown in Figure 3.

Figure 3. Histogram of weights produced by a lower lambda value.

When choosing a lambda value, the goal is to strike the right balance between
simplicity and training-data fit:

• If your lambda value is too high, your model will be simple, but you run the risk
of underfitting your data. Your model won't learn enough about the training
data to make useful predictions.
• If your lambda value is too low, your model will be more complex, and you run
the risk of overfitting your data. Your model will learn too much about the
particularities of the training data, and won't be able to generalize to new
data.
Note: Setting lambda to zero removes regularization completely. In this case, training
focuses exclusively on minimizing loss, which poses the highest possible overfitting risk.

The ideal value of lambda produces a model that generalizes well to new, previously
unseen data. Unfortunately, that ideal value of lambda is data-dependent, so you'll
need to do some tuning.

Logistic Regression: Calculating a Probability



Many problems require a probability estimate as output. Logistic regression is an extremely efficient mechanism for calculating probabilities. Practically speaking, you can use the returned probability in either of the following two ways:

• "As is"

• Converted to a binary category.

Let's consider how we might use the probability "as is." Suppose we create a logistic
regression model to predict the probability that a dog will bark during the middle of
the night. We'll call that probability:

$$p(\text{bark} \mid \text{night})$$

If the logistic regression model predicts $p(\text{bark} \mid \text{night}) = 0.05$, then over a year, the dog's owners should be startled awake approximately 18 times:

$$\text{startled} = p(\text{bark} \mid \text{night}) \cdot \text{nights} = 0.05 \cdot 365 \approx 18$$

In many cases, you'll map the logistic regression output into the solution to a binary
classification problem, in which the goal is to correctly predict one of two possible
labels (e.g., "spam" or "not spam"). A later module focuses on that.

You might be wondering how a logistic regression model can ensure output that
always falls between 0 and 1. As it happens, a sigmoid function, defined as follows,
produces output having those same characteristics:

$$y = \frac{1}{1 + e^{-z}}$$

The sigmoid function yields the following plot:


Figure 1: Sigmoid function.

If $z$ represents the output of the linear layer of a model trained with logistic regression, then $\text{sigmoid}(z)$ will yield a value (a probability) between 0 and 1. In mathematical terms:

$$y' = \frac{1}{1 + e^{-z}}$$

where:

• $y'$ is the output of the logistic regression model for a particular example.
• $z = b + w_1 x_1 + w_2 x_2 + \ldots + w_N x_N$
  • The $w$ values are the model's learned weights, and $b$ is the bias.
  • The $x$ values are the feature values for a particular example.

Note that $z$ is also referred to as the log-odds because the inverse of the sigmoid states that $z$ can be defined as the log of the probability of the 1 label (e.g., "dog barks") divided by the probability of the 0 label (e.g., "dog doesn't bark"):

$$z = \log\left(\frac{y}{1 - y}\right)$$

Here is the sigmoid function with ML labels:

Figure 2: Logistic regression output.
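A small sketch of the math above: the sigmoid maps the linear output z to a probability, and the log-odds recovers z. The bias, weights, and feature values below are illustrative assumptions.

    import math

    def sigmoid(z):
        """Map the linear output z to a probability in (0, 1)."""
        return 1.0 / (1.0 + math.exp(-z))

    def log_odds(y):
        """Inverse of the sigmoid: log(y / (1 - y))."""
        return math.log(y / (1.0 - y))

    # z = b + w1*x1 + w2*x2, with assumed learned parameters.
    b, w = 1.0, [-2.0, 0.5]
    x = [1.5, 2.0]
    z = b + sum(wi * xi for wi, xi in zip(w, x))

    p = sigmoid(z)
    print(f"z = {z}, p = {p:.3f}")          # z = -1.0, p ≈ 0.269
    print(f"log-odds = {log_odds(p):.3f}")  # recovers z = -1.0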

Logistic Regression: Loss and Regularization



Loss function for Logistic Regression


The loss function for linear regression is squared loss. The loss function for logistic
regression is Log Loss, which is defined as follows:

$$\text{Log Loss} = \sum_{(x,y) \in D} -y \log(y') - (1 - y) \log(1 - y')$$

where:

• $(x, y) \in D$ is the data set containing many labeled examples, which are $(x, y)$ pairs.
• $y$ is the label in a labeled example. Since this is logistic regression, every value of $y$ must either be 0 or 1.
• $y'$ is the predicted value (somewhere between 0 and 1), given the set of features in $x$.
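A direct translation of the formula into NumPy, as a sketch. Real implementations clip predictions away from exactly 0 and 1 to avoid log(0); the labels and predictions below are illustrative.

    import numpy as np

    def log_loss(y_true, y_pred, eps=1e-15):
        # Sum over examples of -y*log(y') - (1-y)*log(1-y').
        y_pred = np.clip(y_pred, eps, 1 - eps)   # guard against log(0)
        return np.sum(-y_true * np.log(y_pred)
                      - (1 - y_true) * np.log(1 - y_pred))

    y_true = np.array([1, 0, 1, 1, 0])            # labels (0 or 1)
    y_pred = np.array([0.9, 0.1, 0.8, 0.3, 0.2])  # predicted probabilities

    print(log_loss(y_true, y_pred))   # confident, correct predictions keep this low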

Regularization in Logistic Regression


Regularization is extremely important in logistic regression modeling. Without
regularization, the asymptotic nature of logistic regression would keep driving loss
towards 0 in high dimensions. Consequently, most logistic regression models use
one of the following two strategies to dampen model complexity:

• L2 regularization.

• Early stopping, that is, limiting the number of training steps or the learning rate.

(We'll discuss a third strategy—L1 regularization—in a later module.)

Imagine that you assign a unique id to each example, and map each id to its own
feature. If you don't specify a regularization function, the model will become
completely overfit. That's because the model would try to drive loss to zero on all
examples and never get there, driving the weights for each indicator feature to
+infinity or -infinity. This can happen in high dimensional data with feature crosses,
when there’s a huge mass of rare crosses that happen only on one example each.

Fortunately, using L2 or early stopping will prevent this problem.

Summary

• Logistic regression models generate probabilities.

• Log Loss is the loss function for logistic regression.

• Logistic regression is widely used by many practitioners.

Classification: Thresholding

Logistic regression returns a probability. You can use the returned probability "as is"
(for example, the probability that the user will click on this ad is 0.00023) or convert
the returned probability to a binary value (for example, this email is spam).
A logistic regression model that returns 0.9995 for a particular email message is
predicting that it is very likely to be spam. Conversely, another email message with a
prediction score of 0.0003 on that same logistic regression model is very likely not
spam. However, what about an email message with a prediction score of 0.6? In
order to map a logistic regression value to a binary category, you must define
a classification threshold (also called the decision threshold). A value above that
threshold indicates "spam"; a value below indicates "not spam." It is tempting to
assume that the classification threshold should always be 0.5, but thresholds are
problem-dependent, and are therefore values that you must tune.

The following sections take a closer look at metrics you can use to evaluate a
classification model's predictions, as well as the impact of changing the
classification threshold on these predictions.

Note: "Tuning" a threshold for logistic regression is different from tuning hyperparameters
such as learning rate. Part of choosing a threshold is assessing how much you'll suffer for
making a mistake. For example, mistakenly labeling a non-spam message as spam is very
bad. However, mistakenly labeling a spam message as non-spam is unpleasant, but hardly
the end of your job.
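As a sketch, here is how the returned probability maps to a binary category with a tunable threshold; the 0.8 value is purely illustrative, not a recommendation:

    probabilities = [0.9995, 0.0003, 0.6, 0.85]   # illustrative model outputs

    def classify(p, threshold):
        # Map a probability to a binary category using the decision threshold.
        return "spam" if p >= threshold else "not spam"

    threshold = 0.8   # problem-dependent; must be tuned
    for p in probabilities:
        print(f"{p:.4f} -> {classify(p, threshold)}")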

Classification: True vs. False and Positive vs. Negative



In this section, we'll define the primary building blocks of the metrics we'll use to
evaluate classification models. But first, a fable:

An Aesop's Fable: The Boy Who Cried Wolf (compressed)

A shepherd boy gets bored tending the town's flock. To have some fun, he cries out,
"Wolf!" even though no wolf is in sight. The villagers run to protect the flock, but then
get really mad when they realize the boy was playing a joke on them.

[Iterate previous paragraph N times.]

One night, the shepherd boy sees a real wolf approaching the flock and calls out,
"Wolf!" The villagers refuse to be fooled again and stay in their houses. The hungry
wolf turns the flock into lamb chops. The town goes hungry. Panic ensues.

Let's make the following definitions:

• "Wolf" is a positive class.

• "No wolf" is a negative class.

We can summarize our "wolf-prediction" model using a 2x2 confusion matrix that depicts all four possible outcomes:

True Positive (TP):
• Reality: A wolf threatened.
• Shepherd said: "Wolf."
• Outcome: Shepherd is a hero.

False Positive (FP):
• Reality: No wolf threatened.
• Shepherd said: "Wolf."
• Outcome: Villagers are angry at shepherd for waking them up.

False Negative (FN):
• Reality: A wolf threatened.
• Shepherd said: "No wolf."
• Outcome: The wolf ate all the sheep.

True Negative (TN):
• Reality: No wolf threatened.
• Shepherd said: "No wolf."
• Outcome: Everyone is fine.

A true positive is an outcome where the model correctly predicts the positive class.
Similarly, a true negative is an outcome where the model correctly predicts
the negative class.

A false positive is an outcome where the model incorrectly predicts the positive class. And a false negative is an outcome where the model incorrectly predicts the negative class.

In the following sections, we'll look at how to evaluate classification models using
metrics derived from these four outcomes.

Classification: Accuracy


Accuracy is one metric for evaluating classification models. Informally, accuracy is the fraction of predictions our model got right. Formally, accuracy has the following definition:

$$\text{Accuracy} = \frac{\text{Number of correct predictions}}{\text{Total number of predictions}}$$

For binary classification, accuracy can also be calculated in terms of positives and
negatives as follows:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

Where TP = True Positives, TN = True Negatives, FP = False Positives, and FN = False Negatives.

Let's try calculating accuracy for the following model that classified 100 tumors
as malignant (the positive class) or benign (the negative class):
True Positive (TP):
• Reality: Malignant
• ML model predicted: Malignant
• Number of TP results: 1

False Positive (FP):
• Reality: Benign
• ML model predicted: Malignant
• Number of FP results: 1

False Negative (FN):
• Reality: Malignant
• ML model predicted: Benign
• Number of FN results: 8

True Negative (TN):
• Reality: Benign
• ML model predicted: Benign
• Number of TN results: 90


$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} = \frac{1 + 90}{1 + 90 + 1 + 8} = 0.91$$

Accuracy comes out to 0.91, or 91% (91 correct predictions out of 100 total
examples). That means our tumor classifier is doing a great job of identifying
malignancies, right?

Actually, let's do a closer analysis of positives and negatives to gain more insight
into our model's performance.

Of the 100 tumor examples, 91 are benign (90 TNs and 1 FP) and 9 are malignant (1
TP and 8 FNs).

Of the 91 benign tumors, the model correctly identifies 90 as benign. That's good.
However, of the 9 malignant tumors, the model only correctly identifies 1 as
malignant—a terrible outcome, as 8 out of 9 malignancies go undiagnosed!

While 91% accuracy may seem good at first glance, another tumor-classifier model
that always predicts benign would achieve the exact same accuracy (91/100 correct
predictions) on our examples. In other words, our model is no better than one that
has zero predictive ability to distinguish malignant tumors from benign tumors.

Accuracy alone doesn't tell the full story when you're working with a class-imbalanced data set, like this one, where there is a significant disparity between the number of positive and negative labels.

In the next section, we'll look at two better metrics for evaluating class-imbalanced
problems: precision and recall.

Precision
Precision attempts to answer the following question:

What proportion of positive identifications was actually correct?

Precision is defined as follows:


$$\text{Precision} = \frac{TP}{TP + FP}$$
Note: A model that produces no false positives has a precision of 1.0.

Let's calculate precision for our ML model from the previous section that analyzes
tumors:

True Positives (TPs): 1    False Positives (FPs): 1
False Negatives (FNs): 8    True Negatives (TNs): 90

$$\text{Precision} = \frac{TP}{TP + FP} = \frac{1}{1 + 1} = 0.5$$

Our model has a precision of 0.5—in other words, when it predicts a tumor is
malignant, it is correct 50% of the time.

Recall
Recall attempts to answer the following question:

What proportion of actual positives was identified correctly?

Mathematically, recall is defined as follows:

$$\text{Recall} = \frac{TP}{TP + FN}$$
Note: A model that produces no false negatives has a recall of 1.0.

Let's calculate recall for our tumor classifier:

True Positives (TPs): 1    False Positives (FPs): 1
False Negatives (FNs): 8    True Negatives (TNs): 90

$$\text{Recall} = \frac{TP}{TP + FN} = \frac{1}{1 + 8} = 0.11$$

Our model has a recall of 0.11—in other words, it correctly identifies 11% of all
malignant tumors.
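Given the confusion-matrix counts, all three metrics are one-liners. A sketch using the tumor example's numbers:

    # Confusion-matrix counts from the tumor classifier above.
    tp, fp, fn, tn = 1, 1, 8, 90

    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)

    print(f"accuracy  = {accuracy:.2f}")    # 0.91: looks impressive...
    print(f"precision = {precision:.2f}")   # 0.50: half the malignant calls are wrong
    print(f"recall    = {recall:.2f}")      # 0.11: most malignancies are missed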

Precision and Recall: A Tug of War


To fully evaluate the effectiveness of a model, you must examine both precision and
recall. Unfortunately, precision and recall are often in tension. That is, improving
precision typically reduces recall and vice versa. Explore this notion by looking at the
following figure, which shows 30 predictions made by an email classification model.
Those to the right of the classification threshold are classified as "spam", while
those to the left are classified as "not spam."
Figure 1. Classifying email messages as spam or not spam.

Let's calculate precision and recall based on the results shown in Figure 1:

True Positives (TP): 8    False Positives (FP): 2
False Negatives (FN): 3    True Negatives (TN): 17

Precision measures the percentage of emails flagged as spam that were correctly
classified—that is, the percentage of dots to the right of the threshold line that are
green in Figure 1:

$$\text{Precision} = \frac{TP}{TP + FP} = \frac{8}{8 + 2} = 0.8$$

Recall measures the percentage of actual spam emails that were correctly
classified—that is, the percentage of green dots that are to the right of the threshold
line in Figure 1:

$$\text{Recall} = \frac{TP}{TP + FN} = \frac{8}{8 + 3} = 0.73$$

Figure 2 illustrates the effect of increasing the classification threshold.

Figure 2. Increasing classification threshold.

The number of false positives decreases, but false negatives increase. As a result,
precision increases, while recall decreases:

True Positives (TP): 7    False Positives (FP): 1
False Negatives (FN): 4    True Negatives (TN): 18

$$\text{Precision} = \frac{TP}{TP + FP} = \frac{7}{7 + 1} = 0.88$$
$$\text{Recall} = \frac{TP}{TP + FN} = \frac{7}{7 + 4} = 0.64$$

Conversely, Figure 3 illustrates the effect of decreasing the classification threshold (from its original position in Figure 1).

Figure 3. Decreasing classification threshold.

False positives increase, and false negatives decrease. As a result, this time,
precision decreases and recall increases:
True Positives (TP): 9    False Positives (FP): 3
False Negatives (FN): 2    True Negatives (TN): 16

$$\text{Precision} = \frac{TP}{TP + FP} = \frac{9}{9 + 3} = 0.75$$
$$\text{Recall} = \frac{TP}{TP + FN} = \frac{9}{9 + 2} = 0.82$$

Various metrics have been developed that rely on both precision and recall. For
example, see F1 score.

Classification: ROC Curve and AUC



ROC curve
An ROC curve (receiver operating characteristic curve) is a graph showing the
performance of a classification model at all classification thresholds. This curve
plots two parameters:

• True Positive Rate

• False Positive Rate

True Positive Rate (TPR) is a synonym for recall and is therefore defined as follows:

$$TPR = \frac{TP}{TP + FN}$$

False Positive Rate (FPR) is defined as follows:

$$FPR = \frac{FP}{FP + TN}$$

An ROC curve plots TPR vs. FPR at different classification thresholds. Lowering the
classification threshold classifies more items as positive, thus increasing both False
Positives and True Positives. The following figure shows a typical ROC curve.

Figure 4. TP vs. FP rate at different classification thresholds.

To compute the points in an ROC curve, we could evaluate a logistic regression model many times with different classification thresholds, but this would be inefficient. Fortunately, there's an efficient, sorting-based algorithm that can provide this information for us, called AUC.

AUC: Area Under the ROC Curve


AUC stands for "Area under the ROC Curve." That is, AUC measures the entire two-dimensional area underneath the entire ROC curve (think integral calculus) from (0,0) to (1,1).

Figure 5. AUC (Area under the ROC Curve).

AUC provides an aggregate measure of performance across all possible classification thresholds. One way of interpreting AUC is as the probability that the model ranks a random positive example more highly than a random negative example. For example, given the following examples, which are arranged from left to right in ascending order of logistic regression predictions:

Figure 6. Predictions ranked in ascending order of logistic regression score.

AUC represents the probability that a random positive (green) example is positioned
to the right of a random negative (red) example.

AUC ranges in value from 0 to 1. A model whose predictions are 100% wrong has an
AUC of 0.0; one whose predictions are 100% correct has an AUC of 1.0.
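The ranking interpretation can be checked directly. The following sketch (scikit-learn assumed, toy data reused from the ROC sketch above) compares a brute-force count of correctly ranked positive/negative pairs against roc_auc_score; the two values agree:

# Sketch: AUC as the probability that a random positive example
# outranks a random negative example.
import itertools
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([0, 0, 1, 1, 0, 1, 1, 0, 1, 0])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.65, 0.3, 0.7, 0.5])

pos = y_score[y_true == 1]
neg = y_score[y_true == 0]
pairs = list(itertools.product(pos, neg))
wins = sum(p > n for p, n in pairs) + 0.5 * sum(p == n for p, n in pairs)

print(wins / len(pairs))               # pairwise estimate: 0.92
print(roc_auc_score(y_true, y_score))  # same value: 0.92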

AUC is desirable for the following two reasons:

• AUC is scale-invariant. It measures how well predictions are ranked, rather than their
absolute values.

• AUC is classification-threshold-invariant. It measures the quality of the model's
predictions irrespective of what classification threshold is chosen.

However, both these reasons come with caveats, which may limit the usefulness of
AUC in certain use cases:

• Scale invariance is not always desirable. For example, sometimes we really
do need well-calibrated probability outputs, and AUC won't tell us about that.
• Classification-threshold invariance is not always desirable. In cases where
there are wide disparities in the cost of false negatives vs. false positives, it
may be critical to minimize one type of classification error. For example, when
doing email spam detection, you likely want to prioritize minimizing false
positives (even if that results in a significant increase of false negatives). AUC
isn't a useful metric for this type of optimization.

Classification: Prediction Bias


Estimated Time: 7 minutes

Logistic regression predictions should be unbiased. That is:

"average of predictions" should ≈ "average of observations"

Prediction bias is a quantity that measures how far apart those two averages are.
That is:

prediction bias = average of predictions − average of labels in data set

Note: "Prediction bias" is a different quantity than bias (the b in wx + b).

A significant nonzero prediction bias tells you there is a bug somewhere in your
model, as it indicates that the model is wrong about how frequently positive labels
occur.

For example, let's say we know that on average, 1% of all emails are spam. If we
don't know anything at all about a given email, we should predict that it's 1% likely to
be spam. Similarly, a good spam model should predict on average that emails are 1%
likely to be spam. (In other words, if we average the predicted likelihoods of each
individual email being spam, the result should be 1%.) If instead, the model's average
prediction is 20% likelihood of being spam, we can conclude that it exhibits
prediction bias.
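Measuring prediction bias is a one-liner once you have predictions and labels. A minimal NumPy sketch, with made-up spam probabilities and 0/1 labels:

# Sketch: prediction bias = mean prediction - mean label.
import numpy as np

predictions = np.array([0.02, 0.01, 0.30, 0.05, 0.01, 0.85])
labels = np.array([0, 0, 0, 0, 0, 1])

prediction_bias = predictions.mean() - labels.mean()
print(f"prediction bias = {prediction_bias:+.3f}")  # +0.040 here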

Possible root causes of prediction bias are:

• Incomplete feature set

• Noisy data set

• Buggy pipeline

• Biased training sample

• Overly strong regularization

You might be tempted to correct prediction bias by post-processing the learned
model—that is, by adding a calibration layer that adjusts your model's output to
reduce the prediction bias. For example, if your model has +3% bias, you could add a
calibration layer that lowers the mean prediction by 3%. However, adding a
calibration layer is a bad idea for the following reasons:

• You're fixing the symptom rather than the cause.

• You've built a more brittle system that you must now keep up to date.

If possible, avoid calibration layers. Projects that use calibration layers tend to
become reliant on them—using calibration layers to fix all their model's sins.
Ultimately, maintaining the calibration layers can become a nightmare.

Note: A good model will usually have near-zero bias. That said, a low prediction bias does
not prove that your model is good. A really terrible model could have a zero prediction bias.
For example, a model that just predicts the mean value for all examples would be a bad
model, despite having zero bias.

Bucketing and Prediction Bias


Logistic regression predicts a value between 0 and 1. However, all labeled examples
are either exactly 0 (meaning, for example, "not spam") or exactly 1 (meaning, for
example, "spam"). Therefore, when examining prediction bias, you cannot accurately
determine the prediction bias based on only one example; you must examine the
prediction bias on a "bucket" of examples. That is, prediction bias for logistic
regression only makes sense when grouping enough examples together to be able to
compare a predicted value (for example, 0.392) to observed values (for example,
0.394).

You can form buckets in the following ways (both are sketched in the code after this list):

• Linearly breaking up the target predictions.

• Forming quantiles.
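Here is a minimal sketch of the quantile approach, using NumPy and synthetic data (the labels are deliberately drawn from the predictions, so the buckets should line up well):

# Sketch: quantile bucketing for a calibration check.
import numpy as np

rng = np.random.default_rng(0)
predictions = rng.uniform(0.0, 1.0, size=10_000)
labels = rng.binomial(1, predictions)  # synthetic, well-calibrated labels

# Form 10 equal-population buckets from the prediction quantiles.
edges = np.quantile(predictions, np.linspace(0.0, 1.0, 11))
bucket = np.digitize(predictions, edges[1:-1])

for b in range(10):
    mask = bucket == b
    print(f"bucket {b}: mean prediction={predictions[mask].mean():.3f}  "
          f"mean label={labels[mask].mean():.3f}")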

Consider the following calibration plot from a particular model. Each dot represents
a bucket of 1,000 values. The axes have the following meanings:

• The x-axis represents the average of values the model predicted for that bucket.

• The y-axis represents the actual average of values in the data set for that bucket.

Both axes are logarithmic scales.

Figure 8. Prediction bias curve (logarithmic scales)

Why are the predictions so poor for only part of the model? Here are a few
possibilities:

• The training set doesn't adequately represent certain subsets of the data space.

• Some subsets of the data set are noisier than others.

• The model is overly regularized. (Consider reducing the value of lambda.)

Neural Networks: Structure


Estimated Time: 7 minutes

If you recall from the Feature Crosses unit, the following classification problem is
nonlinear:
Figure 1. Nonlinear classification problem.

"Nonlinear" means that you can't accurately predict a label with a model of the
form 𝑏+𝑤1𝑥1+𝑤2𝑥2 In other words, the "decision surface" is not a line. Previously,
we looked at feature crosses as one possible approach to modeling nonlinear
problems.

Now consider the following data set:

Figure 2. A more difficult nonlinear classification problem.

The data set shown in Figure 2 can't be solved with a linear model.

To see how neural networks might help with nonlinear problems, let's start by
representing a linear model as a graph:
Figure 3. Linear model as graph.

Each blue circle represents an input feature, and the green circle represents the
weighted sum of the inputs.

How can we alter this model to improve its ability to deal with nonlinear problems?

Hidden Layers
In the model represented by the following graph, we've added a "hidden layer" of
intermediary values. Each yellow node in the hidden layer is a weighted sum of the
blue input node values. The output is a weighted sum of the yellow nodes.

Figure 4. Graph of two-layer model.

Is this model linear? Yes—its output is still a linear combination of its inputs.

In the model represented by the following graph, we've added a second hidden layer
of weighted sums.

Figure 5. Graph of three-layer model.

Is this model still linear? Yes, it is. When you express the output as a function of the
input and simplify, you get just another weighted sum of the inputs. This sum won't
effectively model the nonlinear problem in Figure 2.
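You can verify this collapse numerically. A minimal NumPy sketch with made-up weight matrices:

# Sketch: stacked linear layers (no activation) collapse to one layer.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=3)        # input features
W1 = rng.normal(size=(4, 3))  # hidden layer 1 weights
W2 = rng.normal(size=(2, 4))  # hidden layer 2 weights

deep = W2 @ (W1 @ x)          # "three-layer" model, no nonlinearity
collapsed = (W2 @ W1) @ x     # one equivalent linear layer

print(np.allclose(deep, collapsed))  # True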
Activation Functions
To model a nonlinear problem, we can directly introduce a nonlinearity. We can pipe
each hidden layer node through a nonlinear function.

In the model represented by the following graph, the value of each node in Hidden
Layer 1 is transformed by a nonlinear function before being passed on to the
weighted sums of the next layer. This nonlinear function is called the activation
function.

Figure 6. Graph of three-layer model with activation function.

Now that we've added an activation function, adding layers has more impact.
Stacking nonlinearities on nonlinearities lets us model very complicated
relationships between the inputs and the predicted outputs. In brief, each layer is
effectively learning a more complex, higher-level function over the raw inputs. If
you'd like to develop more intuition on how this works, see Chris Olah's excellent
blog post.

Common Activation Functions

The following sigmoid activation function converts the weighted sum to a value
between 0 and 1.

F(x) = 1 / (1 + e^(-x))

Here's a plot:

Figure 7. Sigmoid activation function.

The following rectified linear unit activation function (or ReLU, for short) often works
a little better than a smooth function like the sigmoid, while also being significantly
easier to compute.

F(x) = max(0, x)

The superiority of ReLU is based on empirical findings, probably driven by ReLU
having a more useful range of responsiveness. A sigmoid's responsiveness falls off
relatively quickly on both sides.

Figure 8. ReLU activation function.


In fact, any mathematical function can serve as an activation function. Suppose
that σ represents our activation function (ReLU, sigmoid, or whatever). Consequently,
the value of a node in the network is given by the following formula:

σ(w · x + b)
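A minimal NumPy sketch of that formula, with made-up weights and inputs, for both activation functions discussed above:

# Sketch: a node's value sigma(w . x + b) under sigmoid and ReLU.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    return np.maximum(0.0, z)

x = np.array([0.5, -1.2, 3.0])  # inputs to the node
w = np.array([0.4, 0.1, -0.2])  # the node's weights
b = 0.1                         # the node's bias

z = w @ x + b
print(sigmoid(z), relu(z))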

TensorFlow provides out-of-the-box support for many activation functions. You can
find these activation functions within TensorFlow's list of wrappers for primitive
neural network operations. That said, we still recommend starting with ReLU.
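For instance, here is a minimal Keras sketch (TensorFlow assumed; layer sizes are arbitrary placeholders) of a model whose hidden layers use ReLU:

# Sketch: a small binary classifier with ReLU hidden layers.
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation="relu", input_shape=(2,)),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),  # binary output
])
model.compile(optimizer="adam", loss="binary_crossentropy")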

Summary
Now our model has all the standard components of what people usually mean when
they say "neural network":

• A set of nodes, analogous to neurons, organized in layers.

• A set of weights representing the connections between each neural network layer
and the layer beneath it. The layer beneath may be another neural network layer, or
some other kind of layer.

• A set of biases, one for each node.

• An activation function that transforms the output of each node in a layer. Different
layers may have different activation functions.

Training Neural Networks


Backpropagation is the most common training algorithm for neural networks. It
makes gradient descent feasible for multi-layer neural networks. TensorFlow
handles backpropagation automatically, so you don't need a deep understanding of
the algorithm. To get a sense of how it works, walk through the
following: Backpropagation algorithm visual explanation. As you scroll through the
preceding explanation, note the following:

• How data flows through the graph.

• How dynamic programming lets us avoid computing exponentially many paths
through the graph. Here "dynamic programming" just means recording intermediate
results on the forward and backward passes.

Backprop: What You Need To Know

• Gradients are important
  • If it's differentiable, we can probably learn on it
• Gradients can vanish
  • Each additional layer can successively reduce signal vs. noise
  • ReLUs are useful here
• Gradients can explode
  • Learning rates are important here
  • Batch normalization (useful knob) can help
• ReLU layers can die
  • Keep calm and lower your learning rates

Normalizing Feature Values

• We'd like our features to have reasonable scales
  • Roughly zero-centered, [-1, 1] range often works well
  • Helps gradient descent converge; avoid NaN trap
  • Avoiding outlier values can also help
• Can use a few standard methods (sketched below):
  • Linear scaling
  • Hard cap (clipping) to max, min
  • Log scaling
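A minimal NumPy sketch of the three methods, on a made-up long-tailed feature:

# Sketch: linear scaling, clipping, and log scaling.
import numpy as np

values = np.array([1.0, 10.0, 100.0, 1_000.0, 10_000.0])

# Linear scaling to [-1, 1].
linear = 2 * (values - values.min()) / (values.max() - values.min()) - 1

# Hard cap (clipping) to a chosen max and min.
clipped = np.clip(values, 0.0, 1_000.0)

# Log scaling, useful for long-tailed distributions.
logged = np.log(values)

print(linear, clipped, logged, sep="\n")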

Dropout Regularization

• Dropout: Another form of regularization, useful for NNs
  • Works by randomly "dropping out" units in a network for a single gradient step
  • There's a connection to ensemble models here
• The more you drop out, the stronger the regularization
  • 0.0 = no dropout regularization
  • 1.0 = drop everything out! learns nothing
  • Intermediate values more useful

Training Neural Networks: Best Practices


Estimated Time: 5 minutes

This section explains backpropagation's failure cases and the most common way to
regularize a neural network.

Failure Cases
There are a number of common ways for backpropagation to go wrong.

Vanishing Gradients

The gradients for the lower layers (closer to the input) can become very small. In
deep networks, computing these gradients can involve taking the product of many
small terms.

When the gradients vanish toward 0 for the lower layers, these layers train very
slowly, or not at all.

The ReLU activation function can help prevent vanishing gradients.

Exploding Gradients

If the weights in a network are very large, then the gradients for the lower layers
involve products of many large terms. In this case you can have exploding gradients:
gradients that get too large to converge.

Batch normalization can help prevent exploding gradients, as can lowering the
learning rate.

Dead ReLU Units

Once the weighted sum for a ReLU unit falls below 0, the ReLU unit can get stuck. It
outputs 0 activation, contributing nothing to the network's output, and gradients can
no longer flow through it during backpropagation. With a source of gradients cut off,
the input to the ReLU may not ever change enough to bring the weighted sum back
above 0.

Lowering the learning rate can help keep ReLU units from dying.
Dropout Regularization
Yet another form of regularization, called Dropout, is useful for neural networks. It
works by randomly "dropping out" unit activations in a network for a single gradient
step. The more you drop out, the stronger the regularization:

• 0.0 = No dropout regularization.

• 1.0 = Drop out everything. The model learns nothing.

• Values between 0.0 and 1.0 = More useful.
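In Keras (an assumption; the course itself uses TensorFlow), dropout is a layer inserted between the layers it regularizes, with the dropout rate as its argument. A minimal sketch:

# Sketch: dropout applied between hidden layers.
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(20,)),
    tf.keras.layers.Dropout(0.3),  # 30% of activations dropped per step
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dropout(0.3),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])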

Multi-Class Neural Networks: One vs. All


Estimated Time: 2 minutes

One vs. all provides a way to leverage binary classification. Given a classification
problem with N possible solutions, a one-vs.-all solution consists of N separate
binary classifiers—one binary classifier for each possible outcome. During training,
the model runs through a sequence of binary classifiers, training each to answer a
separate classification question. For example, given a picture of a dog, five different
recognizers might be trained, four seeing the image as a negative example (not an
apple, not a bear, etc.) and one seeing the image as a positive example (a dog). That
is:

1. Is this image an apple? No.

2. Is this image a bear? No.

3. Is this image candy? No.

4. Is this image a dog? Yes.

5. Is this image an egg? No.

This approach is fairly reasonable when the total number of classes is small, but
becomes increasingly inefficient as the number of classes rises.

We can create a significantly more efficient one-vs.-all model with a deep neural
network in which each output node represents a different class.
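A minimal Keras sketch of such a model (TensorFlow assumed; sizes are placeholders): each of the five output nodes is an independent sigmoid, answering its own yes/no question.

# Sketch: a one-vs.-all head with one sigmoid output per class.
import tensorflow as tf

NUM_CLASSES = 5  # apple, bear, candy, dog, egg
model = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation="relu", input_shape=(64,)),
    tf.keras.layers.Dense(NUM_CLASSES, activation="sigmoid"),
])
# Each output is an independent binary classifier, so use a binary
# cross-entropy loss rather than a softmax/categorical loss.
model.compile(optimizer="adam", loss="binary_crossentropy")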

Multi-Class Neural Networks: Softmax


Estimated Time: 8 minutes
Recall that logistic regression produces a decimal between 0 and 1.0. For example, a
logistic regression output of 0.8 from an email classifier suggests an 80% chance of
an email being spam and a 20% chance of it being not spam. Clearly, the sum of the
probabilities of an email being either spam or not spam is 1.0.

Softmax extends this idea into a multi-class world. That is, Softmax assigns decimal
probabilities to each class in a multi-class problem. Those decimal probabilities
must add up to 1.0. This additional constraint helps training converge more quickly
than it otherwise would.

For example, returning to the image analysis we saw in Figure 1, Softmax might
produce the following likelihoods of an image belonging to a particular class:

Class Probability

apple 0.001

bear 0.04

candy 0.008

dog 0.95

egg 0.001

Softmax is implemented through a neural network layer just before the output layer.
The Softmax layer must have the same number of nodes as the output layer.

Figure 2. A Softmax layer within a neural network.

The Softmax equation is as follows:

p(y = j | x) = e^(z_j) / Σ_k e^(z_k)

where the z values are the outputs of the layer feeding into the Softmax layer.
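A minimal NumPy sketch of that equation; the logits here are made up, not the exact values behind the table above:

# Sketch: softmax turns a vector of logits into probabilities.
import numpy as np

def softmax(z):
    z = z - z.max()  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

logits = np.array([-2.8, 0.9, -0.7, 4.1, -2.8])  # apple..egg (made up)
probs = softmax(logits)
print(probs, probs.sum())  # the probabilities add up to 1.0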

Softmax Options
Consider the following variants of Softmax:

• Full Softmax is the Softmax we've been discussing; that is, Softmax
calculates a probability for every possible class.
• Candidate sampling means that Softmax calculates a probability for all the
positive labels but only for a random sample of negative labels. For example,
if we are interested in determining whether an input image is a beagle or a
bloodhound, we don't have to provide probabilities for every non-doggy
example.
Full Softmax is fairly cheap when the number of classes is small but becomes
prohibitively expensive when the number of classes climbs. Candidate sampling can
improve efficiency in problems having a large number of classes.

One Label vs. Many Labels


Softmax assumes that each example is a member of exactly one class. Some
examples, however, can simultaneously be a member of multiple classes. For such
examples:

• You may not use Softmax.

• You must rely on multiple logistic regressions.

For example, suppose your examples are images containing exactly one item—a
piece of fruit. Softmax can determine the likelihood of that one item being a pear, an
orange, an apple, and so on. If your examples are images containing all sorts of
things—bowls of different kinds of fruit—then you'll have to use multiple logistic
regressions instead.

Embeddings
An embedding is a relatively low-dimensional space into which you can translate
high-dimensional vectors. Embeddings make it easier to do machine learning on
large inputs like sparse vectors representing words. Ideally, an embedding captures
some of the semantics of the input by placing semantically similar inputs close
together in the embedding space. An embedding can be learned and reused across
models.

Embeddings: Motivation From Collaborative Filtering


Estimated Time: 10 minutes

Collaborative filtering is the task of making predictions about the interests of a user
based on interests of many other users. As an example, let's look at the task of
movie recommendation. Suppose we have 500,000 users, and a list of the movies
each user has watched (from a catalog of 1,000,000 movies). Our goal is to
recommend movies to users.

To solve this problem, we need some method of determining which movies are
similar to each other. We can achieve this goal by embedding the movies into a low-
dimensional space created such that similar movies are nearby.
Before describing how we can learn the embedding, we first explore the type of
qualities we want the embedding to have, and how we will represent the training data
for learning the embedding.

Arrange Movies on a One-Dimensional Number Line


To help develop intuition about embeddings, on a piece of paper, try to arrange the
following movies on a one-dimensional number line so that the movies nearest each
other are the most closely related:

Bleu (R): A French widow grieves the loss of her husband and daughter after they perish in a car accident.

The Dark Knight Rises (PG-13): Batman endeavors to save Gotham City from nuclear annihilation in this sequel to The Dark Knight, set in the DC Comics universe.

Harry Potter and the Sorcerer's Stone (PG): An orphaned boy discovers he is a wizard and enrolls in Hogwarts School of Witchcraft and Wizardry, where he wages his first battle against the evil Lord Voldemort.

The Incredibles (PG): A family of superheroes forced to live as civilians in suburbia come out of retirement to save the superhero race from Syndrome and his killer robot.

Shrek (PG): A lovable ogre and his donkey sidekick set off on a mission to rescue Princess Fiona, who is imprisoned in her castle by a dragon.

Star Wars (PG): Luke Skywalker and Han Solo team up with two androids to rescue Princess Leia and save the galaxy.

The Triplets of Belleville (PG-13): When professional cyclist Champion is kidnapped during the Tour de France, his grandmother and overweight dog journey overseas to rescue him, with the help of a trio of elderly jazz singers.

Memento (R): An amnesiac desperately seeks to solve his wife's murder by tattooing clues onto his body.

Embeddings: Categorical Input Data


Estimated Time: 10 minutes

Categorical data refers to input features that represent one or more discrete items
from a finite set of choices. For example, it can be the set of movies a user has
watched, the set of words in a document, or the occupation of a person.
Categorical data is most efficiently represented via sparse tensors, which are
tensors with very few non-zero elements. For example, if we're building a movie
recommendation model, we can assign a unique ID to each possible movie, and then
represent each user by a sparse tensor of the movies they have watched, as shown
in Figure 3.

Figure 3. Data for our movie recommendation problem.

Each row of the matrix in Figure 3 is an example capturing a user's movie-viewing
history, and is represented as a sparse tensor because each user only watches a
small fraction of all possible movies. The last row corresponds to the sparse tensor
[1, 3, 999999], using the vocabulary indices shown above the movie icons.

Likewise one can represent words, sentences, and documents as sparse vectors
where each word in the vocabulary plays a role similar to the movies in our
recommendation example.

In order to use such representations within a machine learning system, we need a
way to represent each sparse vector as a vector of numbers so that semantically
similar items (movies or words) have similar distances in the vector space. But how
do you represent a word as a vector of numbers?

The simplest way is to define a giant input layer with a node for every word in your
vocabulary, or at least a node for every word that appears in your data. If 500,000
unique words appear in your data, you could represent a word with a length 500,000
vector and assign each word to a slot in the vector.

If you assign "horse" to index 1247, then to feed "horse" into your network you might
copy a 1 into the 1247th input node and 0s into all the rest. This sort of
representation is called a one-hot encoding, because only one index has a non-zero
value.
More typically your vector might contain counts of the words in a larger chunk of
text. This is known as a "bag of words" representation. In a bag-of-words vector,
several of the 500,000 nodes would have non-zero value.
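A minimal NumPy sketch of both representations, using the word indices from this section:

# Sketch: one-hot and bag-of-words vectors over a 500,000-word vocabulary.
import numpy as np

VOCAB_SIZE = 500_000
HORSE_INDEX = 1247

# One-hot encoding: a single 1 at the word's index.
one_hot = np.zeros(VOCAB_SIZE)
one_hot[HORSE_INDEX] = 1.0

# Bag of words: counts of each word in a chunk of text.
bag = np.zeros(VOCAB_SIZE)
for index in [1247, 238, 50_430, 1247]:  # e.g. horse, television, antelope, horse
    bag[index] += 1.0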

But however you determine the non-zero values, one-node-per-word gives you
very sparse input vectors—very large vectors with relatively few non-zero values.
Sparse representations have a couple of problems that can make it hard for a model
to learn effectively.

Size of Network
Huge input vectors mean a super-huge number of weights for a neural network. If
there are M words in your vocabulary and N nodes in the first layer of the network
above the input, you have MxN weights to train for that layer. A large number of
weights causes further problems:

• Amount of data. The more weights in your model, the more data you need to
train effectively.
• Amount of computation. The more weights, the more computation required to
train and use the model. It's easy to exceed the capabilities of your hardware.

Lack of Meaningful Relations Between Vectors


If you feed the pixel values of RGB channels into an image classifier, it makes sense
to talk about "close" values. Reddish blue is close to pure blue, both semantically and
in terms of the geometric distance between vectors. But a vector with a 1 at index
1247 for "horse" is not any closer to a vector with a 1 at index 50,430 for "antelope"
than it is to a vector with a 1 at index 238 for "television".

The Solution: Embeddings


The solution to these problems is to use embeddings, which translate large sparse
vectors into a lower-dimensional space that preserves semantic relationships. We'll
explore embeddings intuitively, conceptually, and programmatically in the following
sections of this module.

Embeddings: Obtaining Embeddings


Estimated Time: 10 minutes

There are a number of ways to get an embedding, including a state-of-the-art
algorithm created at Google.

Standard Dimensionality Reduction Techniques


There are many existing mathematical techniques for capturing the important
structure of a high-dimensional space in a low dimensional space. In theory, any of
these techniques could be used to create an embedding for a machine learning
system.

For example, principal component analysis (PCA) has been used to create word
embeddings. Given a set of instances like bag of words vectors, PCA tries to find
highly correlated dimensions that can be collapsed into a single dimension.

Word2vec
Word2vec is an algorithm invented at Google for training word embeddings.
Word2vec relies on the distributional hypothesis to map semantically similar words
to geometrically close embedding vectors.

The distributional hypothesis states that words which often have the same
neighboring words tend to be semantically similar. Both "dog" and "cat" frequently
appear close to the word "veterinarian", and this fact reflects their semantic
similarity. As the linguist John Firth put it in 1957, "You shall know a word by the
company it keeps".

Word2vec exploits contextual information like this by training a neural net to
distinguish actually co-occurring groups of words from randomly grouped words.
The input layer takes a sparse representation of a target word together with one or
more context words. This input connects to a single, smaller hidden layer.

In one version of the algorithm, the system makes a negative example by
substituting a random noise word for the target word. Given the positive example
"the plane flies", the system might swap in "jogging" to create the contrasting
negative example "the jogging flies".

The other version of the algorithm creates negative examples by pairing the true
target word with randomly chosen context words. So it might take the positive
examples (the, plane), (flies, plane) and the negative examples (compiled, plane),
(who, plane) and learn to identify which pairs actually appeared together in text.

The classifier is not the real goal for either version of the system, however. After the
model has been trained, you have an embedding. You can use the weights
connecting the input layer with the hidden layer to map sparse representations of
words to smaller vectors. This embedding can be reused in other classifiers.
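A toy sketch of the second variant's training-pair generation (plain Python; the window size and the sampling scheme here are simplifying assumptions, and a random "negative" could occasionally be a real context word):

# Sketch: skip-gram pairs with random negative examples.
import random

sentence = ["the", "plane", "flies"]
vocabulary = ["the", "plane", "flies", "compiled", "who", "jogging"]

pairs = []
for i, target in enumerate(sentence):
    for j in range(max(0, i - 1), min(len(sentence), i + 2)):
        if j == i:
            continue
        pairs.append(((sentence[j], target), 1))                # true pair
        pairs.append(((random.choice(vocabulary), target), 0))  # negative

print(pairs)  # e.g. (("the", "plane"), 1), (("compiled", "plane"), 0), ...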

For more information about word2vec, see the tutorial on tensorflow.org.

Training an Embedding as Part of a Larger Model


You can also learn an embedding as part of the neural network for your target task.
This approach gets you an embedding well customized for your particular system,
but may take longer than training the embedding separately.
In general, when you have sparse data (or dense data that you'd like to embed), you
can create an embedding unit that is just a special type of hidden unit of size d. This
embedding layer can be combined with any other features and hidden layers. As in
any DNN, the final layer will be the loss that is being optimized. For example, let's say
we're performing collaborative filtering, where the goal is to predict a user's interests
from the interests of other users. We can model this as a supervised learning
problem by randomly setting aside (or holding out) a small number of the movies
that the user has watched as the positive labels, and then optimize a softmax loss.

Figure 5. A sample DNN architecture for learning movie embeddings from
collaborative filtering data.

As another example, if you want to create an embedding layer for the words in a real-
estate ad as part of a DNN to predict housing prices, then you'd optimize an L2 loss
using the known sale price of homes in your training data as the label.

When learning a d-dimensional embedding, each item is mapped to a point in a d-
dimensional space so that similar items are nearby in this space. Figure 6 helps
to illustrate the relationship between the weights learned in the embedding layer and
the geometric view. The edge weights between an input node and the nodes in the d-
dimensional embedding layer correspond to the coordinate values for each of
the d axes.
Figure 6. A geometric view of the embedding layer weights.
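A minimal Keras sketch of this setup (TensorFlow assumed; the catalog size and d are placeholders): the Embedding layer's weight matrix holds the d coordinates of each movie, which is exactly the geometric view described above.

# Sketch: a movie embedding learned inside a softmax model.
import tensorflow as tf

NUM_MOVIES = 1_000_000  # movie-ID vocabulary
D = 32                  # embedding dimensions

model = tf.keras.Sequential([
    # Each movie ID maps to a point in d-dimensional space.
    tf.keras.layers.Embedding(input_dim=NUM_MOVIES, output_dim=D),
    tf.keras.layers.GlobalAveragePooling1D(),  # average a user's movies
    tf.keras.layers.Dense(NUM_MOVIES, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")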

Production ML Systems

System-Level Components

• No, you don't have to build everything yourself.
  • Re-use generic ML system components wherever possible.
  • Google CloudML solutions include Dataflow and TF Serving.
  • Components can also be found in other platforms like Spark, Hadoop, etc.
• How do you know what you need?
  • Understand a few ML system paradigms and their requirements.

Static vs. Dynamic Training


Broadly speaking, there are two ways to train a model:

• A static model is trained offline. That is, we train the model exactly once and then use
that trained model for a while.

• A dynamic model is trained online. That is, data is continually entering the system
and we're incorporating that data into the model through continuous updates.

Broadly speaking, the following points dominate the static vs. dynamic training
decision:

• Static models are easier to build and test.

• Dynamic models adapt to changing data. The world is a highly changeable place.
Sales predictions built from last year's data are unlikely to successfully predict next
year's results.

If your data set truly isn't changing over time, choose static training because it is
cheaper to create and maintain than dynamic training. However, many information
sources really do change over time, even those with features that you think are as
constant as, say, sea level. The moral: even with static training, you must still
monitor your input data for change.

For example, consider a model trained to predict the probability that users will buy
flowers. Because of time pressure, the model is trained only once using a dataset of
flower buying behavior during July and August. The model is then shipped off to
serve predictions in production, but is never updated. The model works fine for
several months, but then makes terrible predictions around Valentine's Day because
user behavior during that holiday period changes dramatically.
Data Dependencies

Reliability

Some questions to ask about the reliability of your input data:

• Is the signal always going to be available or is it coming from an unreliable
source? For example:
  • Is the signal coming from a server that crashes under heavy load?
  • Is the signal coming from humans that go on vacation every August?

Versioning

Some questions to ask about versioning:

• Does the system that computes this data ever change? If so:
• How often?
• How will you know when that system changes?

Sometimes, data comes from an upstream process. If that process changes
abruptly, your model can suffer.

Consider creating your own copy of the data you receive from the upstream process.
Then, only advance to the next version of the upstream data when you are certain
that it is safe to do so.

Necessity

The following question might remind you of regularization:

• Does the usefulness of the feature justify the cost of including it?

It is always tempting to add more features to the model. For example, suppose you
find a new feature whose addition makes your model slightly more accurate. More
accuracy certainly sounds better than less accuracy. However, now you've just added
to your maintenance burden. That additional feature could degrade unexpectedly, so
you've got to monitor it. Think carefully before adding features that lead to minor
short-term wins.

Correlations

Some features correlate (positively or negatively) with other features. Ask yourself
the following question:
• Are any features so tied together that you need additional strategies to tease
them apart?

Feedback Loops

Sometimes a model can affect its own training data. For example, the results from
some models, in turn, are directly or indirectly input features to that same model.

Sometimes a model can affect another model. For example, consider two models for
predicting stock prices:

• Model A, which is a bad predictive model.

• Model B.

Since Model A is buggy, it mistakenly decides to buy stock in Stock X. Those
purchases drive up the price of Stock X. Model B uses the price of Stock X as an
input feature, so Model B can easily come to some false conclusions about the value
of Stock X. Model B could, therefore, buy or sell shares of Stock X based on
the buggy behavior of Model A. Model B's behavior, in turn, can affect Model A,
possibly triggering a tulip mania or a slide in Company X's stock.

Fairness
Fairness: Types of Bias
Estimated Time: 5 minutes

Machine learning models are not inherently objective. Engineers train models by
feeding them a data set of training examples, and human involvement in the
provision and curation of this data can make a model's predictions susceptible to
bias.

When building models, it's important to be aware of common human biases that can
manifest in your data, so you can take proactive steps to mitigate their effects.

WARNING: The following inventory of biases provides just a small selection of biases that
are often uncovered in machine learning data sets; this list is not intended to be exhaustive.
Wikipedia's catalog of cognitive biases enumerates over 100 different types of human bias
that can affect our judgment. When auditing your data, you should be on the lookout for any
and all potential sources of bias that might skew your model's predictions.

Reporting Bias
Reporting bias occurs when the frequency of events, properties, and/or outcomes
captured in a data set does not accurately reflect their real-world frequency. This
bias can arise because people tend to focus on documenting circumstances that are
unusual or especially memorable, assuming that the ordinary can "go without
saying."

EXAMPLE: A sentiment-analysis model is trained to predict whether book reviews are
positive or negative based on a corpus of user submissions to a popular website. The
majority of reviews in the training data set reflect extreme opinions (reviewers who either
loved or hated a book), because people were less likely to submit a review of a book if they
did not respond to it strongly. As a result, the model is less able to correctly predict
sentiment of reviews that use more subtle language to describe a book.

Automation Bias
Automation bias is a tendency to favor results generated by automated systems
over those generated by non-automated systems, irrespective of the error rates of
each.

EXAMPLE: Software engineers working for a sprocket manufacturer were eager to deploy
the new "groundbreaking" model they trained to identify tooth defects, until the factory
supervisor pointed out that the model's precision and recall rates were both 15% lower than
those of human inspectors.

Selection Bias
Selection bias occurs if a data set's examples are chosen in a way that is not
reflective of their real-world distribution. Selection bias can take many different
forms:

• Coverage bias: Data is not selected in a representative fashion.

EXAMPLE: A model is trained to predict future sales of a new product based on phone
surveys conducted with a sample of consumers who bought the product. Consumers who
instead opted to buy a competing product were not surveyed, and as a result, this group of
people was not represented in the training data.

• Non-response bias (or participation bias): Data ends up being unrepresentative due
to participation gaps in the data-collection process.

EXAMPLE: A model is trained to predict future sales of a new product based on phone
surveys conducted with a sample of consumers who bought the product and with a sample
of consumers who bought a competing product. Consumers who bought the competing
product were 80% more likely to refuse to complete the survey, and their data was
underrepresented in the sample.

• Sampling bias: Proper randomization is not used during data collection.

EXAMPLE: A model is trained to predict future sales of a new product based on phone
surveys conducted with a sample of consumers who bought the product and with a sample
of consumers who bought a competing product. Instead of randomly targeting consumers,
the surveyor chose the first 200 consumers that responded to an email, who might have
been more enthusiastic about the product than average purchasers.

Group Attribution Bias


Group attribution bias is a tendency to generalize what is true of individuals to an
entire group to which they belong. Two key manifestations of this bias are:

• In-group bias: A preference for members of a group to which you also belong, or for
characteristics that you also share.

EXAMPLE: Two engineers training a résumé-screening model for software developers are
predisposed to believe that applicants who attended the same computer-science academy
as they both did are more qualified for the role.

• Out-group homogeneity bias: A tendency to stereotype individual members of a
group to which you do not belong, or to see their characteristics as more uniform.

EXAMPLE: Two engineers training a résumé-screening model for software developers are
predisposed to believe that all applicants who did not attend a computer-science academy
do not have sufficient expertise for the role.

Implicit Bias
Implicit bias occurs when assumptions are made based on one's own mental
models and personal experiences that do not necessarily apply more generally.

EXAMPLE: An engineer training a gesture-recognition model uses a head shake as a feature
to indicate a person is communicating the word "no." However, in some regions of the world,
a head shake actually signifies "yes."

A common form of implicit bias is confirmation bias, where model builders
unconsciously process data in ways that affirm preexisting beliefs and hypotheses.
In some cases, a model builder may actually keep training a model until it produces a
result that aligns with their original hypothesis; this is called experimenter's bias.

EXAMPLE: An engineer is building a model that predicts aggressiveness in dogs based on a
variety of features (height, weight, breed, environment). The engineer had an unpleasant
encounter with a hyperactive toy poodle as a child, and ever since has associated the breed
with aggression. When the trained model predicted most toy poodles to be relatively docile,
the engineer retrained the model several more times until it produced a result showing
smaller poodles to be more violent.
