ML Notes
ML is the process of training a piece of software, called a model, to make useful predictions or
generate content from data.
Types of ML Systems
ML systems fall into one or more of the following categories based on how they
learn to make predictions or generate content:
• Supervised learning
• Unsupervised learning
• Reinforcement learning
• Generative AI
Supervised learning
Supervised learning models can make predictions after seeing lots of data with the
correct answers and then discovering the connections between the elements in the
data that produce the correct answers. This is like a student learning new material by
studying old exams that contain both questions and answers. Once the student has
trained on enough old exams, the student is well prepared to take a new exam.
These ML systems are "supervised" in the sense that a human gives the ML system
data with the known correct results.
Two of the most common use cases for supervised learning are regression and
classification.
Regression
A regression model predicts a numeric value. For example, a weather model that
predicts the amount of rain, in inches or millimeters, is a regression model.
Scenario: Future house price
Possible input data: Square footage, zip code, number of bedrooms and bathrooms, lot size, mortgage interest rate, property tax rate, construction costs, and number of homes for sale in the area.
Numeric prediction: The price of the home.
Scenario: Future ride time
Possible input data: Historical traffic conditions (gathered from smartphones, traffic sensors, ride-hailing and other navigation applications), distance from destination, and weather conditions.
Numeric prediction: The time in minutes and seconds to arrive at a destination.
Classification
Classification models are divided into two groups: binary classification and
multiclass classification. Binary classification models output a value from a class
that contains only two values, for example, a model that outputs either rain or no rain.
Multiclass classification models output a value from a class that contains more than
two values, for example, a model that can output either rain, hail, snow, or sleet.
Unsupervised learning
Unsupervised learning models make predictions by being given data that does not
contain any correct answers. An unsupervised learning model's goal is to identify
meaningful patterns among the data. In other words, the model has no hints on how
to categorize each piece of data, but instead it must infer its own rules.
A commonly used unsupervised learning technique is clustering, in which the model finds data points that demarcate natural groupings. Clustering differs from classification because the categories aren't defined by you. For example, an unsupervised model might cluster a weather dataset based on temperature, revealing segmentations that define the seasons. You might then attempt to name those clusters based on your understanding of the dataset.
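A minimal clustering sketch, assuming scikit-learn is available; the weather readings and the choice of two clusters are made up for illustration:

import numpy as np
from sklearn.cluster import KMeans

# Hypothetical daily weather readings: [temperature_c, humidity_pct].
readings = np.array([
    [30.0, 40.0], [31.5, 38.0], [29.0, 42.0],   # hot, dry days
    [10.0, 80.0], [ 8.5, 85.0], [11.0, 78.0],   # cold, humid days
])

# Ask for two clusters; the model infers the groupings itself -- no labels given.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(readings)
print(kmeans.labels_)           # e.g. [1 1 1 0 0 0]
print(kmeans.cluster_centers_)  # the "season-like" group centers you might then name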
What distinguishes a supervised approach from an unsupervised approach?
A supervised approach is given data that contains the correct answer. The model's job is to find
connections in the data that produce the correct answer. An unsupervised approach is given
data without the correct answer. Its job is to find groupings in the data.
Reinforcement learning
Reinforcement learning models make predictions by getting rewards or penalties
based on actions performed within an environment. A reinforcement learning system
generates a policy that defines the best strategy for getting the most rewards.
Reinforcement learning is used to train robots to perform tasks, like walking around
a room, and software programs like AlphaGo to play the game of Go.
Generative AI
Generative AI is a class of models that creates content from user input. For example,
generative AI can create unique images, music compositions, and jokes; it can
summarize articles, explain how to perform a task, or edit a photo.
Model: Text-to-text
Input: Who invented the Le Mans race?
Example output (Source: Bard): The 24 Hours of Le Mans was founded by the Automobile Club de l'Ouest (ACO), a motoring organization founded in 1906. The first race was held on 26 and 27 May 1923 on the Sarthe Circuit, a 13.426-kilometre (8.345-mile) road circuit located near the town of Le Mans in France. The race was originally conceived as a way to promote the ACO and the Sarthe region.

Model: Text-to-image
Input and output: (image example not reproduced here; Source: Imagen)

Model: Text-to-code
Input: Write a Python loop that loops over a list of numbers and prints the prime numbers.
Example output (Source: Bard):
for number in numbers:
    # Check if the number is prime.
    is_prime = True
    for i in range(2, number):
        if number % i == 0:
            is_prime = False
            break
    # If the number is prime, print it.
    if is_prime:
        print(number)
Generative AI can take a variety of inputs and create a variety of outputs, like text,
images, audio, and video. It can also take and create combinations of these. For
example, a model can take an image as input and create an image and text as
output, or take an image and text as input and create a video as output.
We can discuss generative models by their inputs and outputs, typically written as
"type of input"-to-"type of output." For example, the following is a partial list of some
inputs and outputs for generative models:
• Text-to-text
• Text-to-image
• Text-to-video
• Text-to-code
• Text-to-speech
• Image and text-to-image
The table above lists examples of generative models, their input, and an example of their possible output.
To produce unique and creative outputs, generative models are initially trained using
an unsupervised approach, where the model learns to mimic the data it's trained on.
The model is sometimes trained further using supervised or reinforcement learning
on specific data related to tasks the model might be asked to perform, for example,
summarize an article or edit a photo.
Generative AI is a quickly evolving technology with new use cases constantly being
discovered. For example, generative models are helping businesses refine their
ecommerce product images by automatically removing distracting backgrounds or
improving the quality of low-resolution images.
Supervised learning is built on the following core concepts:
• Data
• Model
• Training
• Evaluating
• Inference
Data
Data is the driving force of ML. Data comes in the form of words and numbers
stored in tables, or as the values of pixels and waveforms captured in images and
audio files. We store related data in datasets. For example, we might have a dataset
of the following:
• Images of cats
• Housing prices
• Weather information
Datasets are made up of individual examples that contain features and a label. You
could think of an example as analogous to a single row in a spreadsheet. Features
are the values that a supervised model uses to predict the label. The label is the
"answer," or the value we want the model to predict. In a weather model that predicts
rainfall, the features could be latitude, longitude, temperature, humidity, cloud
coverage, wind direction, and atmospheric pressure. The label would be rainfall
amount.
Examples that contain both features and a label are called labeled examples.
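As a concrete sketch (with made-up values), a single labeled example for that rainfall model could be represented like this:

# One labeled example: the features the model reads, plus the label it should predict.
example = {
    "features": {
        "latitude": 41.25,
        "longitude": -120.9,
        "temperature_c": 17.0,
        "humidity_pct": 62.0,
        "cloud_coverage": 0.45,
        "pressure_hpa": 1012.0,
    },
    "label": {"rainfall_mm": 0.8},
}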
Dataset characteristics
A dataset is characterized by its size and diversity. Size indicates the number of
examples. Diversity indicates the range those examples cover. Good datasets are
both large and highly diverse.
Some datasets are both large and diverse. However, some datasets are large but
have low diversity, and some are small but highly diverse. In other words, a large
dataset doesn’t guarantee sufficient diversity, and a dataset that is highly diverse
doesn't guarantee sufficient examples.
For instance, a dataset might contain 100 years' worth of data, but only for the month
of July. Using this dataset to predict rainfall in January would produce poor
predictions. Conversely, a dataset might cover only a few years but contain every
month. This dataset might produce poor predictions because it doesn't contain
enough years to account for variability.
A dataset can also be characterized by the number of its features. For example,
some weather datasets might contain hundreds of features, ranging from satellite
imagery to cloud coverage values. Other datasets might contain only three or four
features, like humidity, atmospheric pressure, and temperature. Datasets with more
features can help a model discover additional patterns and make better predictions.
However, datasets with more features don't always produce models that make better
predictions because some features might have no causal relationship to the label.
Model
In supervised learning, a model is the complex collection of numbers that define the
mathematical relationship from specific input feature patterns to specific output
label values. The model discovers these patterns through training.
Training
Before a supervised model can make useful predictions, it trains on labeled examples:
1. The model takes in a labeled example and provides a prediction.
2. The model compares its predicted value with the actual value and updates its solution.
Figure 2. An ML model updating its predicted value.
3. The model repeats this process for each labeled example in the dataset.
For example, if the model predicted 1.15 inches of rain, but the actual value was .75 inches, the model modifies its solution so its prediction is closer to .75 inches. After the model has looked at each example in the dataset (in some cases, multiple times), it arrives at a solution that makes the best predictions, on average, for each of the examples.
In this way, the model gradually learns the correct relationship between the features
and the label. This gradual understanding is also why large and diverse datasets
produce a better model. The model has seen more data with a wider range of values
and has refined its understanding of the relationship between the features and the
label.
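A minimal sketch of this training loop for a one-feature linear model, using plain gradient descent on a few made-up (feature, label) pairs; the data and learning rate are illustrative only:

# Tiny training loop: predict, compare to the label, nudge the weight and bias.
features = [1.0, 2.0, 3.0, 4.0]   # e.g. cloud coverage
labels   = [0.9, 2.1, 2.9, 4.2]   # e.g. rainfall

w, b, learning_rate = 0.0, 0.0, 0.05
for epoch in range(200):
    for x, y in zip(features, labels):
        prediction = w * x + b
        error = prediction - y          # how far off the model was
        w -= learning_rate * error * x  # update toward a smaller squared error
        b -= learning_rate * error
print(round(w, 2), round(b, 2))         # ends up close to 1.0 and 0.0 for this data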
Evaluating
We evaluate a trained model to determine how well it learned by comparing the model's predictions to the labels' actual values.
Check Your Understanding
The following questions help you solidify your understanding of core ML concepts.
Predictive power
Supervised ML models are trained using datasets with labeled examples. The model
learns how to predict the label from the features. However, not every feature in a
dataset has predictive power. In some instances, only a few features act as
predictors of the label. In a dataset of cars with columns such as make_model, year, miles, color, height, tire_size, wheel_base, and gearbox, use price as the label and the remaining columns as the features.
Which three features do you think are likely the greatest predictors for a car's price?
Tire_size, wheel_base, year.
Make_model, year, miles.
A car's make/model, year, and miles are likely to be among the strongest predictors for its price.
Correct answer.
Color, height, make_model.
Miles, gearbox, make_model.
Suppose you had a dataset of users for an online shopping website that contained columns describing each user's behavior on the site.
If you wanted to understand the types of users that visit the site, would you use
supervised or unsupervised learning?
Supervised learning because I'm trying to predict which class a user belongs to.
Unsupervised learning.
Because we want the model to cluster groups of related customers, we'd use unsupervised
learning. After the model clustered the users, we'd create our own names for each cluster, for
example, "discount seekers," "deal hunters," "surfers," "loyal," and "wanderers."
Correct answer.
Suppose you had an energy usage dataset for homes with the following columns: square footage, location, year built, and kilowatt hours used per year.
What type of ML would you use to predict the kilowatt hours used per year for a
newly constructed house?
Unsupervised learning.
Unsupervised learning uses unlabeled examples. In this example, "kilowatt hours used per year" would be the label because this is the value you want the model to predict.
Try again.
Supervised learning.
Supervised learning trains on labeled examples. In this dataset, "kilowatt hours used per year" would be the label because this is the value you want the model to predict. The features would be "square footage," "location," and "year built."
Correct answer.
If you wanted to predict the cost of a coach ticket, would you use regression or
classification?
Regression
A regression model's output is a numeric value.
Correct answer.
Classification
Based on the dataset, could you train a classification model to classify the cost of a
coach ticket as "high," "average," or "low"?
No. It's not possible to create a classification model. The coach_ticket_cost values are numeric, not categorical.
Yes, but we'd first need to convert the numeric values in the coach_ticket_cost column
to categorical values.
It's possible to create a classification model from the dataset. You would do something like the
following:
1. Find the average cost of a ticket from the departure airport to the destination airport.
2. Determine the thresholds that would constitute "high," "average," and "low".
3. Compare the predicted cost to the thresholds and output the category the value falls
within.
Correct answer.
No. Classification models only predict two categories, like spam or not_spam. This
model would need to predict three categories.
If the model's predictions are far off, what might you do to make them better?
Retrain the model, but use only the features you believe have the strongest predictive
power for the label.
Retraining the model with fewer features, but that have more predictive power, can produce a
model that makes better predictions.
2 of 2 correct answers.
Try a different training approach. For example, if you used a supervised approach, try
an unsupervised approach.
Retrain the model using a larger and more diverse dataset.
Models trained on datasets with more examples and a wider range of values can produce better
predictions because the model has a better generalized solution for the relationship between the
features and the label.
1 of 2 correct answers.
You can't fix a model whose predictions are far off.
Labels
A label is the thing we're predicting—the y variable in simple linear regression. The
label could be the future price of wheat, the kind of animal shown in a picture, the
meaning of an audio clip, or just about anything.
Features
A feature is an input variable—the x variable in simple linear regression. A simple
machine learning project might use a single feature, while a more sophisticated
machine learning project could use millions of features, specified as:
x1, x2, ..., xN
In the spam detector example, the features could include the words in the email text, the sender's address, and the time of day the email was sent.
Examples come in two kinds:
• labeled examples
• unlabeled examples
A labeled example includes both feature(s) and the label. That is:
labeled example: {features, label}: (x, y)
Use labeled examples to train the model. In our spam detector example, the labeled
examples would be individual emails that users have explicitly marked as "spam" or
"not spam."
For example, a dataset about housing prices in California contains labeled examples in which the label is medianHouseValue.
An unlabeled example contains features but not the label. That is:
unlabeled example: {features, ?}: (x, ?)
Unlabeled examples from the same housing dataset exclude medianHouseValue.
Once we've trained our model with labeled examples, we use that model to predict
the label on unlabeled examples. In the spam detector, unlabeled examples are new
emails that humans haven't yet labeled.
Models
A model defines the relationship between features and label. For example, a spam
detection model might associate certain features strongly with "spam". Let's
highlight two phases of a model's life:
• Training means creating or learning the model. That is, you show the model
labeled examples and enable the model to gradually learn the relationships
between features and label.
• Inference means applying the trained model to unlabeled examples. That is,
you use the trained model to make useful predictions (y'). For example, during
inference, you can predict medianHouseValue for new unlabeled examples.
• Could compute gradient over entire data set on each step, but this
turns out to be unnecessary
• Computing gradient on small data samples works well
• On every step, get a new random sample
• Stochastic Gradient Descent: one example at a time
• Mini-Batch Gradient Descent: batches of 10-1000
• Loss & gradients are averaged over the batch
In gradient descent, a batch is the set of examples you use to calculate the gradient
in a single training iteration. So far, we've assumed that the batch has been the entire
data set. When working at Google scale, data sets often contain billions or even
hundreds of billions of examples. Furthermore, Google data sets often contain huge
numbers of features. Consequently, a batch can be enormous. A very large batch
may cause even a single iteration to take a very long time to compute.
A large data set with randomly sampled examples probably contains redundant data.
In fact, redundancy becomes more likely as the batch size grows. Some redundancy
can be useful to smooth out noisy gradients, but enormous batches tend not to carry
much more predictive value than large batches.
What if we could get the right gradient on average for much less computation? By
choosing examples at random from our data set, we could estimate (albeit, noisily) a
big average from a much smaller one. Stochastic gradient descent (SGD) takes this idea to the extreme: it uses only a single example (a batch size of 1) per iteration.
Given enough iterations, SGD works but is very noisy. The term "stochastic" indicates
that the one example comprising each batch is chosen at random.
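A sketch of mini-batch SGD with NumPy on synthetic data; the batch size, learning rate, and "true" weights are made up:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 3))                      # synthetic features
y = X @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=10_000)

w = np.zeros(3)
batch_size, learning_rate = 32, 0.1
for step in range(500):
    idx = rng.integers(0, len(X), size=batch_size)    # a new random sample on every step
    Xb, yb = X[idx], y[idx]
    grad = Xb.T @ (Xb @ w - yb) / batch_size          # gradient averaged over the batch
    w -= learning_rate * grad
print(w.round(2))                                     # approaches [ 2. -1.  0.5]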
When performing gradient descent on a large data set, which of the following batch
sizes will likely be more efficient?
A small batch or even a batch of one example (SGD).
Amazingly enough, performing gradient descent on a small batch or even a batch of one
example is usually more efficient than the full batch. After all, finding the gradient of
one example is far cheaper than finding the gradient of millions of examples. To ensure
a good representative sample, the algorithm scoops up another random small batch (or
batch of one) on every iteration.
If you are unfamiliar with NumPy or pandas, please begin by doing the following two
Colab exercises:
1. NumPy UltraQuick Tutorial Colab exercise, which provides all the NumPy
information you need for this course.
2. pandas UltraQuick Tutorial Colab exercise, which provides all the pandas
information you need for this course.
Batch size, iterations, and epochs (for a hypothetical dataset of 12 examples):
• If the batch size is 6, the system recalculates the model's loss value and adjusts the model's weights and bias after processing every 6 examples.
• One epoch spans sufficient iterations to process every example in the dataset. If the batch size is 12, each epoch lasts one iteration; if the batch size is 6, each epoch consumes two iterations.
• It is tempting to simply set the batch size to the number of examples in the dataset (12, in this case). However, the model might actually train faster on smaller batches. Conversely, very small batches might not contain enough information to help the model converge.
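The arithmetic behind these points, sketched for the hypothetical 12-example dataset:

import math

num_examples = 12                      # hypothetical dataset size
for batch_size in (12, 6, 1):
    iterations_per_epoch = math.ceil(num_examples / batch_size)
    print(batch_size, iterations_per_epoch)
# batch_size 12 -> 1 iteration per epoch
# batch_size  6 -> 2 iterations per epoch
# batch_size  1 -> 12 iterations per epoch (stochastic gradient descent)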
• Training loss should steadily decrease, steeply at first, and then more slowly
until the slope of the curve reaches or approaches zero.
• If the training loss does not converge, train for more epochs.
• If the training loss decreases too slowly, increase the learning rate. Note that
setting the learning rate too high may also prevent training loss from
converging.
• If the training loss varies wildly (that is, the training loss jumps around),
decrease the learning rate.
• Lowering the learning rate while increasing the number of epochs or the batch
size is often a good combination.
• Setting the batch size to a very small number can also cause instability. First, try large batch size values. Then, decrease the batch size until you see degradation.
• For real-world datasets consisting of a very large number of examples, the
entire dataset might not fit into memory. In such cases, you'll need to reduce
the batch size to enable a batch to fit into memory.
Generalization refers to your model's ability to adapt properly to new, previously unseen data,
drawn from the same distribution as the one used to create the model.
This module focuses on generalization. In order to develop some intuition about this
concept, you're going to look at three figures. Assume that each dot in these figures
represents a tree's position in a forest. The two colors have the following meanings:
• The blue dots represent sick trees.
• The orange dots represent healthy trees.
Can you imagine a good model for predicting subsequent sick or healthy trees? Take
a moment to mentally draw an arc that divides the blues from the oranges, or
mentally lasso a batch of oranges or blues. Then, look at Figure 2, which shows how
a certain machine learning model separated the sick trees from the healthy trees.
Note that this model produced a very low loss.
The model shown in Figures 2 and 3 overfits the peculiarities of the data it trained
on. An overfit model gets a low loss during training but does a poor job predicting
new data. If a model fits the current sample well, how can we trust that it will make
good predictions on new data? As you'll see later on, overfitting is caused by making
a model more complex than necessary. The fundamental tension of machine
learning is between fitting our data well, but also fitting the data as simply as
possible.
Machine learning's goal is to predict well on new data drawn from a (hidden) true
probability distribution. Unfortunately, the model can't see the whole truth; the model
can only sample from a training data set. If a model fits the current examples well,
how can you trust the model will also make good predictions on never-before-seen
examples?
William of Ockham, a 14th century friar and philosopher, loved simplicity. He believed that
scientists should prefer simpler formulas or theories over more complex ones. To put
Ockham's razor in machine learning terms:
The less complex an ML model, the more likely that a good empirical result is not
just due to the peculiarities of the sample.
In modern times, we've formalized Ockham's razor into the fields of statistical
learning theory and computational learning theory. These fields have
developed generalization bounds, a statistical description of a model's ability to generalize to new data based on factors such as the complexity of the model and its performance on training data.
A machine learning model aims to make good predictions on new, previously unseen
data. But if you are building a model from your data set, how would you get the
previously unseen data? Well, one way is to divide your data set into two subsets:
• training set: a subset used to train the model.
• test set: a subset used to test the model.
Good performance on the test set is a useful indicator of good performance on the
new data in general, assuming that:
• Examples are drawn independently and identically (i.i.d.) at random from the distribution.
• The distribution is stationary; that is, the distribution doesn't change within the data set.
• You always pull examples from the same distribution, and you don't cheat by using the same test set over and over.
• Consider a model that chooses ads to display. The i.i.d. assumption would be
violated if the model bases its choice of ads, in part, on what ads the user has
previously seen.
• Consider a data set that contains retail sales information for a year. User's purchases
change seasonally, which would violate stationarity.
When we know that any of the preceding three basic assumptions are violated, we
must pay careful attention to metrics.
Training and Test Sets: Splitting Data
The previous module introduced the idea of dividing your data set into two subsets: a training set and a test set.
Figure 1. Slicing a single data set into a training set and test set.
Make sure that your test set meets the following two conditions:
• Is large enough to yield statistically meaningful results.
• Is representative of the data set as a whole. In other words, don't pick a test set with different characteristics than the training set.
Assuming that your test set meets the preceding two conditions, your goal is to
create a model that generalizes well to new data. Our test set serves as a proxy for
new data. For example, consider the following figure. Notice that the model learned
for the training data is very simple. This model doesn't do a perfect job—a few
predictions are wrong. However, this model does about as well on the test data as it
does on the training data. In other words, this simple model does not overfit the
training data.
Never train on test data. If you are seeing surprisingly good results on your
evaluation metrics, it might be a sign that you are accidentally training on the test
set. For example, high accuracy might indicate that test data has leaked into the
training set.
For example, consider a model that predicts whether an email is spam, using the
subject line, email body, and sender's email address as features. We apportion the
data into training and test sets, with an 80-20 split. After training, the model achieves
99% precision on both the training set and the test set. We'd expect a lower precision
on the test set, so we take another look at the data and discover that many of the
examples in the test set are duplicates of examples in the training set (we neglected
to scrub duplicate entries for the same spam email from our input database before
splitting the data). We've inadvertently trained on some of our test data, and as a
result, we're no longer accurately measuring how well our model generalizes to new
data.
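A minimal sketch of an 80/20 split using NumPy; the data here is synthetic, and the comment notes the duplicate-scrubbing lesson from the example above:

import numpy as np

rng = np.random.default_rng(42)
num_examples = 1_000
X = rng.normal(size=(num_examples, 4))       # stand-in features
y = rng.integers(0, 2, size=num_examples)    # stand-in labels

# Scrub duplicates *before* splitting, or test examples may leak into training.
order = rng.permutation(num_examples)        # shuffle so both splits are representative
split = int(0.8 * num_examples)
train_idx, test_idx = order[:split], order[split:]
X_train, y_train = X[train_idx], y[train_idx]
X_test, y_test = X[test_idx], y[test_idx]
print(len(X_train), len(X_test))             # 800 200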
Validation Set
Partitioning a data set into a training set and test set lets you
judge whether a given model will generalize well to new data.
However, using only two partitions may be insufficient when
doing many rounds of hyperparameter tuning.
The previous module introduced partitioning a data set into a training set and a test
set. This partitioning enabled you to train on one set of examples and then to test the
model against a different set of examples. With two partitions, the workflow could
look as follows:
In the figure, "Tweak model" means adjusting anything about the model you can
dream up—from changing the learning rate, to adding or removing features, to
designing a completely new model from scratch. At the end of this workflow, you
pick the model that does best on the test set.
Dividing the data set into two sets is a good idea, but not a panacea. You can greatly reduce your chances of overfitting by partitioning the data set into three subsets instead: a training set, a validation set, and a test set.
Use the validation set to evaluate results from the training set. Then, use the test set
to double-check your evaluation after the model has "passed" the validation set. The
following figure shows this new workflow:
Figure 3. A better workflow.
This is a better workflow because it creates fewer exposures to the test set.
Consider the following generalization curve, which shows the loss for both the
training set and validation set against the number of training iterations.
This generalization curve shows a model in which training loss gradually decreases, but validation loss eventually goes up; in other words, the model is overfitting to the data in the training set. Channeling our inner Ockham,
perhaps we could prevent overfitting by penalizing complex models, a principle
called regularization.
Instead of simply aiming to minimize loss (empirical risk minimization):
minimize(Loss(Data|Model))
we now minimize loss plus complexity (structural risk minimization):
minimize(Loss(Data|Model) + complexity(Model))
Our training optimization algorithm is now a function of two terms: the loss term,
which measures how well the model fits the data, and the regularization term, which
measures model complexity.
Machine Learning Crash Course focuses on two common (and somewhat related)
ways to think of model complexity:
• Model complexity as a function of the weights of all the features in the model.
• Model complexity as a function of the total number of features with nonzero weights.
(A later module covers this approach.)
If model complexity is a function of weights, a feature weight with a high absolute
value is more complex than a feature weight with a low absolute value.
We can quantify complexity using the L2 regularization formula, which defines the regularization term as the sum of the squares of all the feature weights:
L2 regularization term = ||w||_2^2 = w_1^2 + w_2^2 + ... + w_n^2
In this formula, weights close to zero have little effect on model complexity, while
outlier weights can have a huge impact.
Model developers tune the overall impact of the regularization term by multiplying its
value by a scalar known as lambda (also called the regularization rate). That is,
model developers aim to do the following:
minimize(Loss(Data|Model) + λ · complexity(Model))
Increasing the lambda value strengthens the regularization effect. For example, the
histogram of weights for a high value of lambda might look as shown in Figure 2.
Lowering the value of lambda tends to yield a flatter histogram, as shown in Figure 3.
When choosing a lambda value, the goal is to strike the right balance between
simplicity and training-data fit:
• If your lambda value is too high, your model will be simple, but you run the risk
of underfitting your data. Your model won't learn enough about the training
data to make useful predictions.
• If your lambda value is too low, your model will be more complex, and you run
the risk of overfitting your data. Your model will learn too much about the
particularities of the training data, and won't be able to generalize to new
data.
Note: Setting lambda to zero removes regularization completely. In this case, training
focuses exclusively on minimizing loss, which poses the highest possible overfitting risk.
The ideal value of lambda produces a model that generalizes well to new, previously
unseen data. Unfortunately, that ideal value of lambda is data-dependent, so you'll
need to do some tuning.
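A small sketch of the regularized objective, assuming NumPy; the weights, data loss, and lambda values are invented for illustration:

import numpy as np

weights = np.array([0.2, -0.5, 5.0, 0.01])     # one outlier weight dominates the penalty
l2_term = np.sum(weights ** 2)                 # ||w||_2^2 = w1^2 + ... + wn^2

data_loss = 0.35                               # pretend loss measuring fit to the data
for lam in (0.0, 0.01, 0.1, 1.0):              # the regularization rate, lambda
    total = data_loss + lam * l2_term
    print(lam, round(total, 3))
# lambda = 0 ignores complexity (highest overfitting risk);
# larger lambda values push training toward smaller weights (risking underfitting).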
• "As is"
Let's consider how we might use the probability "as is." Suppose we create a logistic regression model to predict the probability that a dog will bark during the middle of the night. We'll call that probability:
p(bark | night)
If the logistic regression model predicts p(bark | night) = 0.05, then over a year, the dog's owners should be startled awake approximately 18 times:
startled = p(bark | night) · nights = 0.05 · 365 ≈ 18
In many cases, you'll map the logistic regression output into the solution to a binary
classification problem, in which the goal is to correctly predict one of two possible
labels (e.g., "spam" or "not spam"). A later module focuses on that.
You might be wondering how a logistic regression model can ensure output that
always falls between 0 and 1. As it happens, a sigmoid function, defined as follows,
produces output having those same characteristics:
y = 1 / (1 + e^(-z))
If 𝑧 represents the output of the linear layer of a model trained with logistic
regression, then 𝑠𝑖𝑔𝑚𝑜𝑖𝑑(𝑧) will yield a value (a probability) between 0 and 1. In
mathematical terms:
y' = 1 / (1 + e^(-z))
where:
• y' is the output of the logistic regression model for a particular example.
• z is the output of the linear layer: b + w1x1 + w2x2 + ... + wNxN.
The loss function for logistic regression is Log Loss, defined as follows:
Log Loss = Σ over (x, y) in D of: −y·log(y') − (1 − y)·log(1 − y')
where:
• (𝑥,𝑦)∈𝐷 is the data set containing many labeled examples, which are (𝑥,𝑦) pairs.
• 𝑦 is the label in a labeled example. Since this is logistic regression, every value
of 𝑦 must either be 0 or 1.
• 𝑦′ is the predicted value (somewhere between 0 and 1), given the set of features in 𝑥.
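A small NumPy sketch of the sigmoid and Log Loss formulas above, using made-up values for z and the labels:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))   # squashes the linear output into (0, 1)

z = np.array([2.0, -1.0, 0.3])        # outputs of the linear layer for three examples
y = np.array([1.0, 0.0, 1.0])         # true labels (each must be 0 or 1)
y_pred = sigmoid(z)

# Summed over the examples, matching the formula above.
log_loss = np.sum(-y * np.log(y_pred) - (1 - y) * np.log(1 - y_pred))
print(y_pred.round(3), round(log_loss, 3))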
Regularization is extremely important in logistic regression modeling. Two strategies are especially useful for dampening model complexity:
• L2 regularization.
• Early stopping, that is, limiting the number of training steps or the learning rate.
Imagine that you assign a unique id to each example, and map each id to its own
feature. If you don't specify a regularization function, the model will become
completely overfit. That's because the model would try to drive loss to zero on all
examples and never get there, driving the weights for each indicator feature to
+infinity or -infinity. This can happen in high dimensional data with feature crosses,
when there’s a huge mass of rare crosses that happen only on one example each.
Classification: Thresholding
Logistic regression returns a probability. You can use the returned probability "as is"
(for example, the probability that the user will click on this ad is 0.00023) or convert
the returned probability to a binary value (for example, this email is spam).
A logistic regression model that returns 0.9995 for a particular email message is
predicting that it is very likely to be spam. Conversely, another email message with a
prediction score of 0.0003 on that same logistic regression model is very likely not
spam. However, what about an email message with a prediction score of 0.6? In
order to map a logistic regression value to a binary category, you must define
a classification threshold (also called the decision threshold). A value above that
threshold indicates "spam"; a value below indicates "not spam." It is tempting to
assume that the classification threshold should always be 0.5, but thresholds are
problem-dependent, and are therefore values that you must tune.
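A tiny sketch of applying a classification threshold to the probabilities discussed above; the threshold value here is arbitrary, not a recommendation:

probabilities = [0.9995, 0.0003, 0.6]   # logistic regression outputs for three emails
threshold = 0.8                          # tuned per problem; 0.5 is not automatic

predictions = ["spam" if p >= threshold else "not spam" for p in probabilities]
print(predictions)                       # ['spam', 'not spam', 'not spam']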
The following sections take a closer look at metrics you can use to evaluate a
classification model's predictions, as well as the impact of changing the
classification threshold on these predictions.
Note: "Tuning" a threshold for logistic regression is different from tuning hyperparameters
such as learning rate. Part of choosing a threshold is assessing how much you'll suffer for
making a mistake. For example, mistakenly labeling a non-spam message as spam is very
bad. However, mistakenly labeling a spam message as non-spam is unpleasant, but hardly
the end of your job.
In this section, we'll define the primary building blocks of the metrics we'll use to
evaluate classification models. But first, a fable:
A shepherd boy gets bored tending the town's flock. To have some fun, he cries out,
"Wolf!" even though no wolf is in sight. The villagers run to protect the flock, but then
get really mad when they realize the boy was playing a joke on them.
One night, the shepherd boy sees a real wolf approaching the flock and calls out,
"Wolf!" The villagers refuse to be fooled again and stay in their houses. The hungry
wolf turns the flock into lamb chops. The town goes hungry. Panic ensues.
We can summarize our "wolf-prediction" model using a 2x2 confusion matrix that
depicts all four possible outcomes:
• True Positive (TP): Outcome: Shepherd is a hero.
• False Positive (FP): Outcome: Villagers are angry at the shepherd for waking them up.
• False Negative (FN): Outcome: The wolf ate all the sheep.
• True Negative (TN): Outcome: Everyone is fine.
A true positive is an outcome where the model correctly predicts the positive class. Similarly, a true negative is an outcome where the model correctly predicts the negative class. A false positive is an outcome where the model incorrectly predicts the positive class, and a false negative is an outcome where the model incorrectly predicts the negative class.
In the following sections, we'll look at how to evaluate classification models using
metrics derived from these four outcomes.
Classification: Accuracy
Accuracy is the fraction of predictions the model got right. For binary classification, accuracy can also be calculated in terms of positives and negatives as follows:
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Let's try calculating accuracy for the following model that classified 100 tumors
as malignant (the positive class) or benign (the negative class):
True Positives (TP): 1    False Positives (FP): 1
False Negatives (FN): 8    True Negatives (TN): 90
Accuracy comes out to 0.91, or 91% (91 correct predictions out of 100 total
examples). That means our tumor classifier is doing a great job of identifying
malignancies, right?
Actually, let's do a closer analysis of positives and negatives to gain more insight
into our model's performance.
Of the 100 tumor examples, 91 are benign (90 TNs and 1 FP) and 9 are malignant (1
TP and 8 FNs).
Of the 91 benign tumors, the model correctly identifies 90 as benign. That's good.
However, of the 9 malignant tumors, the model only correctly identifies 1 as
malignant—a terrible outcome, as 8 out of 9 malignancies go undiagnosed!
While 91% accuracy may seem good at first glance, another tumor-classifier model
that always predicts benign would achieve the exact same accuracy (91/100 correct
predictions) on our examples. In other words, our model is no better than one that
has zero predictive ability to distinguish malignant tumors from benign tumors.
Accuracy alone doesn't tell the full story when you're working with a class-
imbalanced data set, like this one, where there is a significant disparity between the
number of positive and negative labels.
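The tumor example's numbers, worked through as a quick sketch:

TP, FP, FN, TN = 1, 1, 8, 90                 # counts from the tumor classifier above

accuracy = (TP + TN) / (TP + TN + FP + FN)
print(accuracy)                              # 0.91

# A model that always predicts "benign" gets the same accuracy on these examples:
always_benign_accuracy = (0 + 91) / 100      # TP = 0, TN = 91
print(always_benign_accuracy)                # 0.91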
In the next section, we'll look at two better metrics for evaluating class-imbalanced
problems: precision and recall.
Precision
Precision attempts to answer the following question: What proportion of positive identifications was actually correct?
Precision = TP / (TP + FP)
Let's calculate precision for our ML model from the previous section that analyzes tumors:
Precision = TP / (TP + FP) = 1 / (1 + 1) = 0.5
Our model has a precision of 0.5; in other words, when it predicts a tumor is malignant, it is correct 50% of the time.
Recall
Recall attempts to answer the following question: What proportion of actual positives was identified correctly?
Recall = TP / (TP + FN)
Note: A model that produces no false negatives has a recall of 1.0.
For the tumor model: Recall = TP / (TP + FN) = 1 / (1 + 8) = 0.11. Our model has a recall of 0.11; in other words, it correctly identifies 11% of all malignant tumors.
Let's calculate precision and recall based on the results shown in Figure 1:
Precision measures the percentage of emails flagged as spam that were correctly
classified—that is, the percentage of dots to the right of the threshold line that are
green in Figure 1:
Precision = TP / (TP + FP) = 8 / (8 + 2) = 0.8
Recall measures the percentage of actual spam emails that were correctly
classified—that is, the percentage of green dots that are to the right of the threshold
line in Figure 1:
Recall = TP / (TP + FN) = 8 / (8 + 3) = 0.73
The number of false positives decreases, but false negatives increase. As a result,
precision increases, while recall decreases:
False positives increase, and false negatives decrease. As a result, this time,
precision decreases and recall increases:
True Positives (TP): 9 False Positives (FP): 3
False Negatives (FN): 2 True Negatives (TN): 16
Precision = TP / (TP + FP) = 9 / (9 + 3) = 0.75
Recall = TP / (TP + FN) = 9 / (9 + 2) = 0.82
Various metrics have been developed that rely on both precision and recall. For
example, see F1 score.
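A sketch computing precision, recall, and the (standard) F1 score from the confusion-matrix counts quoted just above:

TP, FP, FN, TN = 9, 3, 2, 16

precision = TP / (TP + FP)                            # 0.75
recall = TP / (TP + FN)                               # ~0.82
f1 = 2 * precision * recall / (precision + recall)    # harmonic mean of the two
print(round(precision, 2), round(recall, 2), round(f1, 2))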
ROC curve
An ROC curve (receiver operating characteristic curve) is a graph showing the
performance of a classification model at all classification thresholds. This curve
plots two parameters:
True Positive Rate (TPR) is a synonym for recall and is therefore defined as follows:
TPR = TP / (TP + FN)
False Positive Rate (FPR) is defined as follows:
FPR = FP / (FP + TN)
An ROC curve plots TPR vs. FPR at different classification thresholds. Lowering the
classification threshold classifies more items as positive, thus increasing both False
Positives and True Positives. The following figure shows a typical ROC curve.
AUC (Area Under the ROC Curve) represents the probability that a random positive (green) example is positioned to the right of a random negative (red) example.
AUC ranges in value from 0 to 1. A model whose predictions are 100% wrong has an
AUC of 0.0; one whose predictions are 100% correct has an AUC of 1.0.
• AUC is scale-invariant. It measures how well predictions are ranked, rather than their absolute values.
• AUC is classification-threshold-invariant. It measures the quality of the model's predictions irrespective of what classification threshold is chosen.
However, both of these properties come with caveats that may limit the usefulness of AUC in certain use cases.
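A sketch of the ranking interpretation of AUC described above, estimated by brute force over all (positive, negative) pairs; the scores and labels are made up:

import numpy as np

scores = np.array([0.9, 0.8, 0.7, 0.4, 0.35, 0.2])   # model outputs
labels = np.array([1,   1,   0,   1,   0,    0])     # true classes

pos = scores[labels == 1]
neg = scores[labels == 0]
# Fraction of (positive, negative) pairs ranked correctly; ties count as half.
pairs = [(p > n) + 0.5 * (p == n) for p in pos for n in neg]
auc = sum(pairs) / len(pairs)
print(auc)                                            # 8/9 for these values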
Prediction Bias
Logistic regression predictions should be unbiased: the average of the predictions should be approximately equal to the average of the observations. Prediction bias is a quantity that measures how far apart those two averages are. That is:
prediction bias = average of predictions − average of labels in the data set
A significant nonzero prediction bias tells you there is a bug somewhere in your
model, as it indicates that the model is wrong about how frequently positive labels
occur.
For example, let's say we know that on average, 1% of all emails are spam. If we
don't know anything at all about a given email, we should predict that it's 1% likely to
be spam. Similarly, a good spam model should predict on average that emails are 1%
likely to be spam. (In other words, if we average the predicted likelihoods of each
individual email being spam, the result should be 1%.) If instead, the model's average
prediction is 20% likelihood of being spam, we can conclude that it exhibits
prediction bias.
Possible root causes of prediction bias include an incomplete feature set, a noisy data set, a buggy pipeline, a biased training sample, or overly strong regularization.
You might be tempted to correct prediction bias by post-processing the model's output with a calibration layer, but that means you've built a more brittle system that you must now keep up to date.
If possible, avoid calibration layers. Projects that use calibration layers tend to become reliant on them, using calibration layers to fix all their model's sins. Ultimately, maintaining the calibration layers can become a nightmare.
Note: A good model will usually have near-zero bias. That said, a low prediction bias does
not prove that your model is good. A really terrible model could have a zero prediction bias.
For example, a model that just predicts the mean value for all examples would be a bad
model, despite having zero bias.
Bucketing and Prediction Bias
Logistic regression predicts a value between 0 and 1. However, all labeled examples
are either exactly 0 (meaning, for example, "not spam") or exactly 1 (meaning, for
example, "spam"). Therefore, when examining prediction bias, you cannot accurately
determine the prediction bias based on only one example; you must examine the
prediction bias on a "bucket" of examples. That is, prediction bias for logistic
regression only makes sense when grouping enough examples together to be able to
compare a predicted value (for example, 0.392) to observed values (for example,
0.394).
You can form buckets in a few ways, for example by linearly breaking up the target predictions or by forming quantiles.
Consider the following calibration plot from a particular model. Each dot represents
a bucket of 1,000 values. The axes have the following meanings:
• The x-axis represents the average of values the model predicted for that bucket.
• The y-axis represents the actual average of values in the data set for that bucket.
Why are the predictions so poor for only part of the model? Here are a few
possibilities:
• The training set doesn't adequately represent certain subsets of the data space.
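A sketch of prediction bias computed over one bucket of examples, assuming NumPy; the predictions and labels are synthetic:

import numpy as np

rng = np.random.default_rng(1)
predicted = rng.uniform(0.30, 0.45, size=1_000)        # model's predicted probabilities for the bucket
observed = rng.binomial(1, 0.39, size=1_000)           # 0/1 labels for the same bucket

prediction_bias = predicted.mean() - observed.mean()   # average prediction - average label
print(round(predicted.mean(), 3), round(observed.mean(), 3), round(prediction_bias, 3))
# A value near zero suggests the bucket is well calibrated; a large gap signals a problem.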
If you recall from the Feature Crosses unit, the following classification problem is
nonlinear:
Figure 1. Nonlinear classification problem.
"Nonlinear" means that you can't accurately predict a label with a model of the
form 𝑏+𝑤1𝑥1+𝑤2𝑥2 In other words, the "decision surface" is not a line. Previously,
we looked at feature crosses as one possible approach to modeling nonlinear
problems.
The data set shown in Figure 2 can't be solved with a linear model.
To see how neural networks might help with nonlinear problems, let's start by
representing a linear model as a graph:
Figure 3. Linear model as graph.
Each blue circle represents an input feature, and the green circle represents the
weighted sum of the inputs.
How can we alter this model to improve its ability to deal with nonlinear problems?
Hidden Layers
In the model represented by the following graph, we've added a "hidden layer" of
intermediary values. Each yellow node in the hidden layer is a weighted sum of the
blue input node values. The output is a weighted sum of the yellow nodes.
Is this model linear? Yes—its output is still a linear combination of its inputs.
In the model represented by the following graph, we've added a second hidden layer
of weighted sums.
Is this model still linear? Yes, it is. When you express the output as a function of the
input and simplify, you get just another weighted sum of the inputs. This sum won't
effectively model the nonlinear problem in Figure 2.
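A quick NumPy sketch of why stacking purely linear layers doesn't help: the two weight matrices compose into one, so the "deeper" model computes exactly the same function as a single linear layer. The sizes and values are arbitrary.

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=3)            # input features
W1 = rng.normal(size=(4, 3))      # first "hidden layer" weights
W2 = rng.normal(size=(2, 4))      # output layer weights

two_layer = W2 @ (W1 @ x)         # hidden layer is just a weighted sum
one_layer = (W2 @ W1) @ x         # ...so it collapses to a single weighted sum
print(np.allclose(two_layer, one_layer))   # True: still a linear model

# Inserting a nonlinearity (e.g. np.maximum(0, W1 @ x)) between the layers
# breaks this equivalence, which is what the next section introduces.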
Activation Functions
To model a nonlinear problem, we can directly introduce a nonlinearity. We can pipe
each hidden layer node through a nonlinear function.
In the model represented by the following graph, the value of each node in Hidden
Layer 1 is transformed by a nonlinear function before being passed on to the
weighted sums of the next layer. This nonlinear function is called the activation
function.
Now that we've added an activation function, adding layers has more impact.
Stacking nonlinearities on nonlinearities lets us model very complicated
relationships between the inputs and the predicted outputs. In brief, each layer is
effectively learning a more complex, higher-level function over the raw inputs. If
you'd like to develop more intuition on how this works, see Chris Olah's excellent
blog post.
The following sigmoid activation function converts the weighted sum to a value between 0 and 1:
F(x) = 1 / (1 + e^(-x))
Here's a plot:
The following rectified linear unit activation function (or ReLU, for short) often works a little better than a smooth function like the sigmoid, while also being significantly easier to compute:
F(x) = max(0, x)
TensorFlow provides out-of-the-box support for many activation functions. You can
find these activation functions within TensorFlow's list of wrappers for primitive
neural network operations. That said, we still recommend starting with ReLU.
Summary
Now our model has all the standard components of what people usually mean when
they say "neural network":
• A set of nodes, analogous to neurons, organized in layers.
• A set of weights representing the connections between each neural network layer and the layer beneath it. The layer beneath may be another neural network layer, or some other kind of layer.
• A set of biases, one for each node.
• An activation function that transforms the output of each node in a layer. Different layers may have different activation functions.
Training Neural Networks
This section explains backpropagation's failure cases and the most common way to
regularize a neural network.
Failure Cases
There are a number of common ways for backpropagation to go wrong.
Vanishing Gradients
The gradients for the lower layers (closer to the input) can become very small. In
deep networks, computing these gradients can involve taking the product of many
small terms.
When the gradients vanish toward 0 for the lower layers, these layers train very
slowly, or not at all.
Exploding Gradients
If the weights in a network are very large, then the gradients for the lower layers
involve products of many large terms. In this case you can have exploding gradients:
gradients that get too large to converge.
Batch normalization can help prevent exploding gradients, as can lowering the
learning rate.
Dead ReLU Units
Once the weighted sum for a ReLU unit falls below 0, the ReLU unit can get stuck. It
outputs 0 activation, contributing nothing to the network's output, and gradients can
no longer flow through it during backpropagation. With a source of gradients cut off,
the input to the ReLU may not ever change enough to bring the weighted sum back
above 0.
Lowering the learning rate can help keep ReLU units from dying.
Dropout Regularization
Yet another form of regularization, called Dropout, is useful for neural networks. It works by randomly "dropping out" unit activations in a network for a single gradient step. The more you drop out, the stronger the regularization:
• 0.0 = no dropout regularization.
• 1.0 = drop out everything; the model learns nothing.
• Values between 0.0 and 1.0 are more useful.
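A sketch of dropout for one layer and one gradient step, assuming NumPy; the dropout rate and activations are made up, and rescaling the surviving activations ("inverted dropout") is one common convention rather than part of the definition above:

import numpy as np

rng = np.random.default_rng(0)
activations = rng.normal(size=8)            # outputs of one hidden layer
dropout_rate = 0.5                          # fraction of units to drop this step

keep_mask = rng.random(8) >= dropout_rate   # new random mask every gradient step
dropped = activations * keep_mask / (1.0 - dropout_rate)   # zero some units, rescale the rest
print(dropped.round(2))                     # roughly half the activations are zeroed out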
One vs. all provides a way to leverage binary classification. Given a classification
problem with N possible solutions, a one-vs.-all solution consists of N separate
binary classifiers—one binary classifier for each possible outcome. During training,
the model runs through a sequence of binary classifiers, training each to answer a
separate classification question. For example, given a picture of a dog, five different
recognizers might be trained, four seeing the image as a negative example (not an
apple, not a bear, etc.) and one seeing the image as a positive example (a dog). That is:
1. Is this image an apple? No.
2. Is this image a bear? No.
3. Is this image candy? No.
4. Is this image a dog? Yes.
5. Is this image an egg? No.
This approach is fairly reasonable when the total number of classes is small, but
becomes increasingly inefficient as the number of classes rises.
We can create a significantly more efficient one-vs.-all model with a deep neural
network in which each output node represents a different class. The following figure
suggests this approach:
Softmax extends this idea into a multi-class world. That is, Softmax assigns decimal
probabilities to each class in a multi-class problem. Those decimal probabilities
must add up to 1.0. This additional constraint helps training converge more quickly
than it otherwise would.
For example, returning to the image analysis we saw in Figure 1, Softmax might
produce the following likelihoods of an image belonging to a particular class:
Class Probability
apple 0.001
bear 0.04
candy 0.008
dog 0.95
egg 0.001
Softmax is implemented through a neural network layer just before the output layer.
The Softmax layer must have the same number of nodes as the output layer.
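A sketch of the softmax computation itself, using made-up logits for the five classes in the table above (so the resulting probabilities only roughly resemble the table):

import numpy as np

logits = np.array([-3.0, 0.7, -1.0, 3.9, -3.0])      # raw scores: apple, bear, candy, dog, egg
exp = np.exp(logits - logits.max())                  # subtract the max for numerical stability
probabilities = exp / exp.sum()
print(probabilities.round(3), probabilities.sum())   # decimal probabilities that add up to 1.0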
Softmax Options
Consider the following variants of Softmax:
• Full Softmax is the Softmax we've been discussing; that is, Softmax
calculates a probability for every possible class.
• Candidate sampling means that Softmax calculates a probability for all the
positive labels but only for a random sample of negative labels. For example,
if we are interested in determining whether an input image is a beagle or a
bloodhound, we don't have to provide probabilities for every non-doggy
example.
Full Softmax is fairly cheap when the number of classes is small but becomes
prohibitively expensive when the number of classes climbs. Candidate sampling can
improve efficiency in problems having a large number of classes.
Softmax assumes that each example is a member of exactly one class. For example, suppose your examples are images containing exactly one item—a
piece of fruit. Softmax can determine the likelihood of that one item being a pear, an
orange, an apple, and so on. If your examples are images containing all sorts of
things—bowls of different kinds of fruit—then you'll have to use multiple logistic
regressions instead.
Embeddings
An embedding is a relatively low-dimensional space into which you can translate
high-dimensional vectors. Embeddings make it easier to do machine learning on
large inputs like sparse vectors representing words. Ideally, an embedding captures
some of the semantics of the input by placing semantically similar inputs close
together in the embedding space. An embedding can be learned and reused across
models.
Collaborative filtering is the task of making predictions about the interests of a user
based on interests of many other users. As an example, let's look at the task of
movie recommendation. Suppose we have 500,000 users, and a list of the movies
each user has watched (from a catalog of 1,000,000 movies). Our goal is to
recommend movies to users.
To solve this problem some method is needed to determine which movies are
similar to each other. We can achieve this goal by embedding the movies into a low-
dimensional space created such that similar movies are nearby.
Before describing how we can learn the embedding, we first explore the type of
qualities we want the embedding to have, and how we will represent the training data
for learning the embedding.
For example, the catalog might include the following movies (title, rating, description):
• Bleu (R): A French widow grieves the loss of her husband and daughter after they perish in a car accident.
• The Dark Knight Rises (PG-13): Batman endeavors to save Gotham City from nuclear annihilation in this sequel to The Dark Knight, set in the DC Comics universe.
• Harry Potter and the Sorcerer's Stone (PG): An orphaned boy discovers he is a wizard and enrolls in Hogwarts School of Witchcraft and Wizardry, where he wages his first battle against the evil Lord Voldemort.
• The Incredibles (PG): A family of superheroes forced to live as civilians in suburbia come out of retirement to save the superhero race from Syndrome and his killer robot.
• Shrek (PG): A lovable ogre and his donkey sidekick set off on a mission to rescue Princess Fiona, who is imprisoned in her castle by a dragon.
• Star Wars (PG): Luke Skywalker and Han Solo team up with two androids to rescue Princess Leia and save the galaxy.
• The Triplets of Belleville (PG-13): When professional cyclist Champion is kidnapped during the Tour de France, his grandmother and overweight dog journey overseas to rescue him, with the help of a trio of elderly jazz singers.
• Memento (R): An amnesiac desperately seeks to solve his wife's murder by tattooing clues onto his body.
Categorical data refers to input features that represent one or more discrete items
from a finite set of choices. For example, it can be the set of movies a user has
watched, the set of words in a document, or the occupation of a person.
Categorical data is most efficiently represented via sparse tensors, which are
tensors with very few non-zero elements. For example, if we're building a movie
recommendation model, we can assign a unique ID to each possible movie, and then
represent each user by a sparse tensor of the movies they have watched, as shown
in Figure 3.
Likewise one can represent words, sentences, and documents as sparse vectors
where each word in the vocabulary plays a role similar to the movies in our
recommendation example.
The simplest way is to define a giant input layer with a node for every word in your
vocabulary, or at least a node for every word that appears in your data. If 500,000
unique words appear in your data, you could represent a word with a length 500,000
vector and assign each word to a slot in the vector.
If you assign "horse" to index 1247, then to feed "horse" into your network you might
copy a 1 into the 1247th input node and 0s into all the rest. This sort of
representation is called a one-hot encoding, because only one index has a non-zero
value.
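A sketch of that one-hot representation, assuming NumPy and the hypothetical index 1247 for "horse":

import numpy as np

vocab_size = 500_000
horse_index = 1247                          # hypothetical slot assigned to "horse"

one_hot = np.zeros(vocab_size, dtype=np.float32)
one_hot[horse_index] = 1.0                  # only one index holds a non-zero value
print(one_hot.sum(), one_hot[horse_index])  # 1.0 1.0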
More typically your vector might contain counts of the words in a larger chunk of
text. This is known as a "bag of words" representation. In a bag-of-words vector,
several of the 500,000 nodes would have non-zero value.
But however you determine the non-zero values, one-node-per-word gives you
very sparse input vectors—very large vectors with relatively few non-zero values.
Sparse representations have a couple of problems that can make it hard for a model
to learn effectively.
Size of Network
Huge input vectors mean a super-huge number of weights for a neural network. If
there are M words in your vocabulary and N nodes in the first layer of the network
above the input, you have MxN weights to train for that layer. A large number of
weights causes further problems:
• Amount of data. The more weights in your model, the more data you need to
train effectively.
• Amount of computation. The more weights, the more computation required to
train and use the model. It's easy to exceed the capabilities of your hardware.
For example, principal component analysis (PCA) has been used to create word
embeddings. Given a set of instances like bag of words vectors, PCA tries to find
highly correlated dimensions that can be collapsed into a single dimension.
Word2vec
Word2vec is an algorithm invented at Google for training word embeddings.
Word2vec relies on the distributional hypothesis to map semantically similar words
to geometrically close embedding vectors.
The distributional hypothesis states that words which often have the same
neighboring words tend to be semantically similar. Both "dog" and "cat" frequently
appear close to the word "veterinarian", and this fact reflects their semantic
similarity. As the linguist John Firth put it in 1957, "You shall know a word by the
company it keeps".
Word2vec trains a classifier to distinguish groups of words that actually appear together in text from randomly grouped words; the negative (random) examples can be generated in a couple of ways. One version of the algorithm creates negative examples by pairing the true target word with randomly chosen context words. So it might take the positive
examples (the, plane), (flies, plane) and the negative examples (compiled, plane),
(who, plane) and learn to identify which pairs actually appeared together in text.
The classifier is not the real goal for either version of the system, however. After the
model has been trained, you have an embedding. You can use the weights
connecting the input layer with the hidden layer to map sparse representations of
words to smaller vectors. This embedding can be reused in other classifiers.
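A sketch of that reuse: multiplying a one-hot vector by the input-to-hidden weight matrix simply selects that word's row, which is its embedding. The matrix below is random (standing in for trained weights) and much smaller than the 500,000-word example:

import numpy as np

rng = np.random.default_rng(0)
vocab_size, embedding_dim = 10_000, 64
embedding_matrix = rng.normal(size=(vocab_size, embedding_dim))  # stands in for learned weights

horse_index = 1247
one_hot = np.zeros(vocab_size)
one_hot[horse_index] = 1.0

via_matmul = one_hot @ embedding_matrix          # dense 64-dimensional vector for "horse"
via_lookup = embedding_matrix[horse_index]       # equivalent, and far cheaper
print(np.allclose(via_matmul, via_lookup))       # True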
As another example, if you want to create an embedding layer for the words in a real-estate ad as part of a DNN to predict housing prices, then you'd optimize an L2 Loss using the known sale price of homes in your training data as the label.
Production ML Systems
System-Level Components
Static vs. Dynamic Training
• A static model is trained offline. That is, we train the model exactly once and then use
that trained model for a while.
• A dynamic model is trained online. That is, data is continually entering the system
and we're incorporating that data into the model through continuous updates.
Broadly speaking, the following points dominate the static vs. dynamic training
decision:
• Static models are easier to build and test.
• Dynamic models adapt to changing data. The world is a highly changeable place. Sales predictions built from last year's data are unlikely to successfully predict next year's results.
If your data set truly isn't changing over time, choose static training because it is
cheaper to create and maintain than dynamic training. However, many information
sources really do change over time, even those with features that you think are as
constant as, say, sea level. The moral: even with static training, you must still
monitor your input data for change.
For example, consider a model trained to predict the probability that users will buy
flowers. Because of time pressure, the model is trained only once using a dataset of
flower buying behavior during July and August. The model is then shipped off to
serve predictions in production, but is never updated. The model works fine for
several months, but then makes terrible predictions around Valentine's Day because
user behavior during that holiday period changes dramatically.
Data Dependencies
Reliability
Versioning
• Does the system that computes this data ever change? If so:
• How often?
• How will you know when that system changes?
Consider creating your own copy of the data you receive from the upstream process.
Then, only advance to the next version of the upstream data when you are certain
that it is safe to do so.
Necessity
• Does the usefulness of the feature justify the cost of including it?
It is always tempting to add more features to the model. For example, suppose you
find a new feature whose addition makes your model slightly more accurate. More
accuracy certainly sounds better than less accuracy. However, now you've just added
to your maintenance burden. That additional feature could degrade unexpectedly, so
you've got to monitor it. Think carefully before adding features that lead to minor
short-term wins.
Correlations
Some features correlate (positively or negatively) with other features. Ask yourself
the following question:
• Are any features so tied together that you need additional strategies to tease
them apart?
Feedback Loops
Sometimes a model can affect its own training data. For example, the results from
some models, in turn, are directly or indirectly input features to that same model.
Sometimes a model can affect another model. For example, consider two models for predicting stock prices, where the trades driven by one model's predictions change the prices that the other model uses as input features.
Fairness
Fairness: Types of Bias
Machine learning models are not inherently objective. Engineers train models by
feeding them a data set of training examples, and human involvement in the
provision and curation of this data can make a model's predictions susceptible to
bias.
When building models, it's important to be aware of common human biases that can
manifest in your data, so you can take proactive steps to mitigate their effects.
WARNING: The following inventory of biases provides just a small selection of biases that
are often uncovered in machine learning data sets; this list is not intended to be exhaustive.
Wikipedia's catalog of cognitive biases enumerates over 100 different types of human bias
that can affect our judgment. When auditing your data, you should be on the lookout for any
and all potential sources of bias that might skew your model's predictions.
Reporting Bias
Reporting bias occurs when the frequency of events, properties, and/or outcomes
captured in a data set does not accurately reflect their real-world frequency. This
bias can arise because people tend to focus on documenting circumstances that are
unusual or especially memorable, assuming that the ordinary can "go without
saying."
Automation Bias
Automation bias is a tendency to favor results generated by automated systems
over those generated by non-automated systems, irrespective of the error rates of
each.
EXAMPLE: Software engineers working for a sprocket manufacturer were eager to deploy
the new "groundbreaking" model they trained to identify tooth defects, until the factory
supervisor pointed out that the model's precision and recall rates were both 15% lower than
those of human inspectors.
Selection Bias
Selection bias occurs if a data set's examples are chosen in a way that is not
reflective of their real-world distribution. Selection bias can take many different
forms:
• Coverage bias: Data is not selected in a representative fashion.
EXAMPLE: A model is trained to predict future sales of a new product based on phone
surveys conducted with a sample of consumers who bought the product. Consumers who
instead opted to buy a competing product were not surveyed, and as a result, this group of
people was not represented in the training data.
• Non-response bias (or participation bias): Data ends up being unrepresentative due
to participation gaps in the data-collection process.
EXAMPLE: A model is trained to predict future sales of a new product based on phone
surveys conducted with a sample of consumers who bought the product and with a sample
of consumers who bought a competing product. Consumers who bought the competing
product were 80% more likely to refuse to complete the survey, and their data was
underrepresented in the sample.
• Sampling bias: Proper randomization is not used during data collection.
EXAMPLE: A model is trained to predict future sales of a new product based on phone
surveys conducted with a sample of consumers who bought the product and with a sample
of consumers who bought a competing product. Instead of randomly targeting consumers,
the surveyor chose the first 200 consumers that responded to an email, who might have
been more enthusiastic about the product than average purchasers.
Group Attribution Bias
• In-group bias: A preference for members of a group to which you also belong, or for
characteristics that you also share.
EXAMPLE: Two engineers training a résumé-screening model for software developers are
predisposed to believe that applicants who attended the same computer-science academy
as they both did are more qualified for the role.
• Out-group homogeneity bias: A tendency to stereotype individual members of a group to which you do not belong, or to see their characteristics as more uniform.
EXAMPLE: Two engineers training a résumé-screening model for software developers are
predisposed to believe that all applicants who did not attend a computer-science academy
do not have sufficient expertise for the role.
Implicit Bias
Implicit bias occurs when assumptions are made based on one's own mental
models and personal experiences that do not necessarily apply more generally.