Unit Normalization
Unit normalization, also known as L2 normalization, scales each data point so that it has
unit norm, meaning its length becomes 1. This technique ensures that the absolute scale
of a sample's features does not dominate their contributions. It is particularly useful
when the relative magnitudes of the features matter more than their absolute values.
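A minimal sketch with scikit-learn (an illustrative choice, with a toy feature matrix X):

import numpy as np
from sklearn.preprocessing import normalize

X = np.array([[3.0, 4.0],
              [1.0, 2.0]])

# Scale each row (sample) to unit L2 norm
X_unit = normalize(X, norm="l2")
print(np.linalg.norm(X_unit, axis=1))  # each row now has length 1.0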
pre-processing machinelearningflashcards.com
DBSCAN
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a clustering
algorithm used to group samples that are close to each other in the feature space. It
defines clusters as dense regions separated by sparser regions, allowing it to discover
clusters of arbitrary shape.
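A minimal sketch with scikit-learn (eps and min_samples are illustrative values that would need tuning on real data):

import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[1.0, 1.1], [1.2, 0.9], [1.1, 1.0],   # dense region 1
              [8.0, 8.1], [8.2, 7.9], [8.1, 8.0],   # dense region 2
              [50.0, 50.0]])                         # isolated point

labels = DBSCAN(eps=0.5, min_samples=2).fit_predict(X)
print(labels)  # cluster labels; noise points are labeled -1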
Bayes Error
Bayes error is the lowest error rate achievable on a given task. It serves as a theoretical
limit on the performance of any machine learning algorithm for that task.
classifier machinelearningflashcards.com
Effect Of Feature Scaling On
Gradient Descent
When the features have different scales, gradient descent may take longer to converge
or even fail to converge at all. Feature scaling, such as normalization or
standardization, ensures that all features are on a similar scale, allowing gradient
descent to avoid disproportionately large updates based on features with larger values.
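A minimal sketch of standardization before fitting a gradient-descent-based model (scikit-learn and SGDRegressor are illustrative choices):

from sklearn.datasets import make_regression
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import SGDRegressor

X, y = make_regression(n_samples=200, n_features=5, random_state=0)

# Rescale each feature to zero mean and unit variance
X_scaled = StandardScaler().fit_transform(X)

# Gradient descent converges more reliably on the scaled features
model = SGDRegressor(max_iter=1000).fit(X_scaled, y)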
[Figure: with unscaled features, the direction of steepest gradient is not the best direction for finding the minimum.]
clustering machinelearningflashcards.com
Out-of-Bag Errors
Out-of-bag (OOB) error is a measure used in ensemble learning, specifically in
random forest algorithms. It estimates the prediction error of the model without the
need for a separate validation set. OOB error is computed by evaluating the model's
performance on the training data points that were not included in the construction
of each individual decision tree in the forest.
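A minimal sketch with scikit-learn (the dataset and hyperparameters are placeholders):

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

# oob_score=True evaluates each tree on the training samples it never saw
forest = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0)
forest.fit(X, y)

print(forest.oob_score_)  # accuracy estimated from the out-of-bag samples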
regularization machinelearningflashcards.com
Decision Trees
Decision trees recursively split the data based on the feature values that create the
highest information gain, creating a tree-like structure. Each leaf node represents
a predicted output value. Decision trees make predictions by traversing the tree
from the root to a leaf node based on the sample's feature values. Decision trees
are highly interpretable; they can even be drawn in their entirety.
tree-based models machinelearningflashcards.com
Confusion Matrix
A confusion matrix is a table that helps evaluate the performance of a machine
learning model by comparing its predicted outcomes against the actual outcomes.
The rows correspond to the actual classes, while the columns correspond to the
predicted classes. Each cell represents the number of samples that belong to a
particular combination of actual and predicted classes.
                     predicted classes
                  class 1  class 2  class 3  class 4
actual   class 1     12       6        1        1
classes  class 2      0      20        0        0
         class 3      8       1       10        2
         class 4      0       5        2        8
Diagonal cells are correctly predicted samples.
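A minimal sketch with scikit-learn, using small made-up label arrays:

from sklearn.metrics import confusion_matrix

y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]

# Rows are actual classes, columns are predicted classes
print(confusion_matrix(y_true, y_pred))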
evaluation machinelearningflashcards.com
Hinge Loss
Hinge loss is a loss function used in binary classification problems, especially support
vector machines. Hinge loss penalizes samples that are misclassified or that fall inside
the margin, measuring how far a prediction is from being correct by the required margin.
It encourages a large margin between classes and is robust against outliers.
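A minimal NumPy sketch of the standard hinge loss, assuming labels encoded as -1/+1 and raw model scores:

import numpy as np

def hinge_loss(y_true, scores):
    # y_true in {-1, +1}; loss is zero once a sample is correct and beyond the margin
    return np.mean(np.maximum(0.0, 1.0 - y_true * scores))

y_true = np.array([1, -1, 1, -1])
scores = np.array([0.8, -1.5, -0.3, 0.2])  # raw decision-function outputs
print(hinge_loss(y_true, scores))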
theory machinelearningflashcards.com
Bias
Bias refers to the presence of systematic errors in a model's predictions due to
assumptions or oversimplification during the learning process. High bias can cause
underfitting, where the model is not complex enough to learn the nuances of the data
and make accurate predictions.
Stochastic Gradient Descent With Momentum
Stochastic gradient descent with momentum uses an exponentially weighted average of
past gradients to update the momentum term and the model's parameters at each
iteration. It helps the optimizer maintain a more stable direction and speed up
convergence.
v_t = β · v_(t−1) + (1 − β) · ∇L(θ_t; x_i)        (momentum at iteration t)
θ_(t+1) = θ_t − α · v_t                            (parameter update)
where β is the momentum parameter that controls the influence of past gradients, α is the
learning rate, and ∇L(θ_t; x_i) is the gradient of the loss function for a single sample.
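A minimal NumPy sketch of this update rule on a toy quadratic loss (the momentum value 0.9 and learning rate 0.1 are illustrative):

import numpy as np

def grad(theta):
    # gradient of a toy quadratic loss L(theta) = 0.5 * ||theta||^2
    return theta

theta = np.array([5.0, -3.0])
velocity = np.zeros_like(theta)
beta, lr = 0.9, 0.1  # momentum parameter and learning rate

for _ in range(100):
    g = grad(theta)
    velocity = beta * velocity + (1 - beta) * g  # exponentially weighted average of gradients
    theta = theta - lr * velocity

print(theta)  # close to the minimum at the origin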
optimizers machinelearningflashcards.com
Pre-Processing
theory machinelearningflashcards.com
Learning Curve
A learning curve is a plot that shows the relationship between the model's
performance and the size of the training set.
[Figure: training score and test score (performance) plotted against training set size.]
metrics machinelearningflashcards.com
False Positive Rate
False positive rate (FPR) is a metric used to assess the performance of a binary
classification model. It measures the proportion of negative instances that are
incorrectly classified as positive by the model.
FPR = FP / (FP + TN)
where FP is the number of negative instances incorrectly classified as positive and
TN is the number of correctly classified negative instances.
metrics machinelearningflashcards.com
Random Initialization
Of Neural Network Parameters
Parameters are initialized with random values from a certain distribution, such as a
normal or uniform distribution, enabling symmetry breaking and promoting diverse
learning.
pre-processing machinelearningflashcards.com
Minkowski Distance
Minkowski distance is used to measure the similarity or dissimilarity between two
points in a multidimensional space. It is a generalization of other distance metrics
like Euclidean distance and Manhattan distance.
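A minimal sketch of the distance with NumPy and SciPy's built-in helper (p = 3 is just an example order; p = 1 gives Manhattan and p = 2 gives Euclidean distance):

import numpy as np
from scipy.spatial.distance import minkowski

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 0.0, 3.0])
p = 3

# (sum of |x_i - y_i|^p) ^ (1/p)
manual = np.sum(np.abs(x - y) ** p) ** (1 / p)
print(manual, minkowski(x, y, p))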
metric machinelearningflashcards.com
Learning Rate
Learning rate controls the magnitude of the adjustments made to the parameters
based on the gradients of the loss function. A rate that is too small may result in slow
convergence, while one that is too large can lead to instability and overshooting the
optimal solution.
[Figure: loss function curves showing overshooting when the learning rate is too large and slow convergence when it is too small.]
pre-processing machinelearningflashcards.com
Principal Component Analysis
Principal Component Analysis (PCA) projects the data onto the axes (known as
principal components) along which the data varies the most (i.e. contains the most
information). The motivation for PCA is to reduce the number of features while
losing only a small amount of information.
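A minimal sketch with scikit-learn (the dataset and number of components are illustrative):

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)

# Project the 4 original features onto the 2 principal components
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                # (150, 2)
print(pca.explained_variance_ratio_)  # share of variance kept by each component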
theory machinelearningflashcards.com
Motivation For
Deep Networks
It has been shown that, in theory, a feedforward neural network with a single hidden layer
is good enough for most tasks. However, that single layer might have to be
impractically large. The motivation for deep networks is to achieve the same predictive
power with a much smaller number of units.
Hyperparameters Vs. Parameters
Hyperparameters are external settings that determine the behavior of the model.
Hyperparameters are typically set before training and include values such as
learning rate, regularization strength, and the number of hidden layers in a neural
network. Parameters are the internal variables of a model that are learned during
the training process. They are adjusted by the learning algorithm to optimize the model's
performance on the training data.
theory machinelearningflashcards.com
Boosting
Boosting is an ensemble learning technique used in machine learning to improve the
performance of weak learners by combining their predictions. It works by iteratively
training a sequence of weak learners, with each learner focusing on correcting the
errors made by the previous one. The final model is a weighted combination of these
weak learners, resulting in a stronger, more accurate prediction.
metrics machinelearningflashcards.com
Feature Selection
Strategies
Univariate Selection: Select features based on their individual statistical significance
or correlation with the target variable.
Principal Component Analysis (PCA): Transform the original features into a smaller
set of uncorrelated components that capture most of the variance in the data.
Feature Importance from Tree-Based Models: Use importance scores derived from
decision trees or random forests to rank and select the features based on their
predictive power.
strategies machinelearningflashcards.com
Accuracy
Accuracy is a metric used to evaluate the performance of a classification model. It
represents the proportion of correct predictions made by the model out of the total
number of predictions.
While accuracy can be a useful indicator of model performance, it may not always be
the best metric, especially when dealing with imbalanced datasets.
classification machinelearningflashcards.com
Neurons
A neuron is a fundamental building block of neural networks. Inspired by the behavior of
a biological neuron, a neural network neuron takes in multiple input signals, applies
weights and a bias to them, and then applies an activation function to produce an
output. Neurons work collectively in layers to process and transform input data,
enabling the network to learn complex patterns and make predictions.
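A minimal NumPy sketch of a single neuron (the weights, bias, and sigmoid activation are arbitrary illustrative choices):

import numpy as np

def neuron(x, weights, bias):
    # weighted sum of inputs plus bias, passed through a sigmoid activation
    z = np.dot(weights, x) + bias
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.5, -1.2, 3.0])        # input signals
weights = np.array([0.4, 0.1, -0.7])  # one weight per input
bias = 0.2

print(neuron(x, weights, bias))  # neuron output between 0 and 1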
neural networks machinelearningflashcards.com
Mean Squared Error
Mean squared error (MSE) is a loss function commonly used in regression. MSE
measures the average squared difference between the predicted output and the true
output. It is computed by summing the squared differences between the predicted and
true values and dividing by the number of samples.
MSE = (1 / n) × Σ (ŷ_i − y_i)²
where ŷ_i is the predicted output, y_i is the true output, and n is the number of samples.
metrics machinelearningflashcards.com
Hidden Layers
Hidden layers in a neural network are the intermediate layers between the input and
output layers that process and transform input data. They are called hidden layers
because they are not directly exposed to the input data or the final output predictions.
Instead, they lie "hidden" between these layers.
[Figure: a network with an input layer, hidden layers, and an output layer.]
theory machinelearningflashcards.com
Categorical Features
Categorical features represent distinct, non-numeric types or groups. These
features have a finite number of unique values, often representing different levels or
classifications.
Nominal: No inherent ordering of the categories. For example, types of fruit and colors.
Ordinal: Inherent ordering of the categories. For example, levels of pain.
features machinelearningflashcards.com
Noisy Rectified Linear Unit
Noisy ReLU is a variant of the Rectified Linear Unit (ReLU) activation function that
introduces random noise during the forward propagation of the neural network. It
helps reduce overfitting in the model.
regularization machinelearningflashcards.com
Model Consistency
Model consistency refers to the property of a machine learning model where the
probability that the difference between the predicted output and the true output is
greater than some small number approaches zero as the number of samples increases
towards infinity.
theory machinelearningflashcards.com
L1 Norm
L1 norm, also known as the Manhattan norm or the Taxicab norm, is a way to measure
the magnitude of a vector by summing the absolute values of its components.
norms machinelearningflashcards.com
Curse of Dimensionality
The curse of dimensionality is a phenomenon where the performance of algorithms
deteriorates as the number of features increases. This is due to the exponential
increase in the volume of the space, making it difficult to obtain enough data to properly
sample the space.
For example:
Imagine a dataset with only two features. It is easy to understand the relationships
between them. However, if we add more features, the dimensionality of the data increases
and thus the number of data points needed to fill the space becomes exponentially larger.
problems machinelearningflashcards.com
K-Means Clustering
K-Means clustering partitions samples into k clusters based on the similarity of the
samples. It assigns each sample to the cluster whose centroid (mean) is closest to it,
minimizing the within-cluster sum of squared distances. The algorithm iteratively
updates the cluster assignments and centroids until convergence is reached.
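A minimal sketch with scikit-learn (k = 2 and the toy data are illustrative):

import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
              [8.0, 8.0], [8.2, 7.9], [7.8, 8.1]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)           # cluster assignment of each sample
print(kmeans.cluster_centers_)  # centroid (mean) of each cluster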
clustering machinelearningflashcards.com
Weak Learners
A weak learner is a simple model that performs slightly better than random guessing, but
not well enough to be useful on its own. They serve as building blocks in ensemble
methods, where many weak learners are combined to create a more powerful "strong"
learner.
ensemble machinelearningflashcards.com
Downsampling
Downsampling is a strategy to handle data with imbalanced classes by randomly
removing samples from the majority class. Downsampling helps prevent the model
from being biased towards the dominant class and improves its performance on the
minority class.
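A minimal sketch using scikit-learn's resample utility (the labels, class sizes, and seed are illustrative):

import numpy as np
from sklearn.utils import resample

X = np.arange(20).reshape(10, 2)
y = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])  # class 0 is the majority

X_maj, y_maj = X[y == 0], y[y == 0]
X_min, y_min = X[y == 1], y[y == 1]

# Randomly drop majority-class samples until the classes are balanced
X_maj_down, y_maj_down = resample(X_maj, y_maj, replace=False,
                                  n_samples=len(y_min), random_state=0)

X_balanced = np.vstack([X_maj_down, X_min])
y_balanced = np.concatenate([y_maj_down, y_min])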
pre-processing machinelearningflashcards.com
Bias
Intuition
Bias is the stubbornness of the learning algorithm in the face of new data. Some
bias is necessary for generalization but too much will produce poor models.
neural networks machinelearningflashcards.com
Early Stopping
Early stopping is a technique to prevent overfitting by stopping the training process
before the model becomes too complex. It involves monitoring a validation metric
(e.g. validation loss) and terminating training when the metric no longer improves.
This helps to find the optimal point where the model performs well on unseen data.
[Figure: validation metric vs. training iterations for the training and testing sets; training stops where the validation metric stops improving.]
regularization machinelearningflashcards.com
Elastic Net
Elastic Net is a regularization technique that combines both L1 (Lasso) and L2 (Ridge)
regularization to improve the performance of linear regression models. This allows
Elastic Net to simultaneously perform feature selection and handle multicollinearity.
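A minimal sketch with scikit-learn (alpha and l1_ratio are illustrative and would normally be tuned):

from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet

X, y = make_regression(n_samples=200, n_features=10, noise=5.0, random_state=0)

# l1_ratio balances the L1 (sparsity) and L2 (shrinkage) penalties
model = ElasticNet(alpha=1.0, l1_ratio=0.5).fit(X, y)
print(model.coef_)  # some coefficients may be driven exactly to zero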
regularization machinelearningflashcards.com
Gradient Cliffs
A gradient cliff is where the gradients of a neural network's loss function experience
an extreme and sudden change. This happens when a model's parameters are updated,
and the gradients become very large or very small. Gradient cliffs can hinder
convergence and make the optimization algorithm overshoot a minimum.
[Figure: a sudden decrease in loss at a gradient cliff causes the optimizer to overshoot the minimum.]
optimizers machinelearningflashcards.com
Common Output Layer
Activation Functions
An output layer activation function processes the outputs of the neural network's
final layer to produce a desired range of outputs, such as a probability distribution
over the predicted classes.
Sigmoid: Maps any input value to a range between 0 and 1; commonly used for binary
classification problems.
Softmax: Scales the outputs of the model to represent a probability distribution over the
classes; commonly used for multi-class classification problems.
theory machinelearningflashcards.com
Feature Matrix
A feature matrix, also known as a design matrix or input matrix, is a two-
dimensional array in which rows represent individual samples and columns
represent the features. Each element contains the value of a specific feature for a
particular sample. The feature matrix is an input for model training.
data machinelearningflashcards.com
Binary Cross-Entropy Loss
Binary cross-entropy loss measures the difference between the predicted and true
binary classification outputs. It is the negative logarithm of the predicted probability
of the true class.
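A minimal NumPy sketch of this loss (the small epsilon is only there to avoid taking the log of zero):

import numpy as np

def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    # negative log-likelihood of the true class under the predicted probability
    y_prob = np.clip(y_prob, eps, 1 - eps)
    return -np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))

y_true = np.array([1, 0, 1, 1])
y_prob = np.array([0.9, 0.2, 0.6, 0.4])  # predicted probability of class 1
print(binary_cross_entropy(y_true, y_prob))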
Feature Importance
Example: The decision trees in a random forest make splits that maximize the
decrease in impurity. By calculating the mean decrease in impurity for each feature
across all trees, the importance of each feature is quantified.
Decision Tree Regression
Decision tree regression predicts continuous numerical values. It recursively splits the
training data based on the features to create a tree-like model, where each leaf node
represents a predicted value. During prediction, the algorithm follows the decision path
in the tree to reach a leaf node and returns that node's value.
pre-processing machinelearningflashcards.com
Bootstrap Sampling
Bootstrap sampling simulates obtaining many new datasets by repeated sampling
with replacement from the original dataset. By generating multiple samples from
the original dataset, it allows for better assessment of model performance.
original data:
index  x1  x2
1      A   10
2      B   20
3      C   30

random sampling with replacement produces, for example:
new dataset 1: (1, A, 10), (2, B, 20), (1, A, 10)
new dataset 2: (2, B, 20), (1, A, 10), (2, B, 20)
new dataset 3: (2, B, 20), (3, C, 30), (3, C, 30)
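A minimal pandas sketch of drawing bootstrap datasets (pandas is an illustrative choice; the exact rows drawn depend on the random seed):

import pandas as pd

original = pd.DataFrame({"x1": ["A", "B", "C"], "x2": [10, 20, 30]})

# Each bootstrap dataset has the same size as the original,
# drawn by sampling rows with replacement
bootstraps = [original.sample(n=len(original), replace=True, random_state=i)
              for i in range(3)]

for i, sample in enumerate(bootstraps, start=1):
    print(f"new dataset {i}:\n{sample}\n")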
sampling machinelearningflashcards.com
What Does It Mean To
Learn In Machine Learning
Learning refers to the process by which a machine learning model automatically
adjusts its internal parameters or weights based on the provided training data.
Through this iterative process, the model learns patterns, relationships, and
underlying structures in the data, allowing it to make predictions or perform tasks
without being explicitly programmed.
“A computer program is said to learn from experience E with respect to some class
of tasks T and performance measure P if its performance at tasks in T, as
measured by P, improves with experience E.” -Tom Mitchell (1997)
theory machinelearningflashcards.com
Gradient Descent
Gradient descent is an optimization algorithm that minimizes the loss function and
finds optimal values for a model's parameters. Gradient descent iteratively updates
the parameters in the direction of steepest descent by computing the gradients of
the loss function with respect to the parameters. This process continues until a
minimum of the loss function is reached or other criteria are met.
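A minimal NumPy sketch minimizing a toy quadratic loss (the learning rate and iteration count are illustrative):

import numpy as np

def loss(theta):
    return np.sum((theta - 3.0) ** 2)   # toy loss, minimum at theta = 3

def gradient(theta):
    return 2.0 * (theta - 3.0)          # derivative of the loss

theta = np.array([10.0])
learning_rate = 0.1

for _ in range(100):
    theta = theta - learning_rate * gradient(theta)  # step along steepest descent

print(theta, loss(theta))  # theta is close to 3, loss close to 0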
[Figure: a loss function curve with points marking successive gradient descent iterations approaching the minimum.]
optimizers machinelearningflashcards.com
Ensemble Methods
Ensemble methods make more accurate predictions by aggregating the output of many
individual models through voting, averaging, or weighted combinations. Ensemble
methods can often achieve better performance, enhance generalization, and improve
robustness compared to using a single model.
ensemble machinelearningflashcards.com
One Hot Encoding
One-hot encoding represents categorical features as binary vectors. It converts each
category into a binary feature where a value of 1 indicates the presence of that category
and 0 indicates its absence. This encoding is useful for machine learning algorithms that
cannot directly handle categorical data and require numerical inputs.
Fruit column    one-hot encoded features
                Apple   Pear
Apple             1       0
Pear              0       1
Apple             1       0
Pear              0       1
Apple             1       0
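A minimal pandas sketch reproducing the encoding above (pandas is an illustrative choice):

import pandas as pd

df = pd.DataFrame({"Fruit": ["Apple", "Pear", "Apple", "Pear", "Apple"]})

# One binary column per category: 1 marks presence, 0 absence
one_hot = pd.get_dummies(df["Fruit"], dtype=int)
print(one_hot)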
pre-processing machinelearningflashcards.com
AdaBoost
AdaBoost, or Adaptive Boosting, is an ensemble learning technique that combines
multiple weak classifiers to form a strong classifier. By adjusting the weights of
misclassified instances and training new weak classifiers on these weights, AdaBoost
improves classification performance.
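A minimal sketch with scikit-learn (the dataset and n_estimators are illustrative; the default weak learner is a shallow decision tree):

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each new weak learner focuses on samples the previous ones misclassified
clf = AdaBoostClassifier(n_estimators=50, random_state=0).fit(X_train, y_train)
print(clf.score(X_test, y_test))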
datasets machinelearningflashcards.com
Bag Of Words
A bag of words is a technique in natural language processing that converts text into
numerical format by creating a vocabulary of unique words and counting their
occurrences. Ignoring word order, each word frequency serves as a feature in a
feature vector.
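A minimal sketch with scikit-learn's CountVectorizer (the tiny corpus is illustrative):

from sklearn.feature_extraction.text import CountVectorizer

corpus = ["the cat sat on the mat",
          "the dog sat on the log"]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(corpus)  # one row per document, one column per word

print(vectorizer.get_feature_names_out())  # the learned vocabulary
print(counts.toarray())                    # word counts, ignoring word order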
metrics machinelearningflashcards.com
Generalization
Generalization refers to the ability of a model to perform well on data that it has
not been trained on. It indicates the model's capacity to capture underlying patterns
and make accurate predictions on new, unseen examples.
theory machinelearningflashcards.com
Training Error Rate
Training error rate is the number of incorrect predictions divided by the total
number of predictions in the training set. While a lower training error rate is
generally better, it's important to avoid overfitting, where the model performs well on
the training data but poorly on unseen data.
metrics machinelearningflashcards.com
K-Fold Cross-Validation
K-Fold Cross-Validation assesses the performance and generalization ability of a
machine learning model. It involves splitting the dataset into k equally sized folds,
using k-1 folds for training and the remaining fold for validation. This process is
repeated k times, with each fold serving as the validation set exactly once.
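A minimal sketch with scikit-learn (the model and k = 5 are illustrative):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# Each of the 5 folds is used exactly once as the validation set
scores = cross_val_score(model, X, y, cv=5)
print(scores, scores.mean())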
cross-validation machinelearningflashcards.com
Mean Shift Clustering
Mean Shift clustering is an unsupervised algorithm used to discover clusters in data
without the need for specifying the number of clusters in advance. It works by
iteratively shifting the data points towards the mean of their local neighborhood
until convergence.
Imagine a foggy football field with 100 people standing on it. Because of the fog,
people can only see a short distance. Every 10 seconds, each person looks around and
takes a step in the direction facing the most people they can see.
As time goes on, people start to group up as they repeatedly take steps towards
larger and larger crowds. The end result is clusters of people around the pitch.
clustering machinelearningflashcards.com
Categorical Cross-Entropy Loss
Categorical cross-entropy loss measures the difference between the predicted and
true probability distributions of multiple classes. It is calculated as the negative
logarithm of the predicted probability of the true class.
loss = −Σ_i y_i log(p_i)
where y_i is the true probability of the i-th class and p_i is the predicted probability of
the i-th class.
clustering machinelearningflashcards.com
Random Forests
Random Forests is an ensemble learning method that combines multiple decision
trees to make predictions. Each tree in the forest is trained on a random subset of
the training data with replacement, and a random subset of features is considered at
each split. By aggregating the predictions of individual trees, Random Forests can
handle complex datasets, reduce overfitting, and provide robust predictions.
each tree votes by predicting the target class
Validation Set Vs. Test Set
The validation set is used to tune the hyperparameters of the model and assess its
performance during training.
The test set is used to evaluate the final performance of the trained model after it
has been trained and validated. The test set provides an unbiased assessment of how
well the model generalizes to unseen data.
theory machinelearningflashcards.com
Gini Index
In Random Forests
Gini index is a measure of impurity used for splitting nodes in the decision trees
that make up the random forest. Gini index quantifies the degree of class mixing
within a node by calculating the probability of misclassifying a randomly chosen
sample in the node. The Gini index is minimized when a node contains pure samples
of a single class, and it is used as a criterion to determine the optimal splits in the
decision trees during the construction of the random forest.
Mini-Batches
Each mini-batch is randomly sampled from the training set and used to compute the
gradients and update the model parameters using optimization algorithms such as
stochastic gradient descent.
The size of the mini-batch is typically chosen based on factors such as available
memory, computational resources, and trade-offs between convergence speed and
parameter update variance.
norms machinelearningflashcards.com
Dropout
Dropout is a regularization technique used in neural networks where randomly
selected neurons are temporarily deactivated during training. This forces the
network to learn more robust features. During inference, all neurons are active, but
their outputs are scaled by the keep probability so that activations match their expected
values during training, ensuring that the model can make predictions with all neurons
contributing effectively.
regularization machinelearningflashcards.com
F1 Score
F1 score is a metric used to evaluate the performance of a binary classification model,
combining precision and recall into a single measure. It is the harmonic mean of
precision and recall, providing a balanced measure of the model's accuracy.
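A minimal sketch with scikit-learn, alongside the harmonic-mean formula (the labels are made up):

from sklearn.metrics import f1_score, precision_score, recall_score

y_true = [1, 0, 1, 1, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1]

precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)

# F1 is the harmonic mean of precision and recall
print(2 * precision * recall / (precision + recall))
print(f1_score(y_true, y_pred))  # same value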
metrics machinelearningflashcards.com
Overfitting Vs. Underfitting
Overfitting happens when a model learns to fit the training data too closely,
capturing both the underlying patterns and noise in the data. Underfitting happens
when a model is too simple or lacks the capacity to capture the underlying patterns
in the data.
theory machinelearningflashcards.com
Weight Decay
Weight decay is a regularization technique for neural networks. Weight decay adds a
penalty, usually L2 norm, to the loss function. This reduces the magnitude of the
model's weights and thereby improves performance on unseen data.
new loss = original loss function + λ × (L2 norm of the model parameters)
where λ controls the regularization strength.
regularization machinelearningflashcards.com
Grid Search
Grid search systematically searches for the optimal hyperparameters for a model. It
involves defining a list of potential hyperparameter values, training a model using
each set of hyperparameters, then evaluating each model's performance using
cross-validation.
By exhaustively trying many possible combinations, grid search helps identify the
best set of hyperparameters that yield the highest performance.
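A minimal sketch with scikit-learn's GridSearchCV (the estimator and parameter grid are illustrative):

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

param_grid = {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}

# Every combination in the grid is trained and scored with 5-fold cross-validation
search = GridSearchCV(SVC(), param_grid, cv=5).fit(X, y)
print(search.best_params_, search.best_score_)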
algorithms machinelearningflashcards.com
Error Types
Type I error is a false positive, incorrectly rejecting the null hypothesis. Type II error
is a false negative, failing to detect a true effect. Here is how to remember them:
Type I  → False Positive
Type II → False Negative
optimizers machinelearningflashcards.com
Imputation
Imputation is the process of filling in missing values in the data.
Average Imputation: Replace missing values with the mean or median value of the
feature. This is simple and effective.
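A minimal sketch of average imputation with scikit-learn (the toy matrix is illustrative):

import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [7.0, np.nan]])

# Replace each missing value with the mean of its feature (column)
imputer = SimpleImputer(strategy="mean")
print(imputer.fit_transform(X))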
pre-processing machinelearningflashcards.com
Sigmoid
The sigmoid activation function converts the input value to a range between 0 and 1,
which is interpreted as a probability. It is commonly used in binary classification tasks.
pre-processing machinelearningflashcards.com
Loss Functions
A loss function, also called a cost function, measures the difference between the
predicted output of a machine learning model and the true output, and is used to
optimize the model's parameters during training. The loss function is what a model is
trained to minimize; mean squared error (for regression) and cross-entropy (for
classification) are common examples.
bagging machinelearningflashcards.com
Tanh
The tanh (hyperbolic tangent) activation function maps the input to a range between
-1 and 1. It is useful for capturing both positive and negative values, enabling the
model to learn complex nonlinear relationships.