All Cards

Unit Normalization

Unit normalization, also known as L2 normalization, scales data in a way that each data
point has a unit norm, meaning its length becomes 1. This normalization technique is
applied to ensure that the absolute scale of the features does not dominate their
contributions. It is particularly useful when the relative magnitudes of the features
are more important.
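A minimal NumPy sketch of L2 unit normalization (scikit-learn's Normalizer behaves the same way); the array values are made-up examples:

import numpy as np

X = np.array([[3.0, 4.0],
              [1.0, 2.0]])  # made-up example data, one row per data point

# Divide each row by its L2 norm so every data point has length 1
norms = np.linalg.norm(X, axis=1, keepdims=True)
X_unit = X / norms

print(X_unit)                          # [[0.6, 0.8], [0.447..., 0.894...]]
print(np.linalg.norm(X_unit, axis=1))  # [1. 1.]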

pre-processing machinelearningflashcards.com
DBSCAN
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a clustering
algorithm used to group samples that are close to each other in the feature space. It
defines clusters as dense regions separated by sparser regions, allowing it to discover
clusters of arbitrary shape.

Example DBSCAN Steps

1. A random sample is selected.
2. If the sample has a minimum number of close neighbors, it is considered part of a cluster.
3. Step 2 is repeated recursively for all of the sample's close neighbors, and then the neighbors' close neighbors, etc. These samples are the cluster's core members.
4. Once Step 3 runs out of samples, a new unvisited sample is selected and the whole process begins again.

Afterwards, samples not already assigned to a cluster are assigned to a nearby cluster or marked as outliers.
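A minimal sketch using scikit-learn's DBSCAN; the dataset and the eps/min_samples values are illustrative assumptions, not part of the original card:

from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Toy data with two crescent-shaped clusters of arbitrary shape
X, _ = make_moons(n_samples=200, noise=0.05, random_state=0)

# eps = neighborhood radius, min_samples = minimum number of close neighbors
db = DBSCAN(eps=0.3, min_samples=5).fit(X)

print(db.labels_[:10])  # cluster index per sample; -1 marks outliers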
clustering machinelearningflashcards.com
Linear Discriminant Analysis
Linear Discriminant Analysis (LDA) is a supervised dimensionality reduction
technique for classification problems. It aims to find a linear combination of
features that maximizes the separation between different classes while minimizing
the variance within each class.

Figure: samples projected onto the axis of maximum discrimination between classes.

dimensionality reduction machinelearningflashcards.com


Bias-Variance Tradeoff
The bias-variance tradeoff represents the balance between a model's simplicity
(bias) and its sensitivity to fluctuations in the training data (variance). Striking the
right balance minimizes the total error and achieves the best performance.

Bias² is the squared difference between the model's average prediction and the true values. Variance measures the model's inconsistency across different training sets. Irreducible error is the inherent noise in the data, which cannot be reduced by improving the model.

bias vs. variance machinelearningflashcards.com


Bayes Error
Bayes error represents the lowest possible error rate that can be achieved by an
optimal classifier for a given problem. It stems from the inherent noise or
randomness in the data, making it impossible to achieve perfect classification even
with the best model.

Bayes error serves as a theoretical limit on the performance of any machine learning
algorithm for a specific task.

classifier machinelearningflashcards.com
Effect Of Feature Scaling On
Gradient Descent
When the features have different scales, gradient descent may take longer to converge
or even fail to converge at all. Feature scaling, such as normalization or
standardization, ensures that all features are on a similar scale, allowing gradient
descent to avoid disproportionately large updates based on features with larger values.

Figure: contours of feature 1 vs. feature 2 and of scaled feature 1 vs. scaled feature 2; without scaling, the direction of steepest gradient is not best for finding the minimum.


regularization machinelearningflashcards.com
Exploding Gradient
Exploding gradients are when gradients during the training process become extremely
large. This can lead to unstable model updates, making it challenging for the
optimization algorithm to converge. It often occurs when the gradients are multiplied
repeatedly, such as in neural networks with many layers.

Figure: a gradient so steep that the training process overshoots the minima.

neural networks machinelearningflashcards.com


K-Nearest Neighbors
K-Nearest Neighbors (k-NN) is used for both classification and regression. The
prediction for a sample is determined by the majority vote (in classification) or the
average (in regression) of the labels of its k nearest neighbors in the feature space.

Figure: the 3 nearest neighbors of a sample of unknown class.
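A minimal scikit-learn sketch; the iris dataset and k=3 are illustrative assumptions:

from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)  # example dataset

# Classification: predict by majority vote of the 3 nearest neighbors
knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)
print(knn.predict(X[:2]))  # predicted classes for two samples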

clustering machinelearningflashcards.com
Out-of-Bag Errors
Out-of-bag (OOB) error is a measure used in ensemble learning, specifically in
random forest algorithms. It estimates the prediction error of the model without the
need for a separate validation set. OOB error is computed by evaluating the model's
performance on the training data points that were not included in the construction
of each individual decision tree in the forest.

Figure: sampling with replacement creates the data subsets used to train each model; the samples not selected for a given subset are used to evaluate that model, and the models' votes form the output.
tree-based models machinelearningflashcards.com
Regularization
Regularization is a technique used in machine learning to prevent overfitting by
adding a penalty term to the loss function during training. The penalty term
discourages complex models by imposing constraints on the model parameters.

Regularization techniques, such as L1 and L2 regularization, help to control the trade-off between model complexity and the fit to the training data, improving the model's ability to generalize to unseen data.

regularization machinelearningflashcards.com
Decision Trees
Decision trees recursively split the data based on the feature values that create the
highest information gains, creating a tree-like structure. Each leaf node represents
a predicted output value. Decision trees make predictions by traversing the tree
from the root to a leaf node based on the sample's feature values. Decision trees are highly interpretable; they can even be drawn in their entirety.

Figure: an example tree with a split on tall/short at the root, a further split on young/old along one branch, and leaf nodes predicting passed or failed.
tree-based models machinelearningflashcards.com
Confusion Matrix
A confusion matrix is a table that helps evaluate the performance of a machine
learning model by comparing its predicted outcomes against the actual outcomes.

The rows correspond to the actual classes, while the columns correspond to the
predicted classes. Each cell represents the number of samples that belong to a
particular combination of actual and predicted classes.
                  predicted classes
                  class 1  class 2  class 3  class 4
actual   class 1     12       6        1        1
classes  class 2      0      20        0        0
         class 3      8       1       10        2
         class 4      0       5        2        8

Diagonal cells are correctly predicted samples.
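A minimal sketch of computing a confusion matrix with scikit-learn; the label arrays are made-up examples:

from sklearn.metrics import confusion_matrix

y_true = [0, 0, 1, 1, 2, 2]  # actual classes (made-up)
y_pred = [0, 1, 1, 1, 2, 0]  # predicted classes (made-up)

# Rows correspond to actual classes, columns to predicted classes
print(confusion_matrix(y_true, y_pred))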
evaluation machinelearningflashcards.com
Hinge Loss
Hinge loss is a loss function used in binary classification problems, especially support
vector machines. Hinge loss penalizes misclassified samples by measuring the
distance between the predicted class and the ground truth class, with a margin that
encourages correct classification. Hinge loss maximizes the margin between classes
and is robust against outliers.

loss = max(0, 1 − y · ŷ), where y is the true class label (either -1 or +1) and ŷ is the predicted class score.

loss functions machinelearningflashcards.com


No Free Lunch Theorem
The No Free Lunch Theorem states that there is no one universal algorithm that
outperforms all others across all possible problems. Imagine every possible
underlying data generating distribution. Imagine we train every model using every
training algorithm on every distribution. On average models created by every
learning algorithm will have some test error. Therefore, there is no single best
training algorithm.

Put another way, there is no one-size-fits-all solution, and the effectiveness of an algorithm depends on the characteristics of the problem.

theory machinelearningflashcards.com
Bias
Bias refers to the presence of systematic errors in a model's predictions due to
assumptions or oversimplification during the learning process. High bias can cause
underfitting, where the model is not complex enough to learn the nuances of the data
and make accurate predictions.

Figure: the model vs. the real world.

bias vs. variance machinelearningflashcards.com


Stochastic gradient descent

with momentum
Stochastic gradient descent with momentum uses an exponentially weighted average of
past gradients to update the momentum term and the model's parameters at each
iteration. It helps the optimizer maintain a more stable direction and speed up
convergence.
v_t = β·v_{t−1} + (1 − β)·∇L(θ_t), θ_{t+1} = θ_t − η·v_t, where v_t is the momentum at iteration t, β is the momentum parameter that controls the influence of past gradients, ∇L is the gradient of the loss function for a single sample, θ_t are the parameters at iteration t, and η is the learning rate.
optimizers machinelearningflashcards.com
Pre-Processing

Training And Test Sets


Pre-processing is applied only to the training set to avoid information leakage from
the test set. The purpose of the test set is to evaluate the model's performance on
unseen data that simulates real-world scenarios. If pre-processing techniques are
applied to the test set, it can introduce bias and compromise the integrity of the
evaluation.

theory machinelearningflashcards.com
Learning Curve
A learning curve is a plot that shows the relationship between the model's
performance and the size of the training set.

Figure: training score and test score (performance) plotted against the size of the training set.

metrics machinelearningflashcards.com
False Positive Rate
False positive rate (FPR) is a metric used to assess the performance of a binary
classification model. It measures the proportion of negative instances that are
incorrectly classified as positive by the model.

FPR = FP / (FP + TN), where FP is the number of negative instances incorrectly classified as positive and TN is the number of correctly classified negative instances.

metrics machinelearningflashcards.com
Random Initialization
Of Neural Network Parameters
Parameters are initialized with random values from a certain distribution, such as a
normal or uniform distribution, enabling symmetry breaking and promoting diverse
learning.

Figure: the weights between the input layer and the output layer are initialized with random numbers.

neural networks machinelearningflashcards.com


Upsampling
Upsampling is a strategy to handle data with imbalanced classes by replicating or
generating new samples from the minority class to achieve a more balanced
distribution. Upsampling provides more training samples for underrepresented classes and reduces the bias towards the majority class.
Figure: imbalanced data with a majority class and a minority class, and the upsampled data after the minority class is replicated.
pre-processing machinelearningflashcards.com
Minkowski Distance
Minkowski distance is used to measure the similarity or dissimilarity between two
points in a multidimensional space. It is a generalization of other distance metrics
like Euclidean distance and Manhattan distance.

When p=1 the Minkowski distance is equal to the Manhattan distance. When p=2 the Minkowski distance is equal to the Euclidean distance.
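A minimal sketch of the Minkowski distance, (Σ|xᵢ − yᵢ|ᵖ)^(1/p); the two points are made-up examples:

import numpy as np

def minkowski(x, y, p):
    # (sum of |x_i - y_i|^p) raised to the power 1/p
    return np.sum(np.abs(x - y) ** p) ** (1 / p)

a, b = np.array([0.0, 0.0]), np.array([3.0, 4.0])  # made-up points
print(minkowski(a, b, p=1))  # 7.0 (Manhattan distance)
print(minkowski(a, b, p=2))  # 5.0 (Euclidean distance)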

metric machinelearningflashcards.com
Learning Rate
Learning rate controls the magnitude of the adjustments made to the parameters
based on the gradients of the loss function. A too small rate may result in slow
convergence, while a too large rate can lead to instability and overshooting the
optimal solution.

Figure: loss function with a learning rate that is too large (overshooting) and a learning rate that is too small (slow convergence).

dimensionality reduction machinelearningflashcards.com


Data augmentation
Data augmentation is a technique to increase the size of the training set by creating
additional training samples from the existing data. This is achieved by applying
transformations to the original samples, such as rotating, flipping, or cropping images
or adding noise to audio data. Data augmentation helps to improve the robustness of the
model by exposing it to a wider range of variations in the data.

original sample transformed samples

pre-processing machinelearningflashcards.com
Principal Component Analysis
Principal Component Analysis (PCA) projects the data onto the axes (known as principal components) along which the data varies the most (i.e. contains the most information). The motivation for PCA is to reduce the number of features while only losing a small amount of information.

Figure: samples and the first principal component.

theory machinelearningflashcards.com
Motivation For
Deep Networks
It has been shown that, in theory, a feedforward neural network with a single hidden layer is sufficient for most tasks. However, that single layer might have to be impractically large. The motivation for deep networks is to achieve the same predictive power with a much smaller number of units.

Figure: a deep network compared to a shallow network.

neural networks machinelearningflashcards.com


Kernel PCA
Kernel PCA (Principal Component Analysis) is a nonlinear extension of PCA. It
employs kernel functions, such as the radial basis function (RBF) kernel, to map the
input data into a higher-dimensional feature space where linear PCA is performed.
This allows Kernel PCA to capture nonlinear relationships.

Figure: for linearly inseparable data, linear PCA reduces dimensionality but does not make the classes linearly separable, while kernel PCA reduces dimensionality while making the classes linearly separable.

dimensionality reduction machinelearningflashcards.com


Hyperparameters Vs. Parameters

Hyperparameters are external settings that determine the behavior of the model. Hyperparameters are typically set before training and include values such as learning rate, regularization strength, and the number of hidden layers in a neural network. Parameters are the internal variables of a model that are learned during the training process. They are adjusted by the learning algorithm to optimize the model's performance on the training data.

Example of a hyperparameter: the number of trees in a random forest.

Example of a parameter: the weights in a neural network.

theory machinelearningflashcards.com
Boosting
Boosting is an ensemble learning technique used in machine learning to improve the
performance of weak learners by combining their predictions. It works by iteratively
training a sequence of weak learners, with each learner focusing on correcting the
errors made by the previous one. The final model is a weighted combination of these
weak learners, resulting in a stronger, more accurate prediction.

original data weak learner 1 weak learner 2 strong learner

ensemble learning machinelearningflashcards.com


Model Complexity
Impact On Bias And Variance
Increasing model complexity reduces bias by allowing the model to capture more
intricate patterns in the data, but it also increases variance, making the model more
sensitive to noise and potentially leading to overfitting.

Figure: as model complexity increases, bias² decreases while variance increases; test loss reflects the balance of the two.

theory model complexity machinelearningflashcards.com


Exponential Linear Units
Exponential Linear Unit (ELU) is a neural network activation function. It is an
extension of the rectified linear unit (ReLU) function and addresses some of ReLU's
limitations by introducing a negative region that allows for more robust learning. ELUs
can mitigate the vanishing gradient problem and improve the network's ability to
capture complex patterns in the data.

Figure: ELU activation function; a parameter controls the slope in the negative region.

neural network activation functions machinelearningflashcards.com


Brier Score
The Brier Score is a metric used to evaluate the accuracy of probabilistic predictions
from classification models. Scores range between 0 and 1, with a lower score
indicating better predictive performance.

Brier score = (1/N) · Σᵢ (fᵢ − oᵢ)², where fᵢ is the predicted probability (example: 0.8), oᵢ is the actual outcome (example: 1.0), and N is the number of samples.
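A minimal sketch of the Brier score (scikit-learn's brier_score_loss computes the same quantity for binary outcomes); the probabilities and outcomes are made-up examples:

import numpy as np

def brier_score(predicted_probs, outcomes):
    # Mean squared difference between predicted probabilities and actual outcomes
    predicted_probs = np.asarray(predicted_probs, dtype=float)
    outcomes = np.asarray(outcomes, dtype=float)
    return np.mean((predicted_probs - outcomes) ** 2)

print(brier_score([0.8, 0.3, 0.9], [1.0, 0.0, 1.0]))  # 0.0466...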

metrics machinelearningflashcards.com
Feature Selection
Strategies
Univariate Selection: Select features based on their individual statistical significance
or correlation with the target variable.

Principal Component Analysis (PCA): Transform the original features into a smaller
set of uncorrelated components that capture most of the variance in the data.

L1 Regularization (Lasso): Encourage sparsity in the model by penalizing the absolute magnitude of feature coefficients, effectively selecting only the most informative features.

Feature Importance from Tree-Based Models: Use importance scores derived from
decision trees or random forests to rank and select the features based on their
predictive power.

strategies machinelearningflashcards.com
Accuracy
Accuracy is a metric used to evaluate the performance of a classification model. It
represents the proportion of correct predictions made by the model out of the total
number of predictions.

While accuracy can be a useful indicator of model performance, it may not always be
the best metric, especially when dealing with imbalanced datasets.

classification machinelearningflashcards.com
Neurons
A neuron is a fundamental building block of neural networks. Inspired by the behavior of a biological neuron, neural network neurons take in multiple input signals, apply weights and biases to them, and then apply an activation function to produce an output. Neurons work collectively in layers to process and transform input data,
enabling the network to learn complex patterns and make predictions.

Figure: inputs are multiplied by weights, summed together with a bias, and passed through an activation function to produce the output.
neural networks machinelearningflashcards.com
Mean Squared Error
Mean squared error (MSE) is a loss function commonly used in regression. MSE
measures the average squared difference between the predicted output and the true
output. It is computed by summing the squared differences between the predicted and true values and dividing by the number of samples.

MSE = (1/n) · Σᵢ (ŷᵢ − yᵢ)², where ŷᵢ is the predicted output, yᵢ is the true output, and n is the number of samples.
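A minimal NumPy sketch of MSE; the values are made-up examples:

import numpy as np

def mean_squared_error(y_true, y_pred):
    # Average of the squared differences between true and predicted outputs
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return np.mean((y_true - y_pred) ** 2)

print(mean_squared_error([3.0, 5.0, 2.0], [2.5, 5.0, 4.0]))  # 1.4166...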

loss functions machinelearningflashcards.com


Classification
Classification is a type of supervised machine learning task where an algorithm
learns to categorize samples into predefined classes.

Figure: samples separated by a decision boundary.

metrics machinelearningflashcards.com
Hidden Layers
Hidden layers in a neural network are the intermediate layers between the input and
output layers that process and transform input data. They are called hidden layers
because they are not directly exposed to the input data or the final output predictions.
Instead, they lie "hidden" between these layers.

Figure: hidden layers between the input layer and the output layer.

neural networks machinelearningflashcards.com


Leaky Rectified Linear Unit
Leaky Rectified Linear Unit (Leaky ReLU) is a variant of the ReLU activation function
that addresses the "dying ReLU" problem. It introduces a small slope for negative input
values, allowing a small gradient to flow and avoiding the complete saturation of
neurons. This helps to alleviate the issue of dead neurons that do not contribute to the
learning process.

Figure: Leaky ReLU activation function; a parameter determines the slope for negative values.

neural network activation functions machinelearningflashcards.com


Minima Of The Loss Function
The minima of the loss function represent the points in the parameter space where
the loss is at its lowest. These points correspond to the optimal values of the
model's parameters that minimize the discrepancy between the predicted and actual
values. Finding the global minimum while avoiding local minima is often the goal of
model training.

Figure: a loss function with two local minima and one global minimum.

theory machinelearningflashcards.com
Categorical features
Categorical features represent distinct, non-numeric types or groups. These
features have a finite number of unique values, often representing different levels or
classifications.

Nominal: no inherent ordering of the categories; for example, types of fruit and colors.

Ordinal: inherent ordering of the categories; for example, levels of pain.

features machinelearningflashcards.com
Noisy Rectified Linear Unit
Noisy ReLU is a variant of the Rectified Linear Unit (ReLU) activation function that
introduces random noise during the forward propagation of the neural network. It
helps reduce overfitting in the model.

Figure: Noisy ReLU activation function; random noise of a given magnitude is added to the input.

neural network activation functions machinelearningflashcards.com


Gradient Clipping
Gradient clipping limits the magnitude of gradients to a predefined threshold,
ensuring stable optimization and preventing issues such as exploding gradients. This
is typically achieved by rescaling gradients if their norm exceeds a threshold.

if ‖g‖ > threshold: g ← (threshold / ‖g‖) · g, where g is the gradients and ← is the assignment operator.

neural networks machinelearningflashcards.com


C
Inverse Of Regularization Strength
C, the inverse of regularization strength, is a hyperparameter used in some machine
learning models, such as logistic regression and support vector machines. A larger C
value signifies weaker regularization and a more complex model, while a smaller C
value corresponds to stronger regularization and a simpler model.

Figure: cost function with L1 regularization; C is the inverse of the regularization strength.

regularization machinelearningflashcards.com
Model Consistency
Model consistency refers to the property of a machine learning model where the probability that the error between the predicted output and the true output is greater than some amount approaches zero as the number of samples increases towards infinity.

P(|ŷ − y| > ε) → 0 as the number of samples approaches infinity, where ŷ is the predicted output, y is the true output, and ε is the error threshold.

theory machinelearningflashcards.com
L1 Norm
L1 norm, also known as the Manhattan norm or the Taxicab norm, is a way to measure
the magnitude of a vector by summing the absolute values of its components.

‖x‖₁ = Σᵢ |xᵢ|, where |xᵢ| is the absolute value of the i-th component of x.

norms machinelearningflashcards.com
Curse of Dimensionality
The curse of dimensionality is a phenomenon where the performance of algorithms
deteriorates as the number of features increases. This is due to the exponential
increase in the volume of the space, making it difficult to obtain enough data to properly
sample the space.

For example:

Imagine a dataset with only two features. It is easy to understand the relationships
between them. However, if we add more features, the dimensionality of the data increases and thus the number of data points needed to fill the space becomes exponentially larger.

problems machinelearningflashcards.com
K-Means Clustering
K-Means clustering partitions samples into k clusters based on the similarity of the
samples. It assigns each sample to the cluster whose centroid (mean) is closest to it,
minimizing the within-cluster sum of squared distances. The algorithm iteratively
updates the cluster assignments and centroids until convergence is reached.

Example K-Means Clustering Steps

1. Select k random points to be cluster centroids.
2. Repeat until no samples change cluster membership:
   - Assign each sample to the nearest centroid.
   - Update each cluster's centroid to be the mean of all its members.
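A minimal scikit-learn sketch of the procedure above; the blob dataset and k=3 are illustrative assumptions:

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=150, centers=3, random_state=0)  # toy data

# Iteratively assigns samples to the nearest centroid and updates centroids
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(kmeans.cluster_centers_)  # centroid (mean) of each cluster
print(kmeans.labels_[:10])      # cluster assignment per sample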

clustering machinelearningflashcards.com
Weak Learners
A weak learner is a simple model that performs slightly better than random guessing, but
not well enough to be useful on its own. They serve as building blocks in ensemble
methods, where many weak learners are combined to create a more powerful "strong"
learner.

original data weak learner 1 weak learner 2 strong learner

ensemble machinelearningflashcards.com
Downsampling
Downsampling is a strategy to handle data with imbalanced classes by randomly removing samples from the majority class. Downsampling helps prevent the model from being biased towards the dominant class and improves the performance of the classifier by increasing the visibility of the minority class.

Figure: imbalanced data with a majority class and a minority class, and the downsampled data after samples from the majority class are removed.

pre-processing machinelearningflashcards.com
Bias
Intuition
Bias is the stubbornness of the learning algorithm in the face of new data. Some
bias is necessary for generalization but too much will produce poor models.

Figure: bias is how easily the fitted line bends in response to the new data.

bias vs. variance machinelearningflashcards.com


Backpropagation
Backpropagation, or backprop, is an optimization algorithm in neural networks that
adjusts weights to minimize error. It calculates the gradient of the loss function for
each weight using the chain rule, propagating the error backward. This fine-tunes the
weights for more accurate predictions.

Figure: a forward pass runs through the network from the input layer to the output layer, then the error is backpropagated through the network.


neural networks machinelearningflashcards.com
Epoch
An epoch is a single iteration of training a neural network on the entire training set.
During an epoch, the model sequentially processes each training sample, calculates the
loss, and updates its parameters based on the gradients. The number of epochs
determines how many times the model iterates through the entire training set,
allowing it to learn and refine its parameters over multiple passes.

Example: imagine a training set with 3 samples: A, B, and C. Passing all three samples through the neural network once constitutes 1 epoch.
neural networks machinelearningflashcards.com
Early stopping
Early stopping is a technique to prevent overfitting by stopping the training process
before the model becomes too complex. It involves monitoring a validation metric
(e.g. validation loss) and terminating training when the metric no longer improves.
This helps to find the optimal point where the model performs well on unseen data.

Figure: the validation metric plotted over training iterations for the training set and the testing set; training stops where the testing set metric no longer improves.
regularization machinelearningflashcards.com
Elastic Net
Elastic Net is a regularization technique that combines both L1 (Lasso) and L2 (Ridge)
regularization to improve the performance of linear regression models. This allows
Elastic Net to simultaneously perform feature selection and handle multicollinearity.

Figure: Elastic Net adds both an L1 penalty and an L2 penalty to the linear regression loss.

regularization machinelearningflashcards.com
Gradient Cliffs
A gradient cliff is where the gradients of a neural network's loss function experience
an extreme and sudden change. This happens when a model's parameters are updated,
and the gradients become very large or very small. Gradient cliffs can hinder
convergence and make the optimization algorithm overshoot a minimum.

Figure: a sudden decrease in loss at a gradient cliff causes the optimizer to overshoot the minimum.

neural networks machinelearningflashcards.com


Kullback-Leibler divergence loss
Kullback-Leibler (KL) divergence loss is a loss function commonly used in generative
models to measure the difference between two probability distributions, such as the
predicted distribution of generated data and the true distribution. It is calculated as the
sum of the product of the true probability and the logarithm of the ratio between the
true and predicted probabilities.

D_KL(P‖Q) = Σᵢ P(i) · log(P(i) / Q(i)), where P is the true probability distribution, Q is the predicted probability distribution, and i ranges over each possible value.

loss functions machinelearningflashcards.com


Stochastic gradient descent
Stochastic gradient descent (SGD) is an optimization algorithm used to update the
model's parameters by iteratively minimizing the loss function with a subset of the
training data at each step. SGD randomly selects a small batch of training examples and
computes the gradient of the loss function with respect to the parameters. The gradient
is then used to update the model's parameters in the direction of steepest descent.

θ_{t+1} = θ_t − η · ∇L(θ_t), where θ_t are the parameters at iteration t, η is the learning rate, ∇ denotes the gradient, and L is the loss function for a single sample.

optimizers machinelearningflashcards.com
Common Output Layer
Activation Functions
An output layer activation function processes the outputs of the neural network's
final layer to produce a desired range of outputs, such as a probability distribution
over the predicted classes.

Sigmoid Activation Function: maps any input value to a range between 0 and 1, commonly used for binary classification problems.

Softmax Activation Function: scales the outputs of the model to represent a probability distribution over classes, commonly used for multi-class classification problems.
neural networks machinelearningflashcards.com


Model Complexity
As model complexity increases, it becomes more capable of capturing intricate
patterns in the training data, resulting in lower training error. However, if the model
becomes too complex, it may start to overfit the training data and perform poorly on
unseen test data, leading to higher test error.

Figure: errors on the training set and testing set plotted against model complexity; training error keeps falling while testing error eventually rises.
theory machinelearningflashcards.com
Feature Matrix
A feature matrix, also known as a design matrix or input matrix, is a two-
dimensional array in which rows represent individual samples and columns
represent the features. Each element contains the value of a specific feature for a
particular sample. The feature matrix is an input for model training.

Figure: columns are features and rows are samples.

data machinelearningflashcards.com
Binary Cross-Entropy Loss
Binary cross-entropy loss measures the difference between the predicted and true
binary classification outputs. It is the negative logarithm of the predicted probability
of the true class.

L = −[y · log(p) + (1 − y) · log(1 − p)], where y is the true class and p is the predicted probability of being the class the model is trying to predict.
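A minimal NumPy sketch of binary cross-entropy; the labels and predicted probabilities are made-up examples:

import numpy as np

def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    # Negative log of the probability assigned to the true class, averaged over samples
    y_true = np.asarray(y_true, dtype=float)
    p_pred = np.clip(np.asarray(p_pred, dtype=float), eps, 1 - eps)
    return -np.mean(y_true * np.log(p_pred) + (1 - y_true) * np.log(1 - p_pred))

print(binary_cross_entropy([1, 0, 1], [0.9, 0.2, 0.7]))  # ~0.228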

loss functions machinelearningflashcards.com


Feature Importance
Feature importance is the influence that each feature has on the predictions. It helps
determine feature selection, interpretation of the model, and potential improvements
to the model's performance.

Example: The decision trees in a random forest make splits that maximize the
decrease in impurity. By calculating the mean decrease in impurity for each feature
across all trees, the importance of each feature is quantified.

Figure: bar chart of feature importance for feature 1, feature 2, and feature 3.


theory machinelearningflashcards.com
Decision Tree Regression
Decision tree regression uses a decision tree structure to predict continuous numerical values. It recursively splits the training data based on the features to create a tree-like model, where each leaf node represents a predicted value. During prediction, the algorithm follows the decision path in the tree to reach a leaf node providing the predicted value.

Figure: decision tree predictions plotted against the data.

tree-based models machinelearningflashcards.com


Minmax scaling
MinMax scaling, also known as min-max normalization, is a feature scaling technique
used to rescale the values of a feature into a specified range, typically between 0 and
1. It works by subtracting the minimum value from each data point and then dividing
it by the range (maximum value minus minimum value). This ensures that the
transformed feature values are proportionally adjusted to fit within the desired
range.
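A minimal NumPy sketch of min-max scaling (scikit-learn's MinMaxScaler does the same); the feature values are made-up:

import numpy as np

x = np.array([10.0, 20.0, 50.0])  # made-up feature values

# (x - min) / (max - min) rescales values into the range [0, 1]
x_scaled = (x - x.min()) / (x.max() - x.min())
print(x_scaled)  # [0.   0.25 1.  ]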


pre-processing machinelearningflashcards.com
Bootstrap Sampling
Bootstrap sampling simulates obtaining many new datasets by repeated sampling
with replacement from the original dataset. By generating multiple samples from
the original dataset, it allows for better assessment of model performance.
original data (x1, x2): (A, 10), (B, 20), (C, 30)

Random sampling with replacement produces, for example:
new dataset 1: (A, 10), (B, 20), (A, 10)
new dataset 2: (B, 20), (A, 10), (B, 20)
new dataset 3: (B, 20), (C, 30), (C, 30)
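A minimal NumPy sketch of bootstrap sampling; the dataset is a made-up example:

import numpy as np

data = np.array([[1, 10], [2, 20], [3, 30]])  # made-up original dataset
rng = np.random.default_rng(0)

# Each bootstrap sample draws rows with replacement, same size as the original
for i in range(3):
    rows = rng.integers(0, len(data), size=len(data))
    print(f"new dataset {i + 1}:", data[rows].tolist())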
sampling machinelearningflashcards.com
What Does It Mean To
Learn In Machine Learning
Learning refers to the process by which a machine learning model automatically
adjusts its internal parameters or weights based on the provided training data.
Through this iterative process, the model learns patterns, relationships, and
underlying structures in the data, allowing it to make predictions or perform tasks
without being explicitly programmed.

“A computer program is said to learn from experience E with respect to some class
of tasks T and performance measure P if its performance at tasks in T, as
measured by P, improves with experience E.” -Tom Mitchell (1997)

theory machinelearningflashcards.com
Gradient Descent
Gradient descent is an optimization algorithm that minimizes the loss function and
finds optimal values for a model's parameters. Gradient descent iteratively updates
the parameters in the direction of steepest descent by computing the gradients of
the loss function with respect to the parameters. This process continues until a
minimum of the loss function is reached or other criteria are met.
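A minimal sketch of gradient descent on a made-up one-parameter quadratic loss L(w) = (w − 3)²:

# Gradient of the made-up loss L(w) = (w - 3)^2
def gradient(w):
    return 2 * (w - 3)

w, learning_rate = 0.0, 0.1
for step in range(50):
    w -= learning_rate * gradient(w)  # step in the direction of steepest descent

print(round(w, 4))  # approaches the minimum at w = 3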

Figure: loss function with parameter values descending from iteration 1 through iteration 4 toward the minimum.
optimizers machinelearningflashcards.com
Ensemble Methods
Ensemble methods make more accurate predictions by aggregating the outputs of many individual models through voting, averaging, or weighted combinations. Ensemble
methods can often achieve better performance, enhance generalization, and improve
robustness compared to using a single model.

ensemble machinelearningflashcards.com
One Hot Encoding

One Hot Encoding is a technique used to represent categorical variables as binary vectors. It converts each category into a binary feature where a value of 1 indicates the presence of that category and 0 indicates its absence. This encoding is useful for machine learning algorithms that cannot directly handle categorical data and require numerical inputs.

one hot encoding of the Fruit column:

Fruit   Apple  Pear
Apple     1      0
Pear      0      1
Apple     1      0
Pear      0      1
Apple     1      0
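A minimal pandas sketch of the same encoding (scikit-learn's OneHotEncoder is an alternative); the Fruit column is the made-up example above:

import pandas as pd

df = pd.DataFrame({"Fruit": ["Apple", "Pear", "Apple", "Pear", "Apple"]})

# Each category becomes its own binary column
one_hot = pd.get_dummies(df["Fruit"])
print(one_hot)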

pre-processing machinelearningflashcards.com
Adaboost
AdaBoost, or Adaptive Boosting, is an ensemble learning technique that combines
multiple weak classifiers to form a strong classifier. By adjusting the weights of
misclassified instances and training new weak classifiers on these weights, AdaBoost
improves classification performance.

Example AdaBoost Steps

1. Assign every sample an initial weight value.
2. Train a "weak" model - often a decision tree.
3. For each sample: if predicted correctly, decrease its weight; if predicted incorrectly, increase its weight.
4. Train a new weak model where samples with greater weight are given higher priority.
5. Repeat steps 3 and 4 until samples are perfectly predicted or a preset number of weak models have been trained.

ensemble learning machinelearningflashcards.com


MNIST Dataset
The Modified National Institute of Standards and Technology (MNIST) dataset provides
a collection of 70,000 handwritten digits for use in image recognition tasks.
Researchers commonly use this dataset to train and test algorithms.

Figure: the pixels of a handwritten digit.
datasets machinelearningflashcards.com
Bag Of Words
A bag of words is a technique in natural language processing that converts text into
numerical format by creating a vocabulary of unique words and counting their
occurrences. Ignoring word order, each word frequency serves as a feature in a
feature vector.

Raw Text                      Bag of Words (the, cat, is, on, mat, dog, sat)
"The cat is on the mat."      2, 1, 1, 1, 1, 0, 0
"The dog sat on the mat."     2, 0, 0, 1, 1, 1, 1

natural language processing machinelearningflashcards.com


False Negative Rate
False negative rate (FNR) is a metric used to evaluate the performance of a binary
classification model. It measures the proportion of positive instances that are
incorrectly classified as negative by the model.

FNR = FN / (FN + TP), where FN is the number of positive instances incorrectly classified as negative and TP is the number of correctly classified positive instances.

metrics machinelearningflashcards.com
Generalization
Generalization refers to the ability of a model to perform well on data that it has
not been trained on. It indicates the model's capacity to capture underlying patterns
and make accurate predictions on new, unseen examples.

Figure: training data is used to fit a model, and the model is then applied to unseen data.

theory machinelearningflashcards.com
Training Error Rate
Training error rate is the number of incorrect predictions divided by the total
number of predictions in the training set. While a lower training error rate is
generally better, it's important to avoid overfitting, where the model performs well on
the training data but poorly on unseen data.

metrics machinelearningflashcards.com
K-Fold Cross-Validation
K-Fold Cross-Validation assesses the performance and generalization ability of a
machine learning model. It involves splitting the dataset into k equally sized folds,
using k-1 folds for training and the remaining fold for validation. This process is
repeated k times, with each fold serving as the validation set exactly once.

Figure: cross-validation is repeated k times; in each repetition one fold is used for validation and the remaining folds are used for training.
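A minimal scikit-learn sketch; the iris dataset, the logistic regression model, and k=5 are illustrative assumptions:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# Each of the k=5 folds serves as the validation set exactly once
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(scores)         # one score per fold
print(scores.mean())  # average estimate of generalization performance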

cross-validation machinelearningflashcards.com
Mean Shift Clustering
Mean Shift clustering is an unsupervised algorithm used to discover clusters in data
without the need for specifying the number of clusters in advance. It works by
iteratively shifting the data points towards the mean of their local neighborhood
until convergence.

Mean Shift Clustering Analogy

Imagine a foggy football field with 100 people standing on it. Because of the fog,
people can only see a short distance. Every 10 seconds, each person looks around and
takes a step in the direction facing the most people they can see.

As time goes on, people start to group up as they repeatedly take steps towards
larger and larger crowds. The end result is clusters of people around the pitch.

clustering machinelearningflashcards.com
Categorical Cross-Entropy Loss
Categorical cross-entropy loss measures the difference between the predicted and
true probability distributions of multiple classes. It is calculated as the negative
logarithm of the predicted probability of the true class.

L = −Σᵢ yᵢ · log(pᵢ), where the sum runs over the number of classes, pᵢ is the predicted probability of the i-th class, and yᵢ is the true probability of the i-th class.

loss functions machinelearningflashcards.com


Linear Activation Function
The linear activation function, also known as the identity function, is a simple
activation function commonly used as a placeholder in neural networks. It returns
the input value as the output without any nonlinearity or transformation.

Figure: linear activation function; the output equals the input.

neural network activation functions machinelearningflashcards.com


Hyperparameter Tuning
Hyperparameter tuning is the process of finding the optimal values for
hyperparameters. Tuning involves systematically exploring different combinations of
hyperparameter values, evaluating the model's performance with those
hyperparameters using a validation set or cross-validation, and selecting the
hyperparameter values that result in the model with the best performance.

hyperparameter tuning machinelearningflashcards.com


K-NN Neighborhood Size
Neighborhood size in k-nearest neighbors (k-NN) affects the model's bias-variance
trade-off. A smaller neighborhood size (lower k) leads to flexible decision boundaries,
reducing bias but increasing variance. A larger neighborhood size (higher k) smooths
out the decision boundaries, reducing variance but increasing bias.

Figure: the same sample of unknown class evaluated with its 11 nearest neighbors and with its 3 nearest neighbors.

clustering machinelearningflashcards.com
Random Forests
Random Forests is an ensemble learning method that combines multiple decision
trees to make predictions. Each tree in the forest is trained on a random subset of
the training data with replacement, and a random subset of features is considered at
each split. By aggregating the predictions of individual trees, Random Forests can
handle complex datasets, reduce overfitting, and provide robust predictions.
Figure: each tree in the forest of decision trees votes by predicting the target class; votes are tallied to reach the final prediction.


tree-based models machinelearningflashcards.com
Training, Validation,
And Test Sets
The training set is the portion of the data used to train the machine learning model
by adjusting its parameters or weights.

The validation set is used to tune the hyperparameters of the model and assess its
performance during training.

The test set is used to evaluate the final performance of the trained model after it
has been trained and validated. The test set provides an unbiased assessment of how
well the model generalizes to unseen data.

theory machinelearningflashcards.com
Gini Index
In Random Forests
Gini index is a measure of impurity used for splitting nodes in the decision trees
that make up the random forest. Gini index quantifies the degree of class mixing
within a node by calculating the probability of misclassifying a randomly chosen
sample in the node. The Gini index is minimized when a node contains pure samples
of a single class, and it is used as a criterion to determine the optimal splits in the
decision trees during the construction of the random forest.

Gini = 1 − Σᵢ pᵢ², where the sum runs over the total number of classes and pᵢ is the probability of class i within the node.
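A minimal sketch of the Gini index for a single node; the class probabilities are made-up examples:

import numpy as np

def gini_index(class_probabilities):
    # 1 minus the sum of squared class probabilities within the node
    p = np.asarray(class_probabilities, dtype=float)
    return 1.0 - np.sum(p ** 2)

print(gini_index([1.0, 0.0]))  # 0.0 (pure node)
print(gini_index([0.5, 0.5]))  # 0.5 (maximum mixing for two classes)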
tree-based models machinelearningflashcards.com
Area Under The Curve
The area under the curve (AUC) is a performance metric used to evaluate binary
classification models, particularly in relation to the receiver operating
characteristic (ROC) curve. It quantifies the model's ability to correctly classify,
with a higher AUC value (ranging from 0 to 1) indicating better classification
performance. Random guessing creates an AUC of 0.5. A perfect classifier achieves
an AUC of 1.

Figure: receiver operating characteristic curve plotting true positive rate against false positive rate; the AUC is the area under the curve.


metrics machinelearningflashcards.com
Mini-Batch
Unlike batch training, where each parameter update uses the entire training set, mini-batch training divides the data into smaller batches, allowing for more frequent updates and improved computational efficiency.

Each mini-batch is randomly sampled from the training set and used to compute the
gradients and update the model parameters using optimization algorithms such as
stochastic gradient descent.

The size of the mini-batch is typically chosen based on factors such as available
memory, computational resources, and trade-offs between convergence speed and
parameter update variance.

neural networks machinelearningflashcards.com


L2 Norm
L2 norm, also known as the Euclidean norm, is a way to measure the magnitude of a
vector by taking the square root of the sum of the squared values of its components.

‖x‖₂ = √(Σᵢ xᵢ²), where xᵢ² is the squared value of the i-th component of x.
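A minimal NumPy sketch; the vector is a made-up example:

import numpy as np

x = np.array([3.0, 4.0])  # made-up vector

# Square root of the sum of squared components
print(np.sqrt(np.sum(x ** 2)))  # 5.0
print(np.linalg.norm(x))        # same result using NumPy's norm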

norms machinelearningflashcards.com
Dropout
Dropout is a regularization technique used in neural networks where randomly
selected neurons are temporarily deactivated during training. This forces the
network to learn more robust features. During inference, all neurons are active, but
their outputs are scaled by the dropout rate, ensuring that the model can make
predictions with all neurons contributing effectively.

Figure: a network with deactivated neurons between the input layer and the output layer.

regularization machinelearningflashcards.com
F1 Score
F1 score is a metric used to evaluate the performance of a binary classification model,
combining precision and recall into a single measure. It is the harmonic mean of
precision and recall, providing a balanced measure of the model's accuracy.

F1 = 2 · (precision · recall) / (precision + recall), where recall is the ratio of true positive predictions to the total actual positives and precision is the ratio of true positive predictions to the total predicted positives.
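A minimal sketch computing F1 from made-up true positive, false positive, and false negative counts:

def f1_score(tp, fp, fn):
    precision = tp / (tp + fp)  # true positives / total predicted positives
    recall = tp / (tp + fn)     # true positives / total actual positives
    return 2 * precision * recall / (precision + recall)  # harmonic mean

print(f1_score(tp=8, fp=2, fn=4))  # 0.727...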

metrics machinelearningflashcards.com
Overfitting Vs. Underfitting
Overfitting happens when a model learns to fit the training data too closely,
capturing both the underlying patterns and noise in the data. Underfitting happens
when a model is too simple or lacks the capacity to capture the underlying patterns
in the data.

Figure: samples fit by an underfitting curve and an overfitting curve.

theory machinelearningflashcards.com
Weight Decay
Weight decay is a regularization technique for neural networks. Weight decay adds a
penalty, usually L2 norm, to the loss function. This reduces the magnitude of the
model's weights and thereby improves performance on unseen data.

new loss function = original loss function + λ · (L2 norm of the model parameters), where λ controls the regularization power.

regularization machinelearningflashcards.com
Grid Search
Grid search systematically searches for the optimal hyperparameters for a model. It
involves defining a list of potential hyperparameter values, training a model using
each set of hyperparameters, then evaluating each models' performance using
cross-validation.

By exhaustively trying many possible combinations, grid search helps identify the
best set of hyperparameters that yield the highest performance.
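A minimal scikit-learn sketch; the dataset, the SVC model, and the candidate hyperparameter values are illustrative assumptions:

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Every combination of candidate values is trained and scored with cross-validation
param_grid = {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}
search = GridSearchCV(SVC(), param_grid, cv=5).fit(X, y)

print(search.best_params_)  # hyperparameters with the highest validation score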

hyperparameter tuning machinelearningflashcards.com


Big O
Big O notation describes the performance of an algorithm by measuring the
relationship between its input size and the number of operations it takes to
complete. It helps to compare and analyze the efficiency of different algorithms,
focusing on their worst-case behavior.

Figure: number of operations plotted against input size for O(1), O(log n), O(n), O(n log n), and O(n²).
algorithms machinelearningflashcards.com
Error Types
Type I error is a false positive, incorrectly rejecting the null hypothesis. Type II error
is a false negative, failing to detect a true effect. Here is how to remember them:

Type I: False Positive.
Type II: False Negative.

neural networks machinelearningflashcards.com


RMSprop
RMSprop optimizer adjusts the learning rates of each parameter based on the root
mean square of the past gradients. It uses a moving average of the squared gradients
to normalize the gradient updates and prevent oscillations in optimization.
E[g²]_t = ρ · E[g²]_{t−1} + (1 − ρ) · g_t², θ_{t+1} = θ_t − (η / √(E[g²]_t + ε)) · g_t, where g_t = ∇L(θ_t) is the gradient of the loss function for a single sample at iteration t, ∇ is the gradient operator, E[g²]_t is the moving average of the squared gradients at iteration t, ρ is the decay rate parameter, η is the learning rate, θ_t are the model parameters at iteration t, and ε is a small constant to avoid division by zero.

optimizers machinelearningflashcards.com
Imputation
Imputation is the process of filling in missing values in the data.

Average Imputation: Replace missing values with the mean or median value of the feature. This is simple and effective.

Regression Imputation: Predict missing values using regression models based on other correlated features. More accurate but more computationally costly.

K-Nearest Neighbor Imputation: Predict the missing values by averaging or interpolating using the values of the missing value's nearest neighbors.
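A minimal sketch of average imputation with scikit-learn's SimpleImputer; the data and the missing value are made-up:

import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, 10.0],
              [2.0, np.nan],
              [3.0, 30.0]])  # made-up data with one missing value

# Average imputation: fill missing entries with the column mean
imputer = SimpleImputer(strategy="mean")
print(imputer.fit_transform(X))  # the nan is replaced with 20.0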

pre-processing machinelearningflashcards.com
Sigmoid
The sigmoid activation function converts the input value to a range between 0 and 1,
which is interpreted as a probability. It is commonly used in binary classification tasks.

Figure: sigmoid activation function mapping inputs to the range between 0 and 1.

neural network activation functions machinelearningflashcards.com


Imputation Using K-NN
Imputation using k-nearest neighbors (k-NN) fills in missing values in the data. It
involves finding the k nearest neighbors to a data point with missing values and using
their known values to estimate and impute the missing values.

Figure: the nearest neighbors of a sample are used to estimate and impute its missing value.

pre-processing machinelearningflashcards.com
Loss Functions
A loss function, also called a cost function, measures the difference between the
predicted output of a machine learning model and the true output, which is used to
optimize the model's parameters during training. They are the function we are
trying to train a model to minimize. Here are some examples:

Mean Squared Error (MSE): average squared difference between predicted and true values.

Categorical Cross-Entropy Loss: difference between predicted and true probability distributions of multiple classes.

Binary Cross-Entropy Loss: difference between predicted and true binary classification outputs.

Kullback-Leibler (KL) Divergence Loss: difference between predicted and true probability distributions.

loss functions machinelearningflashcards.com


Bagging
Bootstrap aggregation, or bagging, is a method that improves model stability and
accuracy by combining multiple base models, each trained on a random subset of the
training data. By averaging their predictions, bagging reduces variance and
overfitting in the final model.

Figure: sampling with replacement creates several data subsets; a model is trained on each subset and their outputs are combined by voting to produce the final output.

bagging machinelearningflashcards.com
Tanh
The tanh (hyperbolic tangent) activation function maps the input to a range between
-1 and 1. It is useful for capturing both positive and negative values, enabling the
model to learn complex nonlinear relationships.

Figure: tanh activation function mapping inputs to the range between -1 and 1.

neural network activation functions machinelearningflashcards.com


Motivation For
Deep Learning
“Shallow learning” algorithms like support vector machines work well when we have
well structured feature data. However, they often perform poorly in high dimensional
spaces such as those found in computer vision and natural language. Furthermore,
deep learning techniques often fare better with less pre-processing of the training
data.

neural networks machinelearningflashcards.com


Rectified Linear Unit
The rectified linear unit (ReLU) is an activation function that returns input directly if
it is positive, and zero otherwise. ReLU is computationally efficient, helps mitigate the
vanishing gradient problem, and promotes sparsity in neural networks.

Figure: ReLU activation function; the output is zero for negative inputs and equal to the input for positive inputs.

neural network activation functions machinelearningflashcards.com
