
1 Linear Algebra - Chapters 4 and 5: M.Sc Data Science
1.1 Interpretation of Partial Derivative Values
Within the domain of multivariable calculus, partial derivatives assume fundamental significance in
understanding how a function changes with respect to one of its variables while holding all other variables
constant. In essence, a partial derivative quantifies the rate at which a function changes with respect to
a single variable while keeping all other variables constant.
Consider a function f (x, y), which represents the height of a hilly terrain at any given position (x, y).
The partial derivative of f with respect to the variable x, written ∂f/∂x, indicates the rate of change
in height in the x-direction. A positive value signifies an upward change in height while moving in the
positive x-direction, whereas a negative value denotes a downward change. Similarly, ∂f/∂y measures the
variation in the y-direction.
The magnitude of partial derivatives is significant for interpretation. A larger magnitude indicates
steeper terrain or a more rapid rate of change, while a smaller magnitude implies a gentler incline.
Furthermore, the sign of a partial derivative provides key insights: a positive value signifies an increase
in the function with respect to the corresponding variable, while a negative value indicates a decrease.
A partial derivative value of zero suggests the possibility of a local maximum, minimum, or saddle point
with respect to that variable.
Partial derivatives are crucial in sensitivity analysis across various domains, including economics,
physics, and machine learning. For instance, in economics, one can study how changes in the price of
a product, holding all other variables constant, affect its demand. Thus, partial derivatives transcend
mathematical abstraction and are pivotal in comprehending the behavior of multivariable functions across
diverse scenarios.
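To make these interpretations concrete, the short Python sketch below (using a hypothetical terrain function f(x, y) = x²y + y chosen purely for illustration) approximates both partial derivatives numerically with central differences; the signs and magnitudes can then be read exactly as described above.

def f(x, y):
    # Hypothetical "terrain height"; any differentiable function works here
    return x**2 * y + y

def partial_x(f, x, y, h=1e-5):
    # Central-difference approximation of ∂f/∂x at (x, y)
    return (f(x + h, y) - f(x - h, y)) / (2 * h)

def partial_y(f, x, y, h=1e-5):
    # Central-difference approximation of ∂f/∂y at (x, y)
    return (f(x, y + h) - f(x, y - h)) / (2 * h)

x0, y0 = 1.0, 2.0
print(partial_x(f, x0, y0))  # ≈ 4.0: the terrain rises steeply in the +x direction
print(partial_y(f, x0, y0))  # ≈ 2.0: it also rises, more gently, in the +y direction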

2 Introduction to Gradient Descent


Gradient descent is an iterative optimization technique used to minimize a function. It’s particularly
useful when we have a complex function, such as a machine learning model’s loss function, that does not
have an easy closed-form solution for its minimum. Instead of directly finding the minimum analytically,
gradient descent updates the parameters iteratively in the direction of the steepest decrease in the
function’s value.
The general steps in gradient descent can be described as follows:
1. Start with an initial guess: Choose a starting point for the parameters. This is usually done
randomly or heuristically.
2. Compute the gradient: The gradient represents the slope of the function at the current point.
It tells us the direction of the steepest increase of the function.
3. Update the parameters: Move in the opposite direction of the gradient, as we want to minimize
the function, not maximize it.
4. Repeat: The process is repeated for a set number of iterations or until the change in the function
value is sufficiently small.
Mathematically, the update rule is:
x_new = x_old − α · ∇f(x_old)
where:
• x_old is the current value of the parameters,
• ∇f(x_old) is the gradient of the function at x_old,
• α is the learning rate, which controls the size of the step taken in the direction of the negative gradient.

2.1 Variants of Gradient Descent


While the fundamental idea remains the same, gradient descent has several variants, each suited for
different types of problems.

2.1.1 Batch Gradient Descent (BGD)
In this method, the gradient of the entire dataset is computed before updating the parameters. This
means that every iteration involves computing the gradient for all data points, making it computationally
expensive for large datasets.
• Pros: Produces stable, deterministic updates and converges to the global minimum for convex functions if the learning rate is chosen correctly.
• Cons: Computationally expensive and slow for large datasets.

2.1.2 Stochastic Gradient Descent (SGD)


Instead of computing the gradient using the whole dataset, SGD computes the gradient using just one
randomly selected data point at a time. This introduces noise but leads to faster updates.
• Pros: Faster convergence, works well for large datasets, can escape local minima.
• Cons: The path to convergence is noisy, and the function value may fluctuate significantly.

2.1.3 Mini-batch Gradient Descent


This is a compromise between batch gradient descent and stochastic gradient descent. It computes
the gradient using a small, randomly selected subset (mini-batch) of the dataset rather than the entire
dataset or just one data point. This provides faster convergence while retaining some stability in the
update.
• Pros: Faster convergence than batch GD and more stable than SGD. Often leads to better gener-
alization.
• Cons: Requires choosing an appropriate mini-batch size.
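The three variants above differ only in how much data feeds each gradient computation. The following is a minimal sketch on a synthetic linear-regression loss (the data, batch size, and learning rate are illustrative assumptions, not prescriptions); the commented-out lines show how the batch and stochastic updates would replace the mini-batch one.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))                       # synthetic features
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=1000)

def grad_mse(w, Xb, yb):
    # Gradient of the mean squared error of a linear model on the batch (Xb, yb)
    return 2 * Xb.T @ (Xb @ w - yb) / len(yb)

w = np.zeros(3)
alpha = 0.05

for epoch in range(200):
    # Batch GD: one update per pass, using the whole dataset
    # w = w - alpha * grad_mse(w, X, y)

    # Stochastic GD: one update per single, randomly chosen example
    # i = rng.integers(len(y)); w = w - alpha * grad_mse(w, X[i:i+1], y[i:i+1])

    # Mini-batch GD: one update per small random subset
    idx = rng.choice(len(y), size=32, replace=False)
    w = w - alpha * grad_mse(w, X[idx], y[idx])

print(w)  # approaches the true coefficients [2, -1, 0.5]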

2.2 Convergence and Learning Rate


The learning rate α is crucial to the behavior of the gradient descent algorithm:
• If α is too large, the algorithm may overshoot the minimum and diverge.
• If α is too small, the algorithm may converge too slowly or get stuck in local minima.
Choosing the right learning rate is critical. In practice, adaptive learning rates like Adagrad, RMSprop,
and Adam are often used, which adjust the learning rate during training based on past gradients.
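As a hedged sketch of how such adaptive methods modify the basic update, the code below implements the standard Adam rule for a single parameter (the defaults β1 = 0.9, β2 = 0.999, ε = 1e-8 are the commonly used ones); it is illustrative only, not a drop-in replacement for library optimisers.

import numpy as np

def adam_step(x, grad, m, v, t, alpha=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    # One Adam update: running averages of the gradient (m) and squared gradient (v)
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad**2
    m_hat = m / (1 - beta1**t)          # bias correction for the first moment
    v_hat = v / (1 - beta2**t)          # bias correction for the second moment
    x = x - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return x, m, v

# Example: minimise f(x) = x^2 (gradient 2x), as in Section 2.4, but with Adam
x, m, v = 10.0, 0.0, 0.0
for t in range(1, 2001):
    x, m, v = adam_step(x, 2 * x, m, v, t)
print(x)  # settles close to the minimum at 0 (the exact value depends on the step size)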

2.3 Convergence Criteria


Gradient descent converges when:
• The change in the function value is smaller than a certain threshold.
• The gradient becomes very small (close to zero).
• The maximum number of iterations is reached.
Convergence can be slow, especially for ill-conditioned functions (those whose curvature differs greatly
between directions, so the gradient changes much more rapidly along some directions than others).
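A minimal sketch of these stopping criteria wrapped around the update rule from above (the tolerance and iteration cap are illustrative choices):

import numpy as np

def gradient_descent(f, grad, x0, alpha=0.1, tol=1e-8, max_iter=10000):
    # Stops when f barely changes, when the gradient is tiny, or after max_iter steps
    x = np.asarray(x0, dtype=float)
    for i in range(max_iter):
        g = np.asarray(grad(x), dtype=float)
        if np.linalg.norm(g) < tol:              # gradient close to zero
            break
        x_new = x - alpha * g
        if abs(f(x) - f(x_new)) < tol:           # change in function value below threshold
            x = x_new
            break
        x = x_new
    return x, i + 1

x_min, n_iter = gradient_descent(lambda x: float(x**2), lambda x: 2 * x, x0=10.0)
print(x_min, n_iter)  # ends near 0 well before the iteration cap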

2.4 Example Algorithm for Gradient Descent


Here’s an example of how gradient descent is applied to minimize a simple quadratic function.

2.4.1 Example Function:


Consider the quadratic function:
f(x) = x²
The derivative (gradient) of f(x) is:
f′(x) = 2x
The update rule is:
x_new = x_old − α · 2x_old

2.4.2 Python Code:
Below is the Python code demonstrating gradient descent for the function f(x) = x². The learning rate
α is set to 0.1, and the algorithm runs for 100 iterations, starting from x = 10. The resulting path of
the gradient descent is visualized on the plot.

Python Code Example for Gradient Descent

import numpy as np
import matplotlib.pyplot as plt

# Function and its derivative
def f(x):
    return x**2

def df(x):
    return 2*x

# Gradient Descent Parameters
alpha = 0.1        # Learning rate
iterations = 100   # Number of iterations
x = 10             # Initial guess

# Store the values of x during the process
x_values = [x]

# Gradient Descent Loop
for i in range(iterations):
    x = x - alpha * df(x)
    x_values.append(x)

# Plotting the results
x_range = np.linspace(-10, 10, 100)
y_range = f(x_range)

plt.plot(x_range, y_range, label='f(x) = x^2')
plt.scatter(x_values, f(np.array(x_values)), color='red', label='Gradient Descent Path')
plt.xlabel('x')
plt.ylabel('f(x)')
plt.legend()
plt.title('Gradient Descent on f(x) = x^2')
plt.show()

print(f"Final value of x: {x}")

3 Output of the Gradient Descent Process


The output of the Python code shows the final value of x after 100 iterations. Starting from x = 10 with
α = 0.1, each update multiplies x by (1 − 2α) = 0.8, so after 100 iterations x = 10 · 0.8¹⁰⁰ ≈ 2 × 10⁻⁹,
i.e. the process converges to the global minimum at x = 0, as expected for the quadratic function f(x) = x².
Output (approximately):
Final value of x: 2.04e-09

3.1 Applications of Gradient Descent


Gradient Descent has many applications, especially in machine learning and deep learning, where opti-
mization of parameters is critical.

3.1.1 Linear Regression
Gradient Descent is used to minimize the Mean Squared Error (MSE) in linear regression problems. By
updating the weights in the direction of the negative gradient, we find the best-fitting line.
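A minimal sketch of this idea for a one-dimensional line y = w·x + b, using synthetic data and a fixed learning rate (both chosen purely for illustration):

import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=100)
y = 3.0 * x + 1.0 + rng.normal(scale=0.1, size=100)   # noisy line with slope 3, intercept 1

w, b = 0.0, 0.0
alpha = 0.1

for _ in range(500):
    pred = w * x + b
    error = pred - y
    # Gradients of the mean squared error with respect to w and b
    grad_w = 2 * np.mean(error * x)
    grad_b = 2 * np.mean(error)
    w -= alpha * grad_w
    b -= alpha * grad_b

print(w, b)   # close to the true slope 3 and intercept 1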

3.1.2 Logistic Regression


In binary classification problems, gradient descent is used to minimize the logistic loss function (cross-
entropy loss) for parameter optimization.

3.1.3 Neural Networks


Training neural networks relies heavily on gradient descent. The backpropagation algorithm computes
gradients for the weights of each layer and updates them iteratively. Variants like Stochastic Gradient
Descent (SGD) and Adam are commonly used to optimize the loss function in neural networks.

3.1.4 Support Vector Machines (SVMs)


In SVMs, gradient descent can be used to optimize the hinge loss function for classification tasks.

3.1.5 Reinforcement Learning


Gradient-based methods like policy gradient algorithms use gradient descent to optimize the policy and
value functions.

3.2 Challenges and Improvements


• Local Minima: Gradient descent may get stuck in local minima, especially in non-convex func-
tions. Techniques like Simulated Annealing or using multiple initializations can help mitigate this
issue.
• Learning Rate: A fixed learning rate can lead to slow convergence or divergence. Adaptive
learning rates like Adam or Adagrad are often more effective.
• Vanishing/Exploding Gradients: In deep networks, gradients can become too small or large,
leading to poor convergence. This is often addressed by normalization techniques like Batch Nor-
malization.

3.3 Conclusion
Gradient Descent is a powerful optimization technique that underpins much of modern machine learning
and AI. Its simplicity, combined with its ability to handle high-dimensional data efficiently, makes it
indispensable in practice. However, the choice of learning rate and handling local minima can be tricky,
and thus variants like mini-batch and adaptive gradient descent methods are often preferred in real-world
applications.

4 Understanding Gradient Descent Using Multivariable Calculus
Gradient descent is an optimization algorithm that becomes increasingly important in the context of
functions with multiple variables, as is often the case in machine learning and deep learning. Multi-
variable calculus, especially the concept of gradients and partial derivatives, provides the foundation for
understanding how gradient descent operates in high-dimensional spaces.

4.1 Multivariable Functions and Gradient Descent


Let f : Rⁿ → R be a differentiable function of n variables, where f(x1, x2, . . . , xn) represents a cost or
loss function that needs to be minimized. In machine learning, f might represent the loss of a model,
with the xi ’s corresponding to the model parameters (such as weights and biases).
The objective of gradient descent is to iteratively adjust the parameters to find the optimal values
that minimize f .

4.2 The Gradient and Direction of Steepest Ascent
The gradient of a function f at a point x = (x1, x2, . . . , xn) is the vector of partial derivatives with respect
to each variable:

∇f(x) = (∂f/∂x1, ∂f/∂x2, . . . , ∂f/∂xn)
The gradient points in the direction of the steepest ascent of the function. This means that if we follow
the direction of the gradient, we will increase the value of the function the most rapidly. To minimize f ,
we move in the opposite direction of the gradient, following the direction of steepest descent.

4.3 Update Rule and Learning Rate


In gradient descent, the parameters are updated in the opposite direction of the gradient, as shown by
the following update rule:
x_new = x_old − α · ∇f(x_old)
where:
• x_old = (x1, x2, . . . , xn) is the current parameter vector,
• ∇f(x_old) is the gradient at the current parameter values,
• α > 0 is the learning rate, which controls the step size.


The learning rate α is a crucial hyperparameter in gradient descent. A small α may cause the
algorithm to converge very slowly, while a large α may lead to overshooting, causing the algorithm to
diverge.

4.4 Iterative Process and Convergence


Gradient descent is an iterative process where, starting from an initial guess for xold , the algorithm
updates the parameters x in the direction of the negative gradient. This process continues until one of
the following convergence criteria is met:

• The change in the function value is below a predefined threshold.


• The magnitude of the gradient is sufficiently small (∥∇f(x)∥ ≈ 0).
• A maximum number of iterations is reached.
Convergence is the point at which further updates no longer significantly change the parameters or
improve the function value.

4.5 Example: Gradient Descent for a Multivariable Function


Consider a simple quadratic function of two variables:

f(x1, x2) = x1² + x2²

The gradient of f is:

∇f(x1, x2) = (∂f/∂x1, ∂f/∂x2) = (2x1, 2x2)

Using gradient descent, the update rule becomes:

x1_new = x1_old − α · 2x1_old
x2_new = x2_old − α · 2x2_old

This update rule moves the values of x1 and x2 towards the minimum of the function.

4.5.1 Python Code Example
The following Python code demonstrates gradient descent applied to the above quadratic function:

import numpy as np
import matplotlib.pyplot as plt

# Function and its gradient
def f(x1, x2):
    return x1**2 + x2**2

def grad_f(x1, x2):
    return np.array([2*x1, 2*x2])

# Gradient Descent Parameters
alpha = 0.1                   # Learning rate
iterations = 50               # Number of iterations
x = np.array([5.0, 5.0])      # Initial guess

# Store the values of x during the process
x_values = [x]

# Gradient Descent Loop
for i in range(iterations):
    grad = grad_f(x[0], x[1])
    x = x - alpha * grad
    x_values.append(x)

# Plotting the results
x1_range = np.linspace(-5, 5, 100)
x2_range = np.linspace(-5, 5, 100)
X1, X2 = np.meshgrid(x1_range, x2_range)
Z = f(X1, X2)

plt.contour(X1, X2, Z, 50, cmap='jet')

x_values = np.array(x_values)
plt.plot(x_values[:, 0], x_values[:, 1], 'ro-', label='Gradient Descent Path')
plt.xlabel('x1')
plt.ylabel('x2')
plt.legend()
plt.title('Gradient Descent on f(x1, x2) = x1^2 + x2^2')
plt.show()

print(f"Final values: x1 = {x[0]}, x2 = {x[1]}")

4.6 Gradient Descent in High-Dimensional Spaces


In practical applications, such as training deep neural networks, the functions to minimize are often
high-dimensional. For instance, deep learning models may have millions of parameters, each of which
contributes to the overall loss. The gradient in such high-dimensional spaces can be computed using
backpropagation, an efficient method that propagates gradients backward through the layers of a neural
network.

4.7 Applications of Gradient Descent in Machine Learning


Gradient descent is widely used in various machine learning and deep learning algorithms:
• Linear Regression: Minimizes the sum of squared errors between predicted and actual values by
updating the model parameters.

• Logistic Regression: Used for classification problems, where the gradient descent algorithm
minimizes the log loss (cross-entropy loss).
• Neural Networks: The backpropagation algorithm uses gradient descent to update the weights
of neurons to minimize the loss function.

• Support Vector Machines (SVMs): Gradient descent can be used to optimize the hinge loss
function in SVMs.
• Reinforcement Learning: Policy gradient methods use gradient descent to optimize the agent’s
policy.

4.8 Challenges and Improvements in Gradient Descent


While gradient descent is powerful, it faces several challenges:
• Local Minima and Saddle Points: In non-convex functions, gradient descent can get stuck in
local minima or saddle points. Methods like Stochastic Gradient Descent (SGD) and Momentum
help mitigate these issues by introducing randomness or memory into the updates.
• Learning Rate Selection: Choosing an appropriate learning rate is crucial. Too large a value
can lead to overshooting, while too small a value can slow convergence. Adaptive methods like
Adam dynamically adjust the learning rate during training.

• Exploding/Vanishing Gradients: In deep neural networks, gradients can either become too
large or too small, making training unstable or slow. Techniques like Gradient Clipping and Batch
Normalization help address these issues.

4.9 Conclusion
Gradient descent, when applied to multivariable functions, is a cornerstone of optimization in machine
learning and deep learning. It allows for efficient parameter updates in high-dimensional spaces, en-
abling the training of complex models. However, choosing an appropriate learning rate, dealing with
local minima, and managing large datasets are key challenges that require careful handling in practical
applications.

5 Maxima and Minima in Multivariable Calculus


In calculus, maxima and minima represent the highest and lowest values of a function, respectively. For
multivariable functions, the analysis becomes more intricate, as the functions map vectors from Rⁿ to
R, creating surfaces in three or higher dimensions.
Critical points occur where the gradient ∇f is zero, indicating no increase or decrease in the function's
immediate vicinity. To classify these points, the second-derivative test is used. For multivariable functions,
the Hessian matrix, composed of second-order partial derivatives, is examined. For a function of two
variables, the nature of a critical point is determined as follows:

• Local Minimum: the Hessian determinant is positive and the diagonal entries are positive (the
Hessian is positive definite).
• Local Maximum: the Hessian determinant is positive and the diagonal entries are negative (the
Hessian is negative definite).
• Saddle Point: the Hessian determinant is negative.

Global extrema may also occur at the boundaries of the function’s domain. Evaluating function
values at critical points and along the boundary helps identify global maxima and minima.
The concepts of maxima and minima in multivariable calculus are essential in optimizing complex
systems and understanding natural phenomena. By analyzing functions in multidimensional spaces,
one can extract insights concealed within high-dimensional data, enabling advancements in science and
engineering.
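As a minimal sketch of this classification procedure (assuming SymPy is available; the function f(x, y) = x³ − 3x + y² is chosen only for illustration), the critical points are found from the gradient and then classified with the Hessian determinant test described above:

import sympy as sp

x, y = sp.symbols('x y')
f = x**3 - 3*x + y**2          # example function, chosen for illustration

# Critical points: where both partial derivatives vanish
grad = [sp.diff(f, v) for v in (x, y)]
critical_points = sp.solve(grad, (x, y), dict=True)

H = sp.hessian(f, (x, y))      # matrix of second-order partial derivatives

for pt in critical_points:
    H_pt = H.subs(pt)
    det = H_pt.det()
    fxx = H_pt[0, 0]
    if det > 0 and fxx > 0:
        kind = "local minimum"
    elif det > 0 and fxx < 0:
        kind = "local maximum"
    elif det < 0:
        kind = "saddle point"
    else:
        kind = "inconclusive"
    print(pt, kind)  # classifies (1, 0) as a local minimum and (-1, 0) as a saddle point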

6 Application of Gradient Descent in Optimisation Problems
Abstract
Optimisation problems are fundamental across various disciplines, including machine learning, economics,
and engineering. These problems often involve minimising or maximising an objective function. Gradient
descent, an iterative optimisation method, plays a pivotal role in solving such problems, particularly
when analytical solutions are challenging to obtain. This paper explores the mechanism, significance,
and applications of gradient descent in optimisation tasks.

7 Introduction
The core goal of gradient descent is to iteratively update parameters to minimise a cost function. By
computing the gradient of the function with respect to the parameters at a specific point, the algorithm
moves in the opposite direction of the gradient, ensuring a consistent reduction in the function’s value.
Over iterations, this approach converges to the function’s minimum.
The learning rate, a crucial parameter, determines the step size for each iteration. A high learning
rate risks overshooting the minimum, leading to instability, while a low learning rate results in slow
convergence. Adaptive strategies are often employed to balance these trade-offs.

8 Applications of Gradient Descent


8.1 Machine Learning
Gradient descent is integral to machine learning, especially in training neural networks. The algorithm
adjusts weights and biases to minimise the loss function, which measures the discrepancy between pre-
dicted and actual outputs. Beyond neural networks, gradient descent is utilised in logistic regression,
support vector machines, and other machine learning models.

8.2 Economics and Engineering


In economics, optimisation techniques maximise utility functions, while in engineering, gradient descent
fine-tunes system parameters for optimal performance.

9 Challenges and Variations


Non-convex functions can lead to local minima, yielding suboptimal solutions. Variants such as stochas-
tic gradient descent (SGD), mini-batch gradient descent, and momentum-based methods address these
challenges effectively.

10 Conclusion
Gradient descent is a robust, iterative approach for solving complex optimisation problems. Its concep-
tual simplicity and wide applicability make it a fundamental tool across disciplines.

11 Summary
• Module 4 explored multivariable calculus and its real-world implications.
• Partial derivatives provide insights into a function’s behaviour concerning individual variables.
• Gradient descent leverages these derivatives to iteratively optimise parameters and minimise cost
functions.
• The module highlighted the practical relevance of these concepts in machine learning, economics,
and engineering.

12 Keywords
• Partial Derivatives: Derivatives concerning one variable while keeping others constant.
• Gradient Descent: An optimisation technique to minimise cost functions iteratively.
• Multivariable Calculus: The extension of calculus to functions with multiple variables.

• Maxima and Minima: Points representing the highest or lowest values within a range.
• Optimisation Problems: Mathematical challenges to identify the best solution among feasible
options.

13 Self-Assessment Questions
1. How does a partial derivative in multivariable calculus differ from a conventional derivative in
single-variable calculus?
2. Explain the gradient descent technique and its use in determining a function’s minimum.

3. How are maxima and minima identified in multivariable calculus, and why are they significant?
4. How does multivariable calculus enhance the understanding of gradient descent, particularly in
higher dimensions?
5. What are the potential drawbacks of gradient descent, and why is it a preferred method for opti-
misation problems?

14 Case Study: Optimising Product Recommendations with Gradient Descent
14.1 Introduction
In e-commerce, personalised product recommendations are critical for enhancing sales and user engage-
ment. Using gradient descent, recommendation systems can adapt dynamically to user preferences,
ensuring relevance and accuracy.

14.2 Background
TechNova’s recommendation system evaluates user data, including browsing behaviour, product ratings,
purchase history, and trends. However, its algorithm struggles to adapt to evolving user behaviour,
necessitating a more sophisticated approach.

14.3 Task
As a data scientist, your goal is to integrate gradient descent into TechNova’s recommendation algorithm
to dynamically adjust weights for improved accuracy.

14.4 Key Considerations


• How should weights be assigned to features like browsing behaviour and purchase history?
• What strategies ensure adaptability to user behaviour while avoiding overfitting?

• What challenges might arise in implementing gradient descent in a multivariate system?


• How should testing and validation assess the effectiveness of the improved algorithm?

14.5 Recommendations
• Analyse the limitations of the existing algorithm.
• Use gradient descent to optimise feature weights adaptively.
• Implement mechanisms to avoid overfitting, such as regularisation and adaptive learning rates.
• Employ A/B testing to evaluate improvements in real-time scenarios.
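As a hedged sketch of the second and third recommendations above (the feature names, synthetic data, and regularisation strength are hypothetical, purely for illustration), gradient descent can adapt the weights of a linear relevance score while an L2 penalty discourages overfitting:

import numpy as np

rng = np.random.default_rng(1)

# Hypothetical user-item features: browsing score, rating score, purchase-history score
X = rng.uniform(size=(500, 3))
# Hypothetical observed engagement signal the weights should predict
y = X @ np.array([0.6, 0.3, 0.1]) + rng.normal(scale=0.05, size=500)

w = np.zeros(3)
alpha, lam = 0.1, 0.01          # learning rate and L2 regularisation strength

for _ in range(1000):
    pred = X @ w
    # Gradient of mean squared error plus the L2 penalty lam * ||w||^2
    grad = 2 * X.T @ (pred - y) / len(y) + 2 * lam * w
    w -= alpha * grad

print(w)  # weights move towards the (hypothetical) underlying values [0.6, 0.3, 0.1]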

14.6 Conclusion
Integrating gradient descent can significantly enhance recommendation systems, ensuring personalised
and engaging user experiences.

15 References
1. Olver, P.J., & Shakiban, C. (2006). Applied Linear Algebra. Upper Saddle River, NJ: Prentice
Hall.
2. Bile Hassan, I., Ghanem, T., et al. (2021). Data science curriculum design: A case study. Pro-
ceedings of the 52nd ACM Technical Symposium on Computer Science Education, 529-534.
3. Ozdemir, S. (2016). Principles of Data Science. Packt Publishing Ltd.
4. Potters, M., & Bouchaud, J.P. (2020). A First Course in Random Matrix Theory: For Physicists,
Engineers, and Data Scientists. Cambridge University Press.
5. Cooper, S. (2018). Data Science from Scratch.

16 Principal Components Analysis (PCA)


16.1 Introduction to Principal Components Analysis (PCA)
Principal Components Analysis (PCA) is a widely employed statistical technique in data analysis aimed
at reducing the complexity of high-dimensional datasets. PCA seeks to comprehend and capture the
variability within the data, reducing its dimensionality while preserving a substantial portion of its initial
variance. This reduction ensures a high level of precision in data representation.
The primary rationale for employing PCA is the existence of correlations among variables in real-
world datasets, which often lead to redundancy. PCA addresses this issue by identifying a new set
of orthogonal axes, called principal components, which concentrate most of the data’s variance in the
first few components while leaving only little variance in the remaining ones. These principal components
are mutually orthogonal and therefore linearly independent.
To visualize this concept, consider a collection of data points distributed in three-dimensional space.
While the data extends in all three dimensions, there may exist a predominant direction of maximum
variance. This direction defines the first principal component. The second principal component is
orthogonal to the first and captures the next highest variance, and so on.
PCA offers several advantages, including overcoming limitations of the original data properties and
revealing inherent structures and relationships among variables. Its applications span various fields such
as banking, biology, and image compression. Due to its ability to condense information and enhance
data interpretability, PCA remains an essential tool for data analysts and scientists worldwide.

16.2 Understanding How PCA Works Using Linear Algebra


PCA is fundamentally rooted in the principles of linear algebra. It leverages mathematical techniques
to transform high-dimensional data into a lower-dimensional representation while retaining key charac-
teristics.
The process begins with the computation of the covariance matrix, which encapsulates the variances
and covariances of feature pairs in the dataset. This symmetric matrix represents the relationships be-
tween changes in different features, with diagonal elements indicating variances and off-diagonal elements
representing covariances.

Next, the eigenvalues and eigenvectors of the covariance matrix are computed. The eigenvectors define
the new orthogonal axes (principal components) capturing the maximum variance in the data, while the
eigenvalues quantify the variance magnitude along these axes. The first principal component corresponds
to the eigenvector with the largest eigenvalue, followed by subsequent components in decreasing order of
eigenvalues.
The original data is then projected onto the eigenvectors through matrix multiplication, reducing
dimensionality by retaining only the components associated with the largest eigenvalues. This projection
transforms the dataset into the principal component space, simplifying its structure while preserving
essential information for further analysis or visualization.
From a linear algebra perspective, PCA identifies a new set of bases for the dataset, aligning them
with directions of maximum variance. This approach eliminates redundancy and correlated features,
ensuring that key data characteristics are retained. The mathematical robustness and elegance of PCA
make it a cornerstone in data analysis.
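A minimal NumPy sketch of the steps just described, applied to a synthetic dataset (the data and the choice of two retained components are illustrative):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))              # synthetic dataset: 200 samples, 5 features
X = X - X.mean(axis=0)                     # centre each feature

cov = np.cov(X, rowvar=False)              # covariance matrix of the features
eigvals, eigvecs = np.linalg.eigh(cov)     # eigendecomposition of the symmetric matrix

# Sort eigenvalues (and matching eigenvectors) in decreasing order
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

k = 2                                      # number of principal components to keep
X_reduced = X @ eigvecs[:, :k]             # project data onto the top-k components

print(X_reduced.shape)                     # (200, 2)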

16.3 Implications and Applications of PCA in Data Science


PCA has significant implications and diverse applications in data science:

• Data Reduction and Efficiency: PCA reduces the dimensionality of datasets, improving com-
putational efficiency by retaining essential variance and minimizing information loss.
• Overcoming Multicollinearity: By transforming correlated variables into orthogonal principal
components, PCA mitigates issues of multicollinearity, enhancing model stability and interpretabil-
ity.
• Visualization: PCA facilitates visualization of high-dimensional data by reducing it to two or
three dimensions, enabling insightful scatter plots and 3D graphs.
• Improved Model Performance: By addressing the "curse of dimensionality," PCA reduces
overfitting in machine learning models, improving their generalization to unseen data.
• Domain-Specific Utility: PCA finds applications across domains, including genetics (population
structure analysis), finance (portfolio construction), and image processing (compression and face
recognition).

PCA bridges the gap between large datasets and simplicity, making it an indispensable tool for
extracting meaningful insights.

16.4 The Role of Eigendecomposition in PCA


Eigendecomposition, a fundamental concept in linear algebra, is central to PCA. To maximize the vari-
ance captured by principal components, the process begins with the computation of the covariance
matrix, which encodes linear relationships among variables.
The covariance matrix is then decomposed into eigenvalues and eigenvectors through eigendecom-
position. Each eigenvalue quantifies the variance captured along a corresponding eigenvector (principal
component). Larger eigenvalues indicate more significant variance. Eigenvectors define the orientation
of principal components in feature space, ensuring orthogonality and preserving data integrity during
dimensionality reduction.
Eigenvalues are sorted in descending order, and principal components associated with smaller eigen-
values are discarded. The data is then projected onto the retained eigenvectors, resulting in a lower-
dimensional dataset aligned with directions of maximum variance.
Eigendecomposition underpins the mathematical framework of PCA, enabling efficient dimensionality
reduction while retaining critical information. This synergy highlights the importance of linear algebra
in data analysis and dimensionality reduction.
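In practice, the sorted eigenvalues also determine how many components to retain. The sketch below uses a hypothetical eigenvalue spectrum and a common (but arbitrary) 95% variance threshold:

import numpy as np

# Hypothetical eigenvalues of a covariance matrix, already sorted in decreasing order
eigvals = np.array([4.2, 2.1, 0.9, 0.5, 0.2, 0.1])

explained_ratio = eigvals / eigvals.sum()          # variance captured by each component
cumulative = np.cumsum(explained_ratio)

# Smallest number of components whose cumulative explained variance reaches 95%
k = int(np.searchsorted(cumulative, 0.95) + 1)
print(explained_ratio.round(3))
print(f"Keep {k} components to retain at least 95% of the variance")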

17 5.5 Advantages of PCA in Dimensionality Reduction and Data Visualisation
Principal Component Analysis (PCA) is widely used in data science and statistics for its effectiveness
in reducing dimensionality and enhancing data visualisation. High-dimensional datasets often contain

noise, which refers to undesired deviations or random fluctuations in the data. PCA effectively detects
and emphasizes the most prominent patterns, known as the principal components, within a dataset.
This process helps in removing unwanted noise, resulting in more meaningful data. By eliminating less
significant features with low variance, PCA condenses the data into a more refined form, potentially
improving model performance.
Reducing the number of features speeds up the training process of machine learning models. Fewer
features lead to decreased computational requirements, facilitating faster training and prediction times.
This efficiency is crucial when handling large datasets, where processing resources and time constraints
are critical.
Using high-dimensional spaces in machine learning algorithms can lead to overfitting. PCA addresses
the "curse of dimensionality" by reducing the number of dimensions, resulting in models that generalize
better when applied to new, unseen data.
Human cognitive processing is limited when it comes to mentally representing and interpreting data
in high-dimensional spaces. PCA enables the transformation of high-dimensional datasets into two-
dimensional (2D) or three-dimensional (3D) visualizations. This technique enhances the visual represen-
tation of data and helps identify patterns, clusters, or relationships that might be difficult to discern in
higher-dimensional spaces.
PCA offers a concise representation of the dataset by retaining a significant portion of the variance
while reducing the number of dimensions. This compression not only reduces storage requirements but
also facilitates more efficient data transmission.
The effectiveness of PCA in dimensionality reduction and data visualisation can be attributed to
its ability to capture the essential characteristics of data in fewer dimensions. By removing irrelevant
information, PCA enhances data presentation, making it more suitable for computer analysis and human
comprehension. It remains an indispensable tool for analysts, data scientists, and statisticians.
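As a short sketch of the visualisation benefit (assuming scikit-learn and its bundled Iris dataset are available), a four-dimensional dataset can be projected onto two principal components and plotted directly:

import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

# Iris: 150 samples with 4 features each, used only as a convenient example
X, labels = load_iris(return_X_y=True)

pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)        # project onto the two leading principal components

plt.scatter(X_2d[:, 0], X_2d[:, 1], c=labels)
plt.xlabel('First principal component')
plt.ylabel('Second principal component')
plt.title('Iris data projected onto two principal components')
plt.show()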

18 5.6 Summary
• This module explores Principal Component Analysis (PCA), a robust technique based on linear
algebra. PCA is used to reduce the dimensionality of data while retaining as much variation as
possible. An in-depth analysis revealed the role of eigenvectors and eigenvalues of the data’s covari-
ance matrix in identifying principal components, which serve as the new axes for the transformed
data.
• PCA plays a significant role in data science by enabling the removal of noise, optimizing computa-
tional resources, and mitigating the risk of model overfitting. A key concept in PCA is eigendecom-
position, which is used to identify the principal directions along which data exhibits the greatest
variability.
• This module highlights the benefits of PCA, including dimensionality reduction, enhanced data
visualisation, and improved data modelling by focusing on the most important aspects of the
dataset. PCA remains a crucial tool in the analytical toolbox.

19 5.7 Keywords
• Principal Component Analysis (PCA): A widely used method in data analysis that transforms
and reduces the dimensions of a dataset while preserving the greatest variation.
• Eigenvectors: Directions in which data variation is maximized, and new axes are defined based
on these directions.
• Eigenvalues: Quantify the extent of variability along the axes defined by the corresponding
eigenvectors.
• Dimensionality Reduction: The process of reducing the number of variables in a dataset to
simplify analysis and visualisation.
• Covariance Matrix: A matrix representing the pairwise covariances between variables, essential
for PCA calculations.
• Data Visualisation: The graphical representation of data, which is strengthened by PCA to help
uncover patterns, distributions, and relationships within the data.

20 5.8 Self-Assessment Questions
1. What is the main objective of Principal Component Analysis (PCA) in data analysis?
2. What do eigenvalues and eigenvectors mean in the context of a dataset, and how do they relate to
PCA?

3. Why is dimensionality reduction important in the context of large datasets, particularly in data
science?
4. Explain the importance of the covariance matrix when using PCA on a dataset.
5. How can PCA improve data visualisation, especially for datasets with many variables?

21 5.9 Case Study


Title: Optimising Feature Representation: A Deep Dive into PCA in Genomic Data
Introduction: The field of genomics, which involves the comprehensive analysis of an organism’s
genetic material, generates substantial amounts of data. Converting and analyzing this data to derive
practical insights is a significant challenge. Principal Component Analysis (PCA) is a crucial tool used
by data scientists to effectively reduce the dimensionality of data, improving computational efficiency
and enhancing visual representation.
Case Study: The GenTech Bioinformatics Laboratory has collected extensive data on genetic vari-
ants across various populations. The dataset contains several variables corresponding to distinct genetic
markers. GenTech analysts are tasked with identifying trends and differences across populations but face
challenges due to the sheer volume of data.
Background: Historically, GenTech relied on manual techniques and basic statistical tools for genetic
data analysis. However, the increasing scale and complexity of the data have made these methods
impractical. Identifying significant genetic variants among populations has become labor-intensive, and
visualizing these variations is difficult due to the complexity of the data.
Your Task: You are a data scientist contracted by GenTech. The objective is to use Principal
Component Analysis (PCA) to reduce the dimensionality of the genetic data. By isolating the most
relevant principal components, you will help GenTech improve the clarity of differences and patterns
across populations.
Questions to Consider:
1. Which genetic markers have the greatest impact on the principal components?

2. What is the optimal number of principal components to retain in order to preserve a substantial
proportion of the original data variance?
3. How does the reduced data visualisation compare to the original high-dimensional data in terms
of clarity and differentiation across populations?

4. What are the limitations or disadvantages of using PCA in genomic data analysis?
Recommendations: PCA is expected to improve data visualisation and accelerate computing op-
erations. However, care must be taken to ensure that significant variation is retained when reducing
dimensions. Although PCA enhances data visualisation, it may not always yield the most biologically
relevant interpretations. Augmenting PCA with domain-specific knowledge in genetics will help derive
more meaningful insights.
Conclusion: The use of modern linear algebra techniques such as Principal Component Analysis
(PCA) has the potential to revolutionize data management practices in bioinformatics laboratories. PCA
provides a feasible solution for consolidating information while minimizing the loss of variability. This
technique enhances data visualisation and speeds up computational processes, though it is important to
balance mathematical convenience with biological relevance to obtain valuable insights.

22 5.10 References
1. Olver, P.J., and Shakiban, C. (2006). Applied Linear Algebra (Vol. 1). Upper Saddle River, NJ: Prentice Hall.
2. Bile Hassan, I., Ghanem, T., Jacobson, D., Jin, S., Johnson, K., Sulieman, D., and Wei, W. (2021,
March). Data science curriculum design: A case study. In Proceedings of the 52nd ACM Technical
Symposium on Computer Science Education (pp. 529-534).
3. Ozdemir, S. (2016). Principles of Data Science. Packt Publishing Ltd.
4. Potters, M., and Bouchaud, J.P. (2020). A First Course in Random Matrix Theory: For Physicists,
Engineers, and Data Scientists. Cambridge University Press.

5. Cooper, S. (2018). Data Science from Scratch: The #1 Data Science Guide for Everything a Data Scientist Needs to Know: Python, Linear Algebra, Statistics, Coding, Applications, Neural Networks, and Decision Trees. Roland Bind.
