1.1 Interpretation of Partial Derivative Values
Within the domain of multivariable calculus, partial derivatives assume fundamental significance in understanding how a function changes with respect to one of its variables while all other variables are held constant. In essence, a partial derivative quantifies this rate of change along a single coordinate direction.
Consider a function f(x, y), which represents the height of a hilly terrain at any given position (x, y). The partial derivative of f with respect to the variable x, represented as ∂f/∂x, indicates the rate of change in height in the x-direction. A positive value signifies an upward change in height while moving in the positive x-direction, whereas a negative value denotes a downward change. Similarly, ∂f/∂y measures the variation in the y-direction.
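As a small numerical illustration (the terrain function h(x, y) and the step size below are arbitrary choices, not taken from the text), partial derivatives can be estimated with central differences:

def h(x, y):
    # Example "terrain height" at position (x, y).
    return 3 * x ** 2 + 2 * x * y + y ** 2

def partial_x(f, x, y, eps=1e-6):
    # Estimate ∂f/∂x at (x, y) with a central difference.
    return (f(x + eps, y) - f(x - eps, y)) / (2 * eps)

def partial_y(f, x, y, eps=1e-6):
    # Estimate ∂f/∂y at (x, y) with a central difference.
    return (f(x, y + eps) - f(x, y - eps)) / (2 * eps)

print(partial_x(h, 1.0, 2.0))   # ≈ 10.0, matching the exact value 6x + 2y at (1, 2)
print(partial_y(h, 1.0, 2.0))   # ≈ 6.0, matching the exact value 2x + 2y at (1, 2)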
The magnitude of partial derivatives is significant for interpretation. A larger magnitude indicates
steeper terrain or a more rapid rate of change, while a smaller magnitude implies a gentler incline.
Furthermore, the sign of a partial derivative provides key insights: a positive value signifies an increase
in the function with respect to the corresponding variable, while a negative value indicates a decrease.
A partial derivative value of zero suggests the possibility of a local maximum, minimum, or saddle point
with respect to that variable.
Partial derivatives are crucial in sensitivity analysis across various domains, including economics,
physics, and machine learning. For instance, in economics, one can study how changes in the price of
a product, holding all other variables constant, affect its demand. Thus, partial derivatives transcend
mathematical abstraction and are pivotal in comprehending the behavior of multivariable functions across
diverse scenarios.
2.1.1 Batch Gradient Descent (BGD)
In this method, the gradient of the entire dataset is computed before updating the parameters. This
means that every iteration involves computing the gradient for all data points, making it computationally
expensive for large datasets.
• Pros: Can converge to the global minimum if the learning rate is chosen correctly.
• Cons: Computationally expensive and slow for large datasets.
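As a concrete illustration (not part of the original text), the sketch below performs a single batch gradient descent update for linear regression with mean squared error, computing the gradient over the entire dataset before touching the parameters; the data, initial weights, and learning rate are arbitrary choices.

import numpy as np

# One batch gradient descent step for linear regression with MSE loss.
X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])              # bias column plus one feature
y = np.array([2.0, 2.5, 3.5])
w = np.zeros(2)                         # parameters [intercept, slope]
alpha = 0.1                             # learning rate

predictions = X @ w                     # uses the FULL dataset
gradient = (2 / len(y)) * X.T @ (predictions - y)   # gradient of the MSE
w = w - alpha * gradient                # one parameter update per full pass
print(w)

Because every update requires a pass over all rows of X, the cost per iteration grows with the dataset size, which is exactly the drawback noted above.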
2.4.2 Python Code:
Below is the Python code demonstrating gradient descent for the function f(x) = x². The learning rate α is set to 0.1, and the algorithm runs for 100 iterations, starting from x = 10. The resulting path of the gradient descent is visualized on the plot.
import numpy as np
import matplotlib.pyplot as plt

def df(x):
    return 2*x    # derivative of the objective f(x) = x^2
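The listing above defines only the derivative; a minimal continuation under the settings stated in the text (α = 0.1, 100 iterations, starting from x = 10) is sketched below, with plotting choices that are illustrative rather than prescribed.

alpha = 0.1                      # learning rate
x = 10.0                         # starting point
path = [x]
for _ in range(100):             # 100 iterations
    x = x - alpha * df(x)        # move against the gradient
    path.append(x)

xs = np.linspace(-11, 11, 200)
plt.plot(xs, xs**2, label="f(x) = x^2")
plt.plot(path, [p**2 for p in path], "ro-", markersize=3, label="descent path")
plt.legend()
plt.show()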
3.1.1 Linear Regression
Gradient Descent is used to minimize the Mean Squared Error (MSE) in linear regression problems. By
updating the weights in the direction of the negative gradient, we find the best-fitting line.
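Concretely, for a line y = w·x + b fitted to N data points (xi, yi), the MSE is (1/N) Σ (yi − (w·xi + b))², with partial derivatives ∂MSE/∂w = −(2/N) Σ xi (yi − (w·xi + b)) and ∂MSE/∂b = −(2/N) Σ (yi − (w·xi + b)); each gradient descent step subtracts α times these quantities from w and b respectively.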
3.3 Conclusion
Gradient Descent is a powerful optimization technique that underpins much of modern machine learning
and AI. Its simplicity, combined with its ability to handle high-dimensional data efficiently, makes it
indispensable in practice. However, the choice of learning rate and handling local minima can be tricky,
and thus variants like mini-batch and adaptive gradient descent methods are often preferred in real-world
applications.
4.2 The Gradient and Direction of Steepest Ascent
The gradient of a function f at a point x = (x1, x2, . . . , xn) is a vector of partial derivatives with respect to each variable:

∇f(x) = (∂f/∂x1, ∂f/∂x2, . . . , ∂f/∂xn)
The gradient points in the direction of the steepest ascent of the function. This means that if we follow
the direction of the gradient, we will increase the value of the function the most rapidly. To minimize f ,
we move in the opposite direction of the gradient, following the direction of steepest descent.
For example, for the quadratic function f(x1, x2) = x1² + x2², whose partial derivatives are 2x1 and 2x2, the gradient descent updates are

x1^new = x1^old − α · 2x1^old
x2^new = x2^old − α · 2x2^old
This update rule moves the values of x1 and x2 towards the minimum of the function.
4.5.1 Python Code Example
The following Python code demonstrates gradient descent applied to the above quadratic function:
import numpy as np
import matplotlib.pyplot as plt
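The listing above shows only the imports. A minimal sketch of the remainder, reusing those imports and assuming the quadratic f(x1, x2) = x1² + x2² with the update rule from Section 4.2, might look as follows; the learning rate, starting point, and iteration count are illustrative choices.

alpha = 0.1                          # illustrative learning rate
x1, x2 = 4.0, -3.0                   # illustrative starting point
path = [(x1, x2)]
for _ in range(50):
    x1 = x1 - alpha * 2 * x1         # x1_new = x1_old - alpha * 2*x1_old
    x2 = x2 - alpha * 2 * x2         # x2_new = x2_old - alpha * 2*x2_old
    path.append((x1, x2))

path = np.array(path)
plt.plot(path[:, 0], path[:, 1], "o-")   # trajectory converging to the minimum at (0, 0)
plt.xlabel("x1")
plt.ylabel("x2")
plt.title("Gradient descent on f(x1, x2) = x1^2 + x2^2")
plt.show()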
• Logistic Regression: Used for classification problems, where the gradient descent algorithm
minimizes the log loss (cross-entropy loss).
• Neural Networks: The backpropagation algorithm uses gradient descent to update the weights
of neurons to minimize the loss function.
• Support Vector Machines (SVMs): Gradient descent can be used to optimize the hinge loss
function in SVMs.
• Reinforcement Learning: Policy gradient methods use gradient descent to optimize the agent’s
policy.
• Exploding/Vanishing Gradients: In deep neural networks, gradients can become either too large or too small, making training unstable or slow. Techniques like Gradient Clipping and Batch Normalization help address these issues; a minimal clipping sketch follows this list.
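As an aside on the last point, a minimal sketch of gradient clipping by global norm (a generic illustration, not tied to any particular deep learning framework) might look like this:

import numpy as np

def clip_by_norm(grad, max_norm):
    # Rescale the gradient if its L2 norm exceeds max_norm.
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        grad = grad * (max_norm / norm)
    return grad

g = np.array([300.0, -400.0])           # an "exploding" gradient with norm 500
print(clip_by_norm(g, max_norm=5.0))    # rescaled to norm 5: [ 3. -4.]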
4.9 Conclusion
Gradient descent, when applied to multivariable functions, is a cornerstone of optimization in machine
learning and deep learning. It allows for efficient parameter updates in high-dimensional spaces, enabling the training of complex models. However, choosing an appropriate learning rate, dealing with
local minima, and managing large datasets are key challenges that require careful handling in practical
applications.
• Local Minimum: The Hessian matrix determinant is positive, and its leading diagonal entries
are positive.
• Local Maximum: The Hessian determinant is positive, but leading diagonal entries are negative.
• Saddle Point: The Hessian determinant is negative. (These conditions are checked on a small example in the sketch below.)
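As an illustrative check of these conditions (assuming SymPy is available; the function is an arbitrary example), the sketch below computes the Hessian of f(x, y) = x² + y² at its critical point (0, 0):

import sympy as sp

x, y = sp.symbols("x y")
f = x**2 + y**2                 # example function with a critical point at (0, 0)
H = sp.hessian(f, (x, y))       # Matrix([[2, 0], [0, 2]])
print(H.det(), H[0, 0])         # determinant 4 > 0 and f_xx = 2 > 0  =>  local minimum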
Global extrema may also occur at the boundaries of the function’s domain. Evaluating function
values at critical points and along the boundary helps identify global maxima and minima.
The concepts of maxima and minima in multivariable calculus are essential in optimizing complex
systems and understanding natural phenomena. By analyzing functions in multidimensional spaces,
one can extract insights concealed within high-dimensional data, enabling advancements in science and
engineering.
6 Application of Gradient Descent in Optimisation Problems
Abstract
Optimisation problems are fundamental across various disciplines, including machine learning, economics,
and engineering. These problems often involve minimising or maximising an objective function. Gradient
descent, an iterative optimisation method, plays a pivotal role in solving such problems, particularly
when analytical solutions are challenging to obtain. This paper explores the mechanism, significance,
and applications of gradient descent in optimisation tasks.
7 Introduction
The core goal of gradient descent is to iteratively update parameters to minimise a cost function. By
computing the gradient of the function with respect to the parameters at a specific point, the algorithm
moves in the opposite direction of the gradient, ensuring a consistent reduction in the function’s value.
Over iterations, this approach converges to the function’s minimum.
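In symbols, for a parameter vector θ, cost function J(θ), and learning rate α, each iteration applies the update θ ← θ − α · ∇J(θ).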
The learning rate, a crucial parameter, determines the step size for each iteration. A high learning
rate risks overshooting the minimum, leading to instability, while a low learning rate results in slow
convergence. Adaptive strategies are often employed to balance these trade-offs.
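For example, for f(x) = x² the update x ← x − α · 2x simplifies to x ← (1 − 2α)x: with α = 0.1 each step multiplies x by 0.8 and the iterates shrink steadily towards the minimum at 0, whereas with α = 1.5 each step multiplies x by −2, so the iterates overshoot and grow without bound.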
10 Conclusion
Gradient descent is a robust, iterative approach for solving complex optimisation problems. Its conceptual simplicity and wide applicability make it a fundamental tool across disciplines.
11 Summary
• Module 4 explored multivariable calculus and its real-world implications.
• Partial derivatives provide insights into a function’s behaviour concerning individual variables.
• Gradient descent leverages these derivatives to iteratively optimise parameters and minimise cost
functions.
• The module highlighted the practical relevance of these concepts in machine learning, economics,
and engineering.
12 Keywords
• Partial Derivatives: Derivatives concerning one variable while keeping others constant.
• Gradient Descent: An optimisation technique to minimise cost functions iteratively.
• Multivariable Calculus: The extension of calculus to functions with multiple variables.
• Maxima and Minima: Points representing the highest or lowest values within a range.
• Optimisation Problems: Mathematical challenges to identify the best solution among feasible
options.
13 Self-Assessment Questions
1. How does a partial derivative in multivariable calculus differ from a conventional derivative in
single-variable calculus?
2. Explain the gradient descent technique and its use in determining a function’s minimum.
3. How are maxima and minima identified in multivariable calculus, and why are they significant?
4. How does multivariable calculus enhance the understanding of gradient descent, particularly in
higher dimensions?
5. What are the potential drawbacks of gradient descent, and why is it nevertheless a preferred method for optimisation problems?
14.2 Background
TechNova’s recommendation system evaluates user data, including browsing behaviour, product ratings,
purchase history, and trends. However, its algorithm struggles to adapt to evolving user behaviour,
necessitating a more sophisticated approach.
14.3 Task
As a data scientist, your goal is to integrate gradient descent into TechNova’s recommendation algorithm
to dynamically adjust weights for improved accuracy.
14.5 Recommendations
• Analyse the limitations of the existing algorithm.
• Use gradient descent to optimise feature weights adaptively (a minimal, hypothetical sketch follows this list).
• Implement mechanisms to avoid overfitting, such as regularisation and adaptive learning rates.
• Employ A/B testing to evaluate improvements in real-time scenarios.
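One way such an adaptive update could be sketched is shown below; every feature, value, and parameter here is hypothetical and serves only to illustrate the idea of nudging recommendation weights by gradient descent on a squared-error objective.

import numpy as np

# Hypothetical illustration only: the features, targets, and parameters are invented.
# Each row holds one user's signals: [browsing_score, rating_signal, purchase_signal].
features = np.array([[0.8, 0.6, 0.1],
                     [0.2, 0.9, 0.7],
                     [0.5, 0.4, 0.9]])
observed = np.array([0.7, 0.8, 0.9])    # observed engagement targets
weights = np.ones(3) / 3                # start from equal feature weights
alpha = 0.05                            # learning rate

for _ in range(200):
    predicted = features @ weights
    gradient = (2 / len(observed)) * features.T @ (predicted - observed)
    weights -= alpha * gradient         # adapt the weights toward observed behaviour

print(weights)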
14.6 Conclusion
Integrating gradient descent can significantly enhance recommendation systems, ensuring personalised
and engaging user experiences.
Next, the eigenvalues and eigenvectors of the covariance matrix are computed. The eigenvectors define
the new orthogonal axes (principal components) capturing the maximum variance in the data, while the
eigenvalues quantify the variance magnitude along these axes. The first principal component corresponds
to the eigenvector with the largest eigenvalue, followed by subsequent components in decreasing order of
eigenvalues.
The original data is then projected onto the eigenvectors through matrix multiplication, reducing
dimensionality by retaining only the components associated with the largest eigenvalues. This projection
transforms the dataset into the principal component space, simplifying its structure while preserving
essential information for further analysis or visualization.
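As a minimal numerical sketch of these steps (the dataset below is random and purely illustrative), the following code centres the data, forms the covariance matrix, performs the eigendecomposition, and projects onto the top two principal components:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                 # illustrative dataset: 100 samples, 5 features

X_centered = X - X.mean(axis=0)               # centre each feature
cov = np.cov(X_centered, rowvar=False)        # 5 x 5 covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)        # eigendecomposition of a symmetric matrix

order = np.argsort(eigvals)[::-1]             # sort by decreasing eigenvalue
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

k = 2
X_pca = X_centered @ eigvecs[:, :k]           # project onto the top-k principal components
explained = eigvals[:k].sum() / eigvals.sum() # fraction of total variance retained
print(X_pca.shape, round(explained, 3))

The explained-variance ratio reported at the end is the usual criterion for deciding how many components to retain.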
From a linear algebra perspective, PCA identifies a new set of basis vectors for the dataset, aligning them
with directions of maximum variance. This approach eliminates redundancy and correlated features,
ensuring that key data characteristics are retained. The mathematical robustness and elegance of PCA
make it a cornerstone in data analysis.
• Data Reduction and Efficiency: PCA reduces the dimensionality of datasets, improving computational efficiency by retaining essential variance and minimizing information loss.
• Overcoming Multicollinearity: By transforming correlated variables into orthogonal principal components, PCA mitigates issues of multicollinearity, enhancing model stability and interpretability.
• Visualization: PCA facilitates visualization of high-dimensional data by reducing it to two or
three dimensions, enabling insightful scatter plots and 3D graphs.
• Improved Model Performance: By addressing the "curse of dimensionality," PCA reduces overfitting in machine learning models, improving their generalization to unseen data.
• Domain-Specific Utility: PCA finds applications across domains, including genetics (population
structure analysis), finance (portfolio construction), and image processing (compression and face
recognition).
PCA bridges the gap between large datasets and simplicity, making it an indispensable tool for
extracting meaningful insights.
Real-world datasets often contain noise, which refers to undesired deviations or random fluctuations in the data. PCA effectively detects and emphasizes the most prominent patterns, known as the principal components, within a dataset.
This process helps in removing unwanted noise, resulting in more meaningful data. By eliminating less
significant features with low variance, PCA condenses the data into a more refined form, potentially
improving model performance.
Reducing the number of features speeds up the training process of machine learning models. Fewer
features lead to decreased computational requirements, facilitating faster training and prediction times.
This efficiency is crucial when handling large datasets, where processing resources and time constraints
are critical.
Using high-dimensional spaces in machine learning algorithms can lead to overfitting. PCA addresses the "curse of dimensionality" by reducing the number of dimensions, resulting in models that generalize better when applied to new, unseen data.
Human cognitive processing is limited when it comes to mentally representing and interpreting data
in high-dimensional spaces. PCA enables the transformation of high-dimensional datasets into two-
dimensional (2D) or three-dimensional (3D) visualizations. This technique enhances the visual representation of data and helps identify patterns, clusters, or relationships that might be difficult to discern in
higher-dimensional spaces.
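A brief sketch of such a two-dimensional visualisation, assuming scikit-learn is available and using its bundled Iris dataset (four features) as an illustrative stand-in for higher-dimensional data:

import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, labels = load_iris(return_X_y=True)        # four-dimensional measurements
X_2d = PCA(n_components=2).fit_transform(X)   # reduce to two principal components

plt.scatter(X_2d[:, 0], X_2d[:, 1], c=labels) # class clusters become visible in 2D
plt.xlabel("First principal component")
plt.ylabel("Second principal component")
plt.show()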
PCA offers a concise representation of the dataset by retaining a significant portion of the variance
while reducing the number of dimensions. This compression not only reduces storage requirements but
also facilitates more efficient data transmission.
The effectiveness of PCA in dimensionality reduction and data visualisation can be attributed to
its ability to capture the essential characteristics of data in fewer dimensions. By removing irrelevant
information, PCA enhances data presentation, making it more suitable for computer analysis and human
comprehension. It remains an indispensable tool for analysts, data scientists, and statisticians.
5.6 Summary
• This module explores Principal Component Analysis (PCA), a robust technique based on linear
algebra. PCA is used to reduce the dimensionality of data while retaining as much variation as
possible. An in-depth analysis revealed the role of eigenvectors and eigenvalues of the data’s covariance matrix in identifying principal components, which serve as the new axes for the transformed
data.
• PCA plays a significant role in data science by enabling the removal of noise, optimizing computational resources, and mitigating the risk of model overfitting. A key concept in PCA is eigendecomposition, which is used to identify the principal directions along which data exhibits the greatest
variability.
• This module highlights the benefits of PCA, including dimensionality reduction, enhanced data
visualisation, and improved data modelling by focusing on the most important aspects of the
dataset. PCA remains a crucial tool in the analytical toolbox.
5.7 Keywords
• Principal Component Analysis (PCA): A widely used method in data analysis that transforms
and reduces the dimensions of a dataset while preserving the greatest variation.
• Eigenvectors: Directions in which data variation is maximized, and new axes are defined based
on these directions.
• Eigenvalues: Quantify the extent of variability along the axes defined by the corresponding
eigenvectors.
• Dimensionality Reduction: The process of reducing the number of variables in a dataset to
simplify analysis and visualisation.
• Covariance Matrix: A matrix representing the pairwise covariances between variables, essential
for PCA calculations.
• Data Visualisation: The graphical representation of data, which is strengthened by PCA to help
uncover patterns, distributions, and relationships within the data.
5.8 Self-Assessment Questions
1. What is the main objective of Principal Component Analysis (PCA) in data analysis?
2. What do eigenvalues and eigenvectors mean in the context of a dataset, and how do they relate to
PCA?
3. Why is dimensionality reduction important in the context of large datasets, particularly in data
science?
4. Explain the importance of the covariance matrix when using PCA on a dataset.
5. How can PCA improve data visualisation, especially for datasets with many variables?
2. What is the optimal number of principal components to retain in order to preserve a substantial
proportion of the original data variance?
3. How does the reduced data visualisation compare to the original high-dimensional data in terms
of clarity and differentiation across populations?
4. What are the limitations or disadvantages of using PCA in genomic data analysis?
Recommendations: PCA is expected to improve data visualisation and accelerate computing operations. However, care must be taken to ensure that significant variation is retained when reducing
dimensions. Although PCA enhances data visualisation, it may not always yield the most biologically
relevant interpretations. Augmenting PCA with domain-specific knowledge in genetics will help derive
more meaningful insights.
Conclusion: The use of modern linear algebra techniques such as Principal Component Analysis
(PCA) has the potential to revolutionize data management practices in bioinformatics laboratories. PCA
provides a feasible solution for consolidating information while minimizing the loss of variability. This
technique enhances data visualisation and speeds up computational processes, though it is important to
balance mathematical convenience with biological relevance to obtain valuable insights.