
Calculus for Machine Learning

Daniel O’Connor
Contents

Preface

Part 1. Calculus

Chapter 1. Rate of change
1.1. Geometric interpretation of the derivative
1.2. The fundamental strategy of calculus
1.3. Warning: Sometimes the derivative does not exist

Chapter 2. Formulas for the derivative
2.1. Derivative of a constant function
2.2. Derivative of a sum
2.3. Derivative of f(x) = cg(x)
2.4. The product rule
2.5. The chain rule
2.6. Derivative of bˣ
2.7. Derivative of log(x)
2.8. The power rule

Part 2. Vector calculus and linear algebra

Chapter 3. Points and vectors
3.1. Method 1: The point picture
3.2. Method 2: The vector picture
3.3. Vector operations

Chapter 4. The gradient vector
4.1. The directional derivative
4.2. Partial derivatives
4.3. Newton’s approximation for partial derivatives
4.4. Newton’s approximation when f : Rⁿ → R
4.5. A formula for directional derivatives
4.6. The direction of steepest ascent

Chapter 5. The Jacobian matrix

Chapter 6. Matrix multiplication
6.1. A matrix wants to operate on a vector
6.2. Some useful rules of arithmetic
6.3. Another perspective on matrix-vector multiplication
6.4. Multiplying a matrix by a matrix
6.5. When multiplying matrices, order matters
6.6. Conventions about column vectors and row vectors
6.7. Transposing matrices
6.8. Matrix addition
6.9. Additional exercises

Chapter 7. The chain rule

Chapter 8. Minimizing a function

Appendix A. Algebra review
A.1. FOIL
A.2. Difference of squares

Appendix B. The equation of a line


Preface

Part 1

Calculus
CHAPTER 1

Rate of change

What one fool can do, so can another.

An “ancient Simian proverb” appearing in
Calculus Made Easy and sometimes
repeated by Richard Feynman

The name “calculus” doesn’t tell you what the subject is
about, so here it is: the main idea of calculus is instantaneous rate
of change. Everyone who has read a speedometer understands
this concept intuitively. If you are driving a car, you might see
the speedometer steadily increase from 0 meters per second to
30 meters per second. Along the way, there was a single instant
in time at which your speed was 20 meters per second.
Before writing down a precise definition of “instantaneous
rate of change”, let’s first review the concept of average rate of
change. Let f : R → R be a function. For example, f could be
the function that takes as input the number t of seconds that
have passed since you began driving, and returns as output the
number f (t) of meters that the car has traveled during this time.
(This function f is like an odometer — it tells you the total
distance that you have traveled so far.) The average velocity of
the car during the time interval from a time t0 to a later time t is
$$\text{average velocity} = \frac{\text{distance traveled}}{\text{time elapsed}} = \frac{f(t) - f(t_0)}{t - t_0}.$$
For example, suppose that at time t0 = 30 the odometer reports
that the car has traveled a total distance of 120 meters, so
that f (30) = 120, and suppose also that two seconds later the
odometer reports a total distance traveled of 160 meters, so
that f (32) = 160. Then the car has traveled f (32) − f (30) =
160 − 120 = 40 meters during the two second time interval from
time t0 = 30 to time t = 32. The average velocity of the car
during this time interval is (160 − 120)/(32 − 30) = 40/2 = 20
meters per second.
Now imagine that the time interval from t0 to t is very short,
so that time t is only a split second later than time t0 . Then
the average velocity during the time interval from t0 to t is a
good approximation to the instantaneous velocity of the car at
time t0 . Even better approximations can be obtained by taking
t to be closer and closer to t0 . In fact, we can approximate the
instantaneous velocity as closely as we like by taking t to be
sufficiently close to t0 . We express this fact concisely by saying
that the instantaneous velocity at time t0 is equal to the limit
as t approaches t0 of the average velocity (f (t) − f (t0 ))/(t − t0 ).
To save writing, the instantaneous velocity at time t0 is denoted
f′(t0). In summary:

$$f'(t_0) = \lim_{t \to t_0} \frac{f(t) - f(t_0)}{t - t_0}. \tag{1.1}$$

This idea of taking a limit is illustrated with a numerical example in table 1.
The number f′(t0) has been given an undescriptive and
unnecessarily intimidating name: it is called the derivative of
the function f at t0. The function f′, which takes a number
t0 as input and returns the number f′(t0) as output, is called
the derivative of f. Another common notation for f′ is df/dt. The
notation df/dt arguably has more mnemonic power, as it reminds
us that if t is close to t0 then

$$f'(t_0) \approx \frac{\Delta f}{\Delta t},$$
t        f(t)        (f(t) − f(t0))/(t − t0)
2.1      4.41        4.1
2.01     4.0401      4.01
2.001    4.004001    4.001

Table 1. Here we illustrate the idea of taking a limit
in equation (1.1). As t approaches t0, the average
rate of change approaches the instantaneous rate of
change. In this example, f(t) = t² and t0 = 2. As
we try out values of t that get closer and closer to t0,
we observe that the average rate of change (shown in
the rightmost column) appears to be getting closer
and closer to 4. This suggests that f′(2) = 4.

where ∆t = t − t0 is a small change in the value of the input to f
and ∆f = f(t) − f(t0) is the corresponding change in the value
of the output.
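The limiting process in table 1 is easy to reproduce. The following Python sketch (added here as an illustration; the text itself contains no code) computes the difference quotient ∆f/∆t for f(t) = t² at t0 = 2:

```python
def f(t):
    return t ** 2  # the example function from table 1

t0 = 2.0
for dt in [0.1, 0.01, 0.001]:
    t = t0 + dt
    avg_rate = (f(t) - f(t0)) / (t - t0)  # average rate of change over [t0, t]
    print(t, avg_rate)
```

The printed average rates approach 4 as ∆t shrinks, matching the rightmost column of table 1.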
In the above discussion, we used a moving car as an example,
but f can be any function that takes a real number as input
and returns a real number as output. A car’s velocity is only
one example of the concept of instantaneous rate of change of
a quantity. And while we used the letter t as a name for the
function input, we may of course use any letter we want, such
as x. We would then write equation (1.1) as

$$f'(x_0) = \lim_{x \to x_0} \frac{f(x) - f(x_0)}{x - x_0}. \tag{1.2}$$

Example 1.1. To make this discussion more concrete, let’s
compute the derivative of the function f(x) = x². We must
evaluate the limit on the right in equation (1.2). Notice that if
x ≠ x0 then

$$\frac{f(x) - f(x_0)}{x - x_0} = \frac{x^2 - x_0^2}{x - x_0} = \frac{(x - x_0)(x + x_0)}{x - x_0} = x + x_0.$$

Clearly, x + x0 approaches x0 + x0 = 2x0 as x approaches x0.
This shows that

$$f'(x_0) = 2x_0. \tag{1.3}$$

For example, f′(2) = 2 · 2 = 4. This confirms the result that we
observed in table 1.
One of the main goals of a calculus course is to derive a large
number of rules like this for computing the derivatives of specific
functions. We will derive more such rules in chapter 2.
Exercise 1.2. Derive formulas for the derivatives of the
functions x, x³, and 1/x.

1.1. Geometric interpretation of the derivative


There is a nice geometric interpretation of the derivative that
is illustrated in figure 1. If x is close to x0 , then (x0 , f (x0 )) and
(x, f (x)) are nearby points on the graph of f . The line connecting
these two points, shown in figure 1a, is called a “secant line”.
The slope of this line is given by the formula slope = rise/run, and
as can be seen in figure 1a, the run is x − x0 and the rise is
f (x) − f (x0 ). Thus:
$$\text{slope of secant line} = \frac{\text{rise}}{\text{run}} = \frac{f(x) - f(x_0)}{x - x_0}.$$
As x approaches x0 , the point (x, f (x)) approaches the point
(x0 , f (x0 )), and the slope of the secant line approaches the slope
of the “tangent line” that is shown in figure 1b. So, the slope of
the tangent line in figure 1b is
$$\text{slope of tangent line} = \lim_{x \to x_0} \frac{f(x) - f(x_0)}{x - x_0}.$$
The quantity on the right is none other than the derivative f′(x0).
We have found our geometric interpretation of the derivative:

The derivative is the slope of the tangent line.

1.2. The fundamental strategy of calculus


The definition of the derivative (equation (1.1)) tells us that
the approximation
$$f'(x_0) \approx \frac{f(x) - f(x_0)}{x - x_0} \tag{1.4}$$
becomes more and more accurate as we select values of x that
are closer and closer to x0 , and that any desired level of accuracy
can be obtained by restricting x to be sufficiently close to x0 .
Visually, we are approximating the slope of the tangent line by
computing the slope of a secant line. Multiplying both sides
of (1.4) by x − x0 , we obtain

$$f(x) \approx f(x_0) + f'(x_0)(x - x_0). \tag{1.5}$$

The approximation is good when x is close to x0 .


Equation (1.5) is called “Newton’s approximation”, and it is
extremely useful for the following reason: Although f itself might
be a complicated nonlinear function, f can be approximated
accurately by the very simple linear function L(x) = f(x0) +
f′(x0)(x − x0) which appears on the right in equation (1.5).
The graph of L is a straight line that passes through the point
(x0, f(x0)) and has slope f′(x0). In other words, the graph of L
is the tangent line shown in figure 1b. The function L is called
the linear approximation to f near x0.
The fundamental strategy of calculus is to replace f (which
is difficult to work with) with a linear approximation to f (which
is easy to work with). When we do this, whatever calculations
we want to perform are greatly simplified, and often the approximation
is accurate enough that the result of the calculation is
useful. This strategy is used again and again throughout calculus,
and it makes calculus easy. It is the key to calculus. Newton’s
approximation (1.5) should be internalized so that you can use
it effortlessly and reflexively.

Figure 1. The derivative is the slope of the tangent line.
(a) The line connecting the points (x0, f(x0)) and (x, f(x)) is
called a “secant line”. Its slope is rise/run, where the run is
x − x0 and the rise is f(x) − f(x0), so the slope is
(f(x) − f(x0))/(x − x0). (b) As x approaches x0, the point
(x, f(x)) approaches the point (x0, f(x0)), and the slope of the
secant line approaches the slope of the tangent line. Thus, the
slope of the tangent line is lim_{x→x0} (f(x) − f(x0))/(x − x0).
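To see Newton’s approximation in action, here is a small Python check (an illustration added here, not part of the text) comparing f(x) = x² with its linear approximation L(x) = f(x0) + f′(x0)(x − x0) at x0 = 2, using f′(x0) = 2x0 from equation (1.3):

```python
def f(x):
    return x ** 2

x0 = 2.0
fprime_x0 = 2 * x0  # equation (1.3): the derivative of x**2 at x0

def L(x):
    # Newton's approximation (1.5): the tangent line at x0
    return f(x0) + fprime_x0 * (x - x0)

for x in [3.0, 2.1, 2.01]:
    print(x, f(x), L(x))  # L(x) becomes more accurate as x approaches x0
```

At x = 2.1 the error is 0.01; at x = 2.01 it is only 0.0001, illustrating how quickly the approximation improves near x0.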
Suppose that you are driving and at a particular moment
the speedometer reads 20 meters per second, and you are asked
to estimate how far the car travels during the next two seconds.
Even if you don’t know calculus, you will estimate 20 × 2 = 40
meters. You already use the approximation (1.5) even if you do
not realize it. It is so intuitive that a child would give the same
answer.

1.3. Warning: Sometimes the derivative does not exist


The geometric interpretation of the derivative helps us to
understand visually something that can go wrong when computing
f′(x0). Figure 2a shows the graph of the ramp function f
defined by
$$f(x) = \begin{cases} x & \text{if } x \ge 0, \\ 0 & \text{if } x < 0. \end{cases}$$
This ramp function is also called the “ReLU” function (yet
another undescriptive name), and it plays an important role
in neural networks. For this example, let x0 = 0 (and note
that f(x0) = 0). As shown in figure 2a, if x > x0, then the
slope of the secant line connecting (x0, f(x0)) and (x, f(x)) is
(f(x) − f(x0))/(x − x0) = (x − 0)/(x − 0) = 1. Thus, as x approaches x0 from the right,
the slope of the secant line approaches 1:

$$\lim_{x \searrow x_0} \frac{f(x) - f(x_0)}{x - x_0} = 1.$$
The symbol ↘ indicates that x approaches x0 from the right (in
other words, x decreases towards x0).
However, we get a different result if x approaches x0 from the
left. As shown in figure 2b, if x < x0, then the slope of the secant
line connecting (x0, f(x0)) and (x, f(x)) is
(f(x) − f(x0))/(x − x0) = (0 − 0)/(x − 0) = 0.
So, as x approaches x0 from the left, the slope of the secant line
approaches 0:

$$\lim_{x \nearrow x_0} \frac{f(x) - f(x_0)}{x - x_0} = 0.$$
The fact that the slope of the secant line approaches different
values depending on whether x approaches x0 from the right or
from the left means that in this example (f(x) − f(x0))/(x − x0) simply does
not have a unique limit as x approaches x0:

$$\lim_{x \to x_0} \frac{f(x) - f(x_0)}{x - x_0} \ \text{does not exist.}$$

The function f does not have a derivative at 0.
Imagine that a car is driving along at 10 meters per second,
and then at time t = 30 the car’s velocity magically jumps to 20
meters per second (perhaps due to a glitch in the matrix). What
is the car’s instantaneous velocity at time t = 30? There is no
correct answer. Both values 10 meters per second and 20 meters
per second would be equally valid. The car simply does not have
a velocity at that instant in time.
When we make the statement

$$\lim_{x \to x_0} \frac{f(x) - f(x_0)}{x - x_0} = L,$$

we insist that (f(x) − f(x0))/(x − x0) approaches the same limiting value L
no matter what path x follows as x approaches x0. If that is not
the case, then the statement is not true, and f′(x0) simply does
not exist.
Before we compute the derivative of a function, we should be
careful to first check that the derivative exists. But don’t worry,
most functions we encounter have perfectly smooth graphs, with
no sharp corners where the tangent line is not well defined.
Here is a bit more terminology. The process of computing
the derivative of a function is called “differentiation”. If f has
a derivative at x0 , then f is said to be “differentiable” at x0 .
We have seen in this section that the ramp function is not dif-
ferentiable at 0. However, the ramp function is differentiable
everywhere else.
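The two one-sided limits above are easy to check numerically. This Python sketch (an added illustration, not from the text) computes the slope of the secant line through (0, f(0)) for values of x on each side of 0:

```python
def relu(x):
    # the ramp ("ReLU") function
    return x if x >= 0 else 0.0

x0 = 0.0
for x in [0.1, 0.001, -0.001, -0.1]:
    slope = (relu(x) - relu(x0)) / (x - x0)
    print(x, slope)  # 1.0 from the right, 0.0 from the left
```

The secant slopes do not approach a single value, which is exactly why the derivative at 0 does not exist.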
Figure 2. The ramp function does not have a derivative
at x0 = 0. The derivative is supposed to be the slope of the
tangent line, but there is not a unique tangent line at this point.
(a) If x > x0, then the slope of the secant line connecting
(x0, f(x0)) and (x, f(x)) is 1. Thus, lim_{x↘x0} (f(x) − f(x0))/(x − x0) = 1.
(b) However, if x < x0, then the slope of the secant line
is 0. Thus, lim_{x↗x0} (f(x) − f(x0))/(x − x0) = 0 ≠ 1.
CHAPTER 2

Formulas for the derivative

In this chapter we will discover some useful rules for computing
derivatives. In the process we will see our first examples of
the fundamental strategy of calculus in action.

2.1. Derivative of a constant function


Suppose that f is a constant function. In other words, there
is some number c such that f (x) = c for all possible values of x.
Then, if x ≠ x0, we have

$$\frac{f(x) - f(x_0)}{x - x_0} = \frac{c - c}{x - x_0} = 0.$$

It follows that

$$f'(x_0) = \lim_{x \to x_0} \frac{f(x) - f(x_0)}{x - x_0} = \lim_{x \to x_0} 0 = 0.$$

In words:

The derivative of a constant function is 0.

If a car’s position is not changing, then its velocity is 0.

2.2. Derivative of a sum


Suppose that

$$f(x) = g(x) + h(x),$$

and both g and h are differentiable at x0. If x ≠ x0 then

$$\frac{f(x) - f(x_0)}{x - x_0} = \frac{g(x) + h(x) - (g(x_0) + h(x_0))}{x - x_0} = \underbrace{\frac{g(x) - g(x_0)}{x - x_0}}_{\text{approaches } g'(x_0)} + \underbrace{\frac{h(x) - h(x_0)}{x - x_0}}_{\text{approaches } h'(x_0)}.$$

As x approaches x0, the first term on the right approaches g′(x0)
and the second term on the right approaches h′(x0). Thus,

$$f'(x_0) = g'(x_0) + h'(x_0).$$
In words:
The derivative of a sum is the sum of the derivatives.

Example 2.1. If

$$f(x) = \underbrace{1}_{g(x)} + \underbrace{x^2}_{h(x)}$$

then

$$f'(x_0) = \underbrace{0}_{g'(x_0)} + \underbrace{2x_0}_{h'(x_0)} = 2x_0.$$

2.3. Derivative of f (x) = cg(x)


Suppose that there is a number c such that
$$f(x) = cg(x)$$

for all real numbers x, and that g is differentiable at x0. If x ≠ x0
then

$$\frac{f(x) - f(x_0)}{x - x_0} = \frac{cg(x) - cg(x_0)}{x - x_0} = c\underbrace{\left(\frac{g(x) - g(x_0)}{x - x_0}\right)}_{\text{approaches } g'(x_0)}.$$
It follows that

$$f'(x_0) = cg'(x_0).$$

Example 2.2. If

$$f(x) = \overbrace{5}^{c}\, \underbrace{x^2}_{g(x)},$$

then

$$f'(x_0) = 5 \cdot \underbrace{(2x_0)}_{g'(x_0)} = 10x_0.$$

2.4. The product rule


Suppose that

$$f(x) = g(x)h(x),$$

and that g and h are differentiable at x0. If x ≠ x0 then

$$\frac{f(x) - f(x_0)}{x - x_0} = \frac{g(x)h(x) - g(x_0)h(x_0)}{x - x_0}. \tag{2.1}$$
We invoke the fundamental strategy of calculus to simplify the
expression on the right. Using the approximations

$$g(x) \approx g(x_0) + g'(x_0)(x - x_0)$$
$$h(x) \approx h(x_0) + h'(x_0)(x - x_0),$$

we obtain

$$g(x)h(x) \approx \bigl(g(x_0) + g'(x_0)(x - x_0)\bigr)\bigl(h(x_0) + h'(x_0)(x - x_0)\bigr)$$
$$= g(x_0)h(x_0) + g'(x_0)h(x_0)(x - x_0) + g(x_0)h'(x_0)(x - x_0) + g'(x_0)h'(x_0)(x - x_0)^2.$$

It follows that

$$\frac{g(x)h(x) - g(x_0)h(x_0)}{x - x_0} \approx g'(x_0)h(x_0) + g(x_0)h'(x_0) + \underbrace{g'(x_0)h'(x_0)(x - x_0)}_{\text{approaches } 0}.$$

As x approaches x0, the final term on the right approaches 0.
We discover that

$$f'(x_0) = g'(x_0)h(x_0) + g(x_0)h'(x_0).$$

This rule is known as the “product rule”.
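The product rule can be sanity-checked numerically. In this Python sketch (an added illustration; the helper `num_deriv` and the choice of factors are my own, not from the text), a central-difference estimate of the derivative of f = g·h is compared against g′h + gh′:

```python
def num_deriv(f, x0, h=1e-6):
    # central-difference approximation to f'(x0)
    return (f(x0 + h) - f(x0 - h)) / (2 * h)

g = lambda x: x ** 2
k = lambda x: x ** 3          # second factor (using k to avoid clashing with h)
f = lambda x: g(x) * k(x)     # f(x) = x**5

x0 = 1.5
product_rule = num_deriv(g, x0) * k(x0) + g(x0) * num_deriv(k, x0)
direct = num_deriv(f, x0)
print(direct, product_rule)   # the two estimates agree closely
```

Both values match the exact derivative 5·x0⁴ to several decimal places.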


Exercise 2.3. Use the product rule to compute the derivative
of the function f(x) = x³.

Solution: Notice that f(x) = g(x)h(x), where g(x) = x and
h(x) = x². We saw earlier (in equation (1.3)) that h′(x0) = 2x0,
and from Exercise 1.2 we know that g′(x0) = 1. The product
rule tells us that

$$f'(x_0) = \underbrace{1}_{g'(x_0)} \cdot \underbrace{x_0^2}_{h(x_0)} + \underbrace{x_0}_{g(x_0)} \cdot \underbrace{2x_0}_{h'(x_0)} = 3x_0^2.$$

Exercise 2.4. Use the product rule and the result of the
previous exercise to compute the derivative of the function f(x) =
x⁴. Conjecture a formula for the derivative of f(x) = xⁿ, where
n is a nonnegative integer.

Solution: We can use the same approach again, writing f(x) =
g(x)h(x) where g(x) = x and h(x) = x³. The product rule tells
us that

$$f'(x_0) = 1 \cdot x_0^3 + x_0 \cdot (3x_0^2) = 4x_0^3.$$

The pattern is now clear. The derivative of f(x) = xⁿ, where n
is any nonnegative integer, is

$$f'(x_0) = nx_0^{n-1}. \tag{2.2}$$
Exercise 2.5. Assume that the functions g and h are differentiable
at x0, and furthermore that h(x0) ≠ 0. Let f be the
function defined by

$$f(x) = \frac{g(x)}{h(x)}$$

for all numbers x such that h(x) ≠ 0. Use the product rule to
compute the derivative f′(x0).

Solution: Notice that

$$f(x)h(x) = g(x). \tag{2.3}$$

Differentiating both sides of (2.3), and using the product rule to
compute the derivative of the left-hand side, we obtain

$$f'(x_0)h(x_0) + f(x_0)h'(x_0) = g'(x_0)$$
$$\implies f'(x_0)h(x_0) + \frac{g(x_0)}{h(x_0)}\,h'(x_0) = g'(x_0)$$
$$\implies f'(x_0)h(x_0)^2 + g(x_0)h'(x_0) = h(x_0)g'(x_0)$$
$$\implies f'(x_0) = \frac{h(x_0)g'(x_0) - g(x_0)h'(x_0)}{h(x_0)^2}. \tag{2.4}$$

This formula is known as the quotient rule.
Exercise 2.6. Use the quotient rule (2.4) and formula (2.2)
to compute the derivative of the function f(x) = 1/xᵐ, where m
is a positive integer.

Solution: Assume that x0 ≠ 0 (otherwise f is not defined at x0).
The quotient rule with g(x) = 1 and h(x) = xᵐ tells us that

$$f'(x_0) = \frac{\overbrace{x_0^m}^{h(x_0)} \cdot \overbrace{0}^{g'(x_0)} - \overbrace{1}^{g(x_0)} \cdot \overbrace{mx_0^{m-1}}^{h'(x_0)}}{\underbrace{(x_0^m)^2}_{h(x_0)^2}} = -mx_0^{-m-1}.$$

This shows that the formula (2.2) holds also when n is negative.
2.5. The chain rule


One important way to build a new function out of simpler
functions is to take the output of one function and plug it in as
input to another function. Suppose that

$$f(x) = g(h(x))$$

and that h is differentiable at x0 and g is differentiable at h(x0).
The function f is said to be the “composition” of g and h. For
example, if h(x) = 1 + x² and g(y) = 1/y, then f(x) = g(h(x)) =
1/(1 + x²).

If x ≠ x0 then

$$\frac{f(x) - f(x_0)}{x - x_0} = \frac{g(h(x)) - g(h(x_0))}{x - x_0}.$$
We will use the fundamental strategy to simplify the expression
on the right. If x is very close to x0, then the point y = h(x) is
close to the point y0 = h(x0). We plug these values of y and y0
into Newton’s approximation

$$g(y) \approx g(y_0) + g'(y_0)(y - y_0)$$

to obtain

$$g(h(x)) \approx g(h(x_0)) + g'(h(x_0))(h(x) - h(x_0)),$$

which implies that

$$\frac{g(h(x)) - g(h(x_0))}{x - x_0} \approx g'(h(x_0))\underbrace{\left(\frac{h(x) - h(x_0)}{x - x_0}\right)}_{\text{approaches } h'(x_0)}.$$

As x approaches x0, the quotient on the right approaches h′(x0).
We discover that

$$f'(x_0) = g'(h(x_0))h'(x_0).$$

This fact is known as the chain rule.
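Using the example from the text, h(x) = 1 + x² and g(y) = 1/y, the chain rule can be checked numerically. This Python sketch is an added illustration (the `num_deriv` helper is my own, not from the text):

```python
def num_deriv(f, x0, h=1e-6):
    # central-difference approximation to f'(x0)
    return (f(x0 + h) - f(x0 - h)) / (2 * h)

h_fn = lambda x: 1 + x ** 2        # inner function from the example above
g_fn = lambda y: 1 / y             # outer function
f_fn = lambda x: g_fn(h_fn(x))     # the composition: 1 / (1 + x**2)

x0 = 0.7
chain = num_deriv(g_fn, h_fn(x0)) * num_deriv(h_fn, x0)  # g'(h(x0)) * h'(x0)
direct = num_deriv(f_fn, x0)
print(direct, chain)  # the two estimates agree closely
```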



2.6. Derivative of bˣ

Suppose that f(x) = bˣ, where b > 1 is a number. If x ≠ x0,
then

$$\frac{f(x) - f(x_0)}{x - x_0} = \frac{b^x - b^{x_0}}{x - x_0} = b^{x_0}\left(\frac{b^{x - x_0} - 1}{x - x_0}\right).$$

The term b^{x_0} does not depend on x. Thus,

$$f'(x_0) = \lim_{x \to x_0} \frac{f(x) - f(x_0)}{x - x_0} = b^{x_0} \cdot \underbrace{\lim_{x \to x_0} \frac{b^{x - x_0} - 1}{x - x_0}}_{\text{annoying number}}.$$

The “annoying number” on the right needs a name, so for the
moment let’s call it c. The expression for c might look unpleasant,
but keep in mind that c is just some number, and we can
approximate c by plugging in a particular value of x that is very
close to x0.
It may seem that we are stuck with this unpleasant number,
but there is a ray of hope: the value of c depends on the value of
b. This raises a question: Is it possible to find a special value of b
such that c = 1? It would be great if this were the case, because
then c would vanish from sight and we would have the following
simple and neat result:

$$f(x_0) = b^{x_0}, \qquad f'(x_0) = b^{x_0}$$

for every real number x0. In other words, the function f would
be equal to its own derivative.
In fact, the answer to the question is yes: there is a special
value of b for which the number c turns out to be equal to 1.
Let’s find this special value of b. Assume that b is chosen so that
c = 1. Select a value of x that is very close to x0 , and to save
∆x        (1 + ∆x)^(1/∆x)
.1        2.5937424
.01       2.7048138
.001      2.7169239
.0001     2.7181459

Table 1. As ∆x approaches 0, the quantity (1 +
∆x)^(1/∆x) approaches e ≈ 2.718.

writing let ∆x = x − x0. Then

$$\frac{b^{\Delta x} - 1}{\Delta x} \approx 1 \implies b^{\Delta x} \approx 1 + \Delta x \implies b \approx (1 + \Delta x)^{1/\Delta x}.$$

The approximation improves as ∆x = x − x0 approaches 0. So
we have found that

$$b = \lim_{\Delta x \to 0} (1 + \Delta x)^{1/\Delta x}.$$

This special value of b is universally known as “e”, or Euler’s
number. The number e is a fundamental mathematical constant,
like π. We estimate the value of e in table 1, where we see that
e ≈ 2.718. In conclusion: if f(x) = eˣ, then

$$f'(x_0) = e^{x_0}.$$

The function eˣ is called the “exponential function”, and thanks
to this special property it is arguably the most important function
in math:

The exponential function is its own derivative.
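Both facts above are easy to check numerically. This Python sketch (added as an illustration, not part of the text) reproduces the limit from table 1 and verifies that eˣ is, to numerical precision, its own derivative:

```python
import math

# The limit defining e, as in table 1 of this chapter:
for dx in [0.1, 0.01, 0.001, 0.0001]:
    print(dx, (1 + dx) ** (1 / dx))  # approaches e = 2.71828...

# e**x is (numerically) its own derivative:
x0, h = 1.0, 1e-6
num = (math.exp(x0 + h) - math.exp(x0 - h)) / (2 * h)  # secant-slope estimate
print(num, math.exp(x0))
```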

Exercise 2.7. Compute the derivative of the function f(x) = e⁻ˣ.
Solution: Note that f(x) = g(h(x)) where h(x) = −x and
g(y) = eʸ. The chain rule tells us that

$$f'(x_0) = \underbrace{e^{-x_0}}_{g'(h(x_0))} \cdot \underbrace{(-1)}_{h'(x_0)} = -e^{-x_0}.$$

Exercise 2.8. The sigmoid function S is defined by

$$S(x) = \frac{e^x}{1 + e^x}.$$

Its graph is shown in figure 3.

Figure 3. The sigmoid function S(x) = eˣ/(1 + eˣ).

The output of the sigmoid function
is always between 0 and 1, which makes the sigmoid function


useful for estimating probabilities (which are required to be
between 0 and 1). For this reason, the sigmoid function plays
an important role in machine learning, for example in logistic
regression and neural networks. (A typical application might be
computing the probability that an email is spam.)
Use the quotient rule to compute the derivative S′(x0).
Solution: The quotient rule tells us that

$$S'(x_0) = \frac{(1 + e^{x_0})e^{x_0} - e^{x_0} \cdot e^{x_0}}{(1 + e^{x_0})^2} = \frac{e^{x_0}}{(1 + e^{x_0})^2}.$$

We are done, but if we notice that

$$1 - S(x) = \frac{1 + e^x}{1 + e^x} - \frac{e^x}{1 + e^x} = \frac{1}{1 + e^x},$$

then our formula for S′(x0) can be written as

$$S'(x_0) = \frac{e^{x_0}}{1 + e^{x_0}} \cdot \frac{1}{1 + e^{x_0}} = S(x_0)(1 - S(x_0)).$$
This formula is convenient because it shows that if we have
already evaluated S(x0), then hardly any additional arithmetic
operations are required to evaluate S′(x0). We can compute
S′(x0) very efficiently.
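The identity S′(x0) = S(x0)(1 − S(x0)) can be verified numerically. This Python sketch is an added illustration, not part of the text:

```python
import math

def S(x):
    # the sigmoid function
    return math.exp(x) / (1 + math.exp(x))

x0, h = 0.3, 1e-6
numerical = (S(x0 + h) - S(x0 - h)) / (2 * h)  # secant-slope estimate of S'(x0)
formula = S(x0) * (1 - S(x0))                  # the formula derived above
print(numerical, formula)  # the two values agree closely
```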

2.7. Derivative of log(x)


The defining property of the natural logarithm function
f(x) = log(x) is

$$e^{f(x)} = x \tag{2.5}$$

for every positive number x. Differentiating both sides of (2.5),
and using the chain rule to compute the derivative of the left-hand
side, we find that

$$e^{f(x)}f'(x) = 1 \implies xf'(x) = 1 \implies f'(x) = \frac{1}{x}.$$
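The conclusion f′(x) = 1/x can be checked with secant slopes. This short Python sketch is an added illustration, not part of the text:

```python
import math

# Compare a secant-slope estimate of log'(x0) with 1/x0:
h = 1e-6
for x0 in [0.5, 1.0, 2.0]:
    numerical = (math.log(x0 + h) - math.log(x0 - h)) / (2 * h)
    print(x0, numerical, 1 / x0)  # the last two columns agree closely
```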
2.8. The power rule


Let r be any real number and let f be the function defined
by f(x) = xʳ. (Here x can be any positive real number.) To
derive a formula for the derivative f′(x0), we first observe that

$$f(x) = (e^{\log(x)})^r = e^{r\log(x)} = e^{h(x)},$$

where h(x) = r log(x). The chain rule tells us that

$$f'(x) = e^{h(x)}h'(x) = x^r \cdot \frac{r}{x} = rx^{r-1}.$$

This formula is known as the power rule. As we have seen
previously (in exercises 2.4 and 2.6), if r is an integer then there
is no need to restrict x to be positive. (If r is not an integer, then
xʳ might not even be defined when x is negative. For example,
(−1)^{1/2} = √−1 is not a real number.)

Exercise 2.9. Use Newton’s approximation to estimate √50.

Solution: Let f(x) = √x = x^{1/2}. From the power rule, f′(x0) =
(1/2)x0^{−1/2}. Newton’s approximation (1.5) with x = 50 and
x0 = 49 yields

$$\underbrace{\sqrt{50}}_{f(x)} \approx \underbrace{\sqrt{49}}_{f(x_0)} + \underbrace{\frac{1}{2\sqrt{49}}}_{f'(x_0)} \cdot \underbrace{(50 - 49)}_{x - x_0}.$$

Simplifying, we obtain

$$\sqrt{50} \approx 7 + \frac{1}{14} \approx 7.07142.$$

The true value of √50 is 7.07106…, so the approximation is not
bad.
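The same computation can be carried out in a few lines of Python (an added illustration, not part of the text):

```python
import math

x0 = 49.0                              # a nearby point where sqrt is easy
f_x0 = math.sqrt(x0)                   # 7.0
fprime_x0 = 1 / (2 * math.sqrt(x0))    # power rule: (1/2) * x0**(-1/2)
approx = f_x0 + fprime_x0 * (50 - x0)  # Newton's approximation (1.5)
print(approx, math.sqrt(50))           # 7.0714..., 7.0710...
```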
Part 2

Vector calculus and linear algebra
CHAPTER 3

Points and vectors

One of the miseries of life is that
everybody names things a little bit wrong,
and so it makes everything a little harder
to understand.

Richard Feynman

The word “vector” can have different meanings in different


mathematical contexts, which can be confusing. The same goes
for the word “point”. But it’s a shame for such simple concepts
to be a source of confusion. Here we will explain the meanings
of the words “point” and “vector” and pin down the definitions
that will be used throughout this book.
A classic example of a data science problem is predicting the
value of a house based on information such as the house’s square
footage, the age of the house, the distance of the house from
downtown, the number of restaurants within walking distance,
etc. When collecting data about a house, inevitably we write
down a list of numbers like this. An n-tuple is simply a list of n
numbers. For example, here is a 4-tuple: (850, 10, 15.7, 4). The
order in which the numbers are written down matters, so that a
rearrangement such as (15.7, 10, 4, 850) is considered to be a dif-
ferent n-tuple. Two n-tuples (x1 , x2 , . . . , xn ) and (y1 , y2 , . . . , yn )
are equal if and only if x1 = y1 , x2 = y2 , . . . , and xn = yn .
The set of all n-tuples of real numbers is denoted Rn . The
notation x ∈ Rn means that x is an element of the set Rn , so
x = (x1 , . . . , xn ) for some real numbers x1 , . . . , xn . The numbers


x1 , . . . , xn are called the “components” of x.
If we have carefully collected a large amount of data about a
house, we can easily find ourselves working with n-tuples where
n is some large number such as 50. Much larger values of n are
typical in other applications. For example, if instead of predicting
housing prices we are working on a computer vision problem,
such as developing an algorithm to recognize people in images,
we might describe an image by listing the RGB values for each
pixel in the image. This gives us an n-tuple where n is perhaps
one million.
When n = 2 or n = 3, it is possible to visualize an n-tuple.
In fact, there are two different methods to visualize an n-tuple:
the point picture and the vector picture, which we explain below.

3.1. Method 1: The point picture


The first method to visualize a 2-tuple such as (3, 2) is simply
to draw a coordinate system and then draw the point whose
coordinates are (3, 2), as shown in figure 4. In this viewpoint,
the ordered pair (3, 2) specifies a location. If we draw a third axis
coming out of the page, then we are able to visualize 3-tuples.
When we use Method 1 to visualize an n-tuple, we often
refer to the n-tuple as a “point”. So in our terminology, a “point”
really is nothing more than an n-tuple, but using the term “point”
provides a hint that it will be helpful to visualize the n-tuple as
a location in space.

3.2. Method 2: The vector picture


In the second method, we visualize a 2-tuple such as (3, 2)
by drawing a coordinate system and then drawing an arrow that
connects a starting point (selected arbitrarily) to an ending point
which is 3 units to the right and 2 units up from the starting
point. In this viewpoint, which is illustrated in figure 5, the
ordered pair (3, 2) tells us the displacement from a starting point
to an ending point. If we draw a third axis coming out of the
page, then we can use the same method to visualize 3-tuples.

Figure 4. The point picture: We visualize the point
in space whose coordinates are (3, 2). Starting at
the origin, you move 3 units to the right and 2 units
upwards to arrive at the location shown in red.
When we use Method 2 to visualize an n-tuple, we often
refer to the n-tuple as a vector. In our terminology, a “vector”
is nothing more than an n-tuple, but using the term “vector”
suggests that we should visualize the n-tuple as the displacement
(drawn as an arrow) from one point to another.

3.3. Vector operations


The vector picture suggests doing certain things with vectors
that we would never think of doing with points. A shift in
viewpoint leads to new ideas.
Figure 5. The vector picture: We visualize the displacement
from a starting point (selected arbitrarily)
to an ending point. In this example, the arbitrarily
selected starting point has coordinates (1, 2). You
move 3 units to the right and 2 units upwards to
arrive at the ending point, which has coordinates (4, 4).

3.3.1. Adding vectors. It does not seem to make sense
visually to add together two points. However, there is a perfectly
logical way to add two displacements: if x = (x1, x2) is the
displacement from location A to location B, and y = (y1, y2) is
the displacement from location B to location C, then x + y is the
total displacement from location A to location C. If an object is
located at point A and experiences a displacement of x, followed
by a displacement of y, then the object’s total displacement is
x + y. This is illustrated in figure 6. Hopefully figure 6 makes it
clear that

$$(x_1, x_2) + (y_1, y_2) = (x_1 + y_1, x_2 + y_2).$$

A similar formula (and a similar picture) holds for vectors in
R³. Although we cannot visualize n-tuples when n is greater
than 3, we still define the sum of vectors x = (x1, . . . , xn) and
y = (y1, . . . , yn) as follows:

$$(x_1, \ldots, x_n) + (y_1, \ldots, y_n) = (x_1 + y_1, \ldots, x_n + y_n).$$
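The componentwise definition translates directly into code. Here is a minimal Python sketch (an added illustration; the function name `vec_add` is my own):

```python
def vec_add(x, y):
    # componentwise sum of two n-tuples
    assert len(x) == len(y)
    return tuple(xi + yi for xi, yi in zip(x, y))

print(vec_add((3, 2), (1, 4)))        # (4, 6)
print(vec_add((1, 2, 3), (4, 5, 6)))  # (5, 7, 9)
```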


Figure 6. Adding two vectors: The displacement
from A to B plus the displacement from B to C is
equal to the displacement from A to C.

3.3.2. Multiplying a vector by a number. Visually, it
would not make sense to multiply a point by a number. But it
is natural to multiply a vector x by a number c. Picturing x as
is natural to multiply a vector x by a number c. Picturing x as
an arrow (representing a displacement), you just scale the length
of the arrow by c without changing the arrow’s direction. If x is
visualized as a displacement, then 2x is twice the displacement,
in the same direction. This is illustrated in figure 7. Hopefully
figure 7 makes it clear that if x = (x1 , x2 ) and c is a number,
then
cx = (cx1 , cx2 ).
Although we can’t visualize vectors in Rn when n > 3, we still
define
c(x1 , . . . , xn ) = (cx1 , . . . , cxn ).
If c is negative, then cx points in the direction opposite to that of
x. When a number c is multiplied by a vector x, often c is called
a “scalar” because the length of x gets scaled by the factor c.
Figure 7. Multiplying a vector by a scalar: The magnitude
(length) of the displacement is multiplied by c, while the
direction of the displacement is unchanged.

3.3.3. The length or “norm” of a vector. A point has
no “size”, and it would not make sense to describe one point
as being somehow larger or smaller than another point. But
some displacements are larger than other displacements. If a
vector x = (x1, x2) is visualized as an arrow (representing a
displacement from one point to another), then the length of the
arrow tells us the size of the displacement. The length of a vector
x is also called the “norm” of x, and is denoted ‖x‖. Using
the Pythagorean theorem, we can see that the norm of the vector
x = (x1, x2) is

    ‖x‖ = √(x1² + x2²),

as illustrated in figure 8.
Figure 9 shows how we can use the Pythagorean theorem
twice to compute the length of a vector x = (x1, x2, x3), obtaining
the formula

    ‖x‖ = √(x1² + x2² + x3²).
Figure 8. The length or “norm” of a vector x = (x1, x2): The
Pythagorean theorem tells us that the length of the arrow is
‖x‖ = √(x1² + x2²). In this example, x = (3, 4) and
‖x‖ = √(3² + 4²) = 5.

Although we can’t visualize vectors in Rn when n > 3, we define
the norm of a vector x = (x1, . . . , xn) by

    ‖x‖ = √(x1² + · · · + xn²).

Let’s check that if we multiply a vector x = (x1, . . . , xn) by a
scalar c, the norm of the resulting vector is |c|‖x‖. By definition,
cx = (cx1, . . . , cxn). It follows that

    ‖cx‖ = √((cx1)² + · · · + (cxn)²)
         = |c| √(x1² + · · · + xn²)
         = |c|‖x‖.
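Both the norm and the scaling rule ‖cx‖ = |c|‖x‖ are easy to check numerically. Here is a minimal sketch in Python (the helper names are our own choices):

```python
import math

def norm(x):
    """The Euclidean length of a vector: sqrt(x1^2 + ... + xn^2)."""
    return math.sqrt(sum(xi * xi for xi in x))

def scale(c, x):
    """Multiply a vector by a scalar, componentwise."""
    return tuple(c * xi for xi in x)

x = (3, 4)
print(norm(x))  # 5.0
# The identity ||c x|| = |c| ||x||, checked with a negative scalar:
c = -2.0
print(math.isclose(norm(scale(c, x)), abs(c) * norm(x)))  # True
```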

A “unit vector” is a vector u whose norm is 1. Unit vectors
are convenient for specifying a direction in space. If u is a unit
vector, and t is a scalar, then the norm of tu is equal to |t|:

    ‖tu‖ = |t|‖u‖ = |t| · 1 = |t|.

So when t > 0, tu is a vector that points in the direction u and
has length t. The vector tu represents a displacement of length t
in the direction u.

Figure 9. The length of a vector x = (x1, x2, x3): We use the
Pythagorean theorem to see that the length of the blue line is
h = √(x1² + x2²). We then use the Pythagorean theorem again
to see that ‖x‖ = √(h² + x3²) = √(x1² + x2² + x3²). In this
particular example, h = √(3² + 4²) = 5, and
‖x‖ = √(5² + 12²) = 13.

There are other popular ways to measure the “size” of a vector
x = (x1, . . . , xn). For example, the quantity |x1| + · · · + |xn| is
called the “taxicab norm” of x. If x is the displacement from
point A to point B, then the taxicab norm tells us the distance we
must travel in order to get from A to B if we are only allowed to
move in directions that are parallel to one of the coordinate axes.
(Imagine driving in a city where all roads go either East-West or
North-South.) The taxicab norm is more commonly called the
ℓ1-norm (a less descriptive name), and is denoted ‖x‖1. Another
way to measure the size of a vector x is using the “worst-case
norm”, which is equal to the largest (in absolute value) component
of x and is denoted ‖x‖∞. For example, if x = (−3, 5, −9), then
‖x‖∞ = 9. The worst-case norm is usually called the ℓ∞-norm.
The length of the vector x, which we have been simply calling
the “norm” of x, is more precisely called the ℓ2-norm of x, in
order to distinguish it from these other norms that we have now
discussed. The ℓ2-norm of x is commonly denoted ‖x‖2:

    ‖x‖2 = √(x1² + · · · + xn²) = length of x,

but we will continue to use the notation ‖x‖ for the length of x.
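The three norms can be compared side by side in code. Here is a small sketch in Python, using the example x = (−3, 5, −9) from the text:

```python
import math

def l1_norm(x):
    """Taxicab norm: |x1| + ... + |xn|."""
    return sum(abs(xi) for xi in x)

def l2_norm(x):
    """Euclidean length: sqrt(x1^2 + ... + xn^2)."""
    return math.sqrt(sum(xi * xi for xi in x))

def linf_norm(x):
    """Worst-case norm: the largest component in absolute value."""
    return max(abs(xi) for xi in x)

x = (-3, 5, -9)
print(l1_norm(x))    # 17
print(linf_norm(x))  # 9
print(l2_norm(x))    # sqrt(115), about 10.72
```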

3.3.4. Adding a vector to a point. It also makes sense
visually to add a vector to a point, to obtain a new point. If x
is a point and y is a vector, then x + y is the point you would
arrive at by starting at the location x and moving through a
displacement of y. This is illustrated in figure 10. Although
the geometric interpretation is different, the addition formula
remains the same:

    (x1, . . . , xn) + (y1, . . . , yn) = (x1 + y1, . . . , xn + yn).

In calculus, we often want to compare the value of a function
f at a point x with the value of f at some nearby point that is
very close to x. Let u be a unit vector (so that ‖u‖ = 1) and
let t be a tiny positive number. Notice that ‖tu‖ = |t| = t, so
that tu is a tiny vector that represents a tiny displacement in the
direction u. Then x + tu is a point nearby x. Specifically, x + tu
is the point you would arrive at by starting at the location x and
moving a distance t in the direction u. If that is not perfectly
clear, take a moment to let it sink in, as this picture will be
crucial in multivariable calculus.

Figure 10. Adding a point x and a vector y: x + y is the point
whose displacement from x is y. In this example, x = (1, 2),
y = (3, 2), and x + y = (4, 4).

3.3.5. Subtracting a point from a point. Although it
does not make sense visually to add a point to a point, it does
make sense to subtract a point from a point to obtain a vector.
The previous section could be summarized as

    point + vector = new point.

We can rewrite this equation as

    new point − point = vector.

If x and y are points, then y − x is the displacement vector from
x to y. This is illustrated in figure 11. Notice that

    x + (y − x) = y,

where x and y are points and y − x is a vector.

Figure 11. Subtracting a point x from a point y: y − x is the
displacement vector from x to y. In this example, x = (1, 2),
y = (4, 4), and y − x = (3, 2).

3.3.6. The dot product of two vectors. There is something
called the “dot product” of vectors x and y that turns out
to be very useful. If x = (x1, . . . , xn) and y = (y1, . . . , yn) then
the dot product of x and y is denoted ⟨x, y⟩ and is defined as
follows:

    ⟨x, y⟩ = x1 y1 + x2 y2 + · · · + xn yn.    (3.1)

Notice that the dot product of x and y is a number, not a vector.
And why should we care about this number? The main reason is
that this number has a nice geometric interpretation: If θ is the
angle between the vectors x and y, then

    ⟨x, y⟩ = ‖x‖‖y‖ cos(θ).

This is illustrated in figure 12. The geometric interpretation of
the dot product is not obvious, but we’ll show where it comes
from at the end of this section.

Figure 12. The geometric interpretation of the dot product.

If x and y are perpendicular, then θ = π/2, and so cos(θ) = 0.
Using the geometric interpretation of the dot product, we see
that ⟨x, y⟩ = 0. This gives us a convenient way to check whether
or not two vectors are perpendicular — we simply check if their
dot product is equal to 0. Usually the word “orthogonal” is used
instead of “perpendicular”, perhaps because it sounds fancier,
but both words mean the same thing:

    x is orthogonal to y ⟺ ⟨x, y⟩ = 0.

Here is a question that will be important for vector calculus:
which way should y be pointing in order for ⟨x, y⟩ to be as large as
possible? Choosing y to be orthogonal to x would be a bad choice,
because then ⟨x, y⟩ = 0. Recalling that ⟨x, y⟩ = ‖x‖‖y‖ cos(θ),
and noting that −1 ≤ cos(θ) ≤ 1, we see that ⟨x, y⟩ is as large
as possible when cos(θ) = 1, or equivalently when θ is 0. This
means that y should point in the same direction as x. In this case,
⟨x, y⟩ = ‖x‖‖y‖. On the other hand, if we want ⟨x, y⟩ to be as
negative as possible, then we should pick y to point in the opposite
direction from x, so that cos(θ) = −1 and ⟨x, y⟩ = −‖x‖‖y‖.
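The geometric interpretation also lets us recover the angle between two vectors from their dot product, since cos(θ) = ⟨x, y⟩ / (‖x‖‖y‖). Here is a minimal sketch in Python (the example vectors are our own choices):

```python
import math

def dot(x, y):
    return sum(xi * yi for xi, yi in zip(x, y))

def norm(x):
    return math.sqrt(dot(x, x))

x = (1, 0)
y = (1, 1)
theta = math.acos(dot(x, y) / (norm(x) * norm(y)))
print(math.degrees(theta))  # 45.0 (up to rounding)

# Orthogonality check: perpendicular vectors have dot product 0.
print(dot((1, 0), (0, 5)))  # 0
```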
We now mention a few rules that tend to be useful when
working with vectors. Using equation (3.1), we see immediately
that

    ⟨x, x⟩ = ‖x‖²

for all vectors x ∈ Rn. We can also use equation (3.1) to easily
check that
• ⟨x + y, z⟩ = ⟨x, z⟩ + ⟨y, z⟩ for all x, y, z ∈ Rn.
• ⟨cx, y⟩ = c⟨x, y⟩ for all x, y ∈ Rn, c ∈ R.
Because ⟨x, y⟩ = ⟨y, x⟩ for all x, y ∈ Rn, we have the following
counterparts for the previous two rules:
• ⟨x, y + z⟩ = ⟨x, y⟩ + ⟨x, z⟩ for all x, y, z ∈ Rn.
• ⟨x, cy⟩ = c⟨x, y⟩ for all x, y ∈ Rn, c ∈ R.
The fact that the dot product formula is simple and has these
nice properties is one of the reasons that the dot product is so
ubiquitous. It’s important to internalize these simple rules so
that vector arithmetic becomes effortless.
Finally let’s justify the geometric interpretation of the dot
product.

Figure 13. The law of cosines is a generalization of the
Pythagorean theorem that holds for non-right triangles. The
geometric interpretation of the dot product follows from the law
of cosines.

In figure 13, it would be incorrect to invoke the Pythagorean
theorem to conclude that ‖y − x‖² = ‖x‖² + ‖y‖², because the
triangle in figure 13 is not a right triangle. However, there is a
generalization of the Pythagorean theorem known as the law of
cosines, which tells us that

    ‖y − x‖² = ‖x‖² + ‖y‖² − 2‖x‖‖y‖ cos(θ).    (3.2)

Notice that when θ = π/2, the law of cosines reduces to the
Pythagorean theorem. We can expand the left-hand side of (3.2)
as follows:

    ‖y − x‖² = ⟨y − x, y − x⟩
             = ⟨y − x, y⟩ − ⟨y − x, x⟩
             = ⟨y, y⟩ − ⟨x, y⟩ − ⟨y, x⟩ + ⟨x, x⟩
             = ‖x‖² + ‖y‖² − 2⟨x, y⟩.

Now comparing the left-hand side of (3.2) (in expanded form)
with the right-hand side, and canceling like terms from both
sides, we discover that ⟨x, y⟩ = ‖x‖‖y‖ cos(θ).
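The key algebraic step in this derivation, ‖y − x‖² = ‖x‖² + ‖y‖² − 2⟨x, y⟩, can be spot-checked numerically. Here is a small sketch in Python with made-up vectors:

```python
import math

def dot(x, y):
    return sum(xi * yi for xi, yi in zip(x, y))

x = (2.0, -1.0, 3.0)
y = (0.5, 4.0, 1.0)
diff = tuple(yi - xi for xi, yi in zip(x, y))

lhs = dot(diff, diff)                        # ||y - x||^2
rhs = dot(x, x) + dot(y, y) - 2 * dot(x, y)  # the expanded form
print(math.isclose(lhs, rhs))  # True
```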
CHAPTER 4

The gradient vector

So far we have worked with functions that take a single


number as input (for example, the number t of seconds for which
a car has been driving) and return a single number as output
(for example, the car’s position in meters at time t). Our next
step is to consider functions that take a list of numbers as input
and return a single number as output. For example, we could
have a function f that takes the coordinates x = (x1 , x2 , x3 ) of a
point in space as input and returns the temperature at the point
x as output. If you are a mosquito who likes cool weather, you
might want to find a point x for which the temperature f (x) is
as small as possible, and fly to that location.

Example 4.1. Here is an example of how functions that


take a list of numbers as input appear in prediction problems
such as predicting the price of a house. Suppose that we have
collected the following data about a large number N of recently
sold houses:
• Si is the number of square feet of the ith house.
• Ti is the age of the ith house.
• Ri is the distance of the ith house from downtown.
• yi is the selling price of the ith house.
The numbers Si , Ti , and Ri describe the ith house, and our goal
is to use these numbers somehow to predict the selling price yi
of the ith house.
A simple and popular approach, called linear regression, is
to try to find numbers w0, w1, w2, and w3 such that w0 + Si w1 +
Ti w2 + Ri w3 is a good approximation to yi, the price of the ith
house. We want:

    yi ≈ w0 + Si w1 + Ti w2 + Ri w3    (4.1)
for i = 1, . . . , N . The challenge is to select the numbers w0 , w1 ,
w2 , w3 so that the approximation (4.1) is accurate for all of the
houses for which we have collected data. It is not good enough
for the prediction error w0 + Si w1 + Ti w2 + Ri w3 − yi to be small
for just some houses but not others. We want this prediction
error to be small for all houses. In other words, we want the
total prediction error
    f(w0, w1, w2, w3) = ∑_{i=1}^{N} (w0 + Si w1 + Ti w2 + Ri w3 − yi)²

to be as small as possible.
This example shows how a goal of making accurate predic-
tions naturally leads us to the goal of minimizing an error function
such as f that takes a list of numbers as input. Many of the most
popular machine learning algorithms, such as neural networks,
are variations of this simple, classic idea.
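The error function f can be written down directly in code. Here is a minimal sketch in Python; the tiny data set is made up purely for illustration and is not from the text:

```python
def total_error(w, S, T, R, y):
    """Sum of squared prediction errors for the linear model
    w0 + S*w1 + T*w2 + R*w3."""
    w0, w1, w2, w3 = w
    return sum((w0 + S[i] * w1 + T[i] * w2 + R[i] * w3 - y[i]) ** 2
               for i in range(len(y)))

S = [1500, 2000]    # square feet (made-up data)
T = [10, 30]        # age in years
R = [2.0, 8.0]      # miles from downtown
y = [300.0, 350.0]  # selling price (thousands of dollars)

print(total_error((0.0, 0.2, 0.0, 0.0), S, T, R, y))  # 2500.0
```

Linear regression then amounts to searching for the weights w that make this number as small as possible.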

4.1. The directional derivative


It will help the mosquito find a cool spot if he can feel
how rapidly the temperature is changing in various directions.
Suppose that we are currently located at a point
x = (x1 , . . . , xn ),
and the temperature at this point is f (x). (This function f is like
a thermometer.) We are wondering how rapidly the temperature
will increase or decrease if we move away from x in the direction
u. (Here u is a unit vector.) If we move a short distance of t
meters in the direction u, then our new location is x + tu, and
the temperature at our new location is f (x + tu). The change in
temperature is f(x + tu) − f(x). The average rate of change in
the temperature is

    (change in temperature) / (distance traveled)
        = (f(x + tu) − f(x)) / t    degrees per meter.

If we select values of t that are shorter and shorter, approaching
0, this average rate of change approaches the instantaneous rate
of change of f in the direction u. This instantaneous rate of
change is denoted Du f(x):

    Du f(x) = lim_{t→0} (f(x + tu) − f(x)) / t.    (4.2)

The number Du f(x) is called the “directional derivative” of f at
x in the direction u.
Exercise 4.2. Let f : R2 → R be defined by

    f(x1, x2) = x1 x2.

Compute Du f(3, 2), where u is the unit vector (1/√2, 1/√2).

Solution:

    Du f(3, 2) = lim_{t→0} ((3 + t/√2)(2 + t/√2) − 6) / t
               = lim_{t→0} (6 + 5t/√2 + t²/2 − 6) / t
               = lim_{t→0} (5/√2 + t/2)
               = 5/√2.
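The limit in (4.2) suggests a numerical check: take a small but nonzero t and compute the difference quotient directly. Here is a sketch in Python for the exercise above:

```python
import math

def f(x1, x2):
    return x1 * x2

def directional_derivative(f, x, u, t=1e-6):
    """Finite-difference estimate of D_u f(x)."""
    shifted = tuple(xi + t * ui for xi, ui in zip(x, u))
    return (f(*shifted) - f(*x)) / t

u = (1 / math.sqrt(2), 1 / math.sqrt(2))
estimate = directional_derivative(f, (3.0, 2.0), u)
print(estimate)          # about 3.5355...
print(5 / math.sqrt(2))  # the exact answer, 3.5355...
```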

4.2. Partial derivatives


In the special case that u = (1, 0, . . . , 0), computing the
directional derivative Du f (x) is particularly easy. With this
choice of u, we have
x + tu = (x1 , x2 , . . . , xn ) + t(1, 0, . . . , 0)
= (x1 + t, x2 , . . . , xn )

and

    Du f(x) = lim_{t→0} (f(x1 + t, x2, . . . , xn) − f(x1, x2, . . . , xn)) / t.    (4.3)
The quantity on the right is the derivative of the function g :
R → R defined by
g(x1 ) = f (x1 , x2 , . . . , xn ). (4.4)
(When defining g, we are thinking of the numbers x2 , . . . , xn as
being held fixed, whereas x1 can be any real number.) In other
words,
    Du f(x) = g′(x1).

This is nice because g is a function of a single variable, which
means that g′(x1) can be computed using the arsenal of formulas
for derivatives that we derived in chapter 2.
When u = (1, 0, . . . , 0), an alternative notation for the di-
rectional derivative Du f (x) is D1 f (x). Likewise, in the special
case where the vector u has a 1 in the ith position and zeros
elsewhere, an alternative notation for the directional derivative
Du f (x) is Di f (x). These numbers Di f (x) (for i = 1, . . . , n) are
called the partial derivatives of f at x. Computing Di f (x) is
easy for the same reason that computing D1 f (x) is easy:

To compute Di f (x), think of f as a function of xi alone


(with the other components of x held fixed to constant
values), and then take the derivative using single-variable
calculus techniques from chapter 2.

Note that another very common notation for Di f(x) that you
will see in other books is ∂f(x)/∂xi.

Exercise 4.3. Let f : R3 → R be defined by

    f(x1, x2, x3) = x1 e^(x2 x3).

Compute the partial derivatives of f.

Solution: Thinking of x1 e^(x2 x3) as a function of x1 alone, with x2
and x3 held fixed, we see that

    D1 f(x) = e^(x2 x3).

On the other hand, thinking of x1 e^(x2 x3) as a function of x2 alone,
we see that

    D2 f(x) = x1 x3 e^(x2 x3).

Finally, by thinking of x1 e^(x2 x3) as a function of x3 alone, we see
that

    D3 f(x) = x1 x2 e^(x2 x3).
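We can sanity-check these formulas with finite differences, perturbing one coordinate at a time. Here is a sketch in Python (the evaluation point is our own choice):

```python
import math

def f(x1, x2, x3):
    return x1 * math.exp(x2 * x3)

def partial(f, x, i, h=1e-7):
    """Finite-difference estimate of the partial derivative D_i f(x).
    Here i is 0-based: i = 0 corresponds to D1, and so on."""
    shifted = list(x)
    shifted[i] += h
    return (f(*shifted) - f(*x)) / h

x = (2.0, 0.5, 1.0)
# Exact partials from the worked solution: D1 = e^(x2 x3),
# D2 = x1 x3 e^(x2 x3), D3 = x1 x2 e^(x2 x3).
exact = (math.exp(0.5), 2.0 * 1.0 * math.exp(0.5), 2.0 * 0.5 * math.exp(0.5))
for i in range(3):
    print(abs(partial(f, x, i) - exact[i]) < 1e-4)  # True each time
```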

4.3. Newton’s approximation for partial derivatives


Newton’s approximation for the function g in equation (4.4)
tells us that g(x1 + ∆x1) ≈ g(x1) + g′(x1)∆x1, or equivalently

    f(x1 + ∆x1, x2, . . . , xn) ≈ f(x) + D1 f(x)∆x1.
Similarly, if we increase xi by a small amount ∆xi and leave the
other components of x unchanged, then we have the following
version of Newton’s approximation for partial derivatives:

f (x1 , . . . , xi + ∆xi , . . . , xn ) ≈ f (x) + Di f (x)∆xi . (4.5)

In words, if you start at a point x and move a distance ∆xi in


the direction of the ith axis, then the change in the value of f is
approximately Di f (x)∆xi .

4.4. Newton’s approximation when f : Rn → R


Newton’s approximation answers the fundamental question
of calculus: how much does the value of f change when its input
changes by a small amount ∆x? We now address this question
in the case where f : Rn → R and ∆x ∈ Rn . To keep notation
simple we look at the case where n = 2.
Let ∆f be the amount that f changes when its input changes
from point A = (x1 , x2 ) to point C = (x1 +∆x1 , x2 +∆x2 ). Notice
that
    ∆f = ∆f1 + ∆f2,

where ∆f1 is the amount that f changes when its input changes
from point A to point B = (x1 + ∆x1, x2), and ∆f2 is the amount
that f changes when its input changes from point B to point C.
This is illustrated in figure 14.

Figure 14. The change in temperature when moving from point
A to point C is equal to the change in temperature from point A
to point B plus the change in temperature from point B to
point C. Algebraically, the second term in red cancels with the
first term in blue.
According to equation (4.5), if ∆x1 and ∆x2 are small num-
bers then
∆f1 ≈ D1 f (x1 , x2 )∆x1
and
∆f2 ≈ D2 f (x1 + ∆x1 , x2 )∆x2
≈ D2 f (x1 , x2 )∆x2 .
Putting these pieces together, we find that
∆f ≈ D1 f (x1 , x2 )∆x1 + D2 f (x1 , x2 )∆x2 . (4.6)
In words, as you move from A to B to C, the value of f changes
first by D1 f (x)∆x1 and then by D2 f (x)∆x2 .
The expression on the right in (4.6) looks like the dot product
of two vectors. This suggests that equation (4.6) can be written
more concisely if we introduce the vector (D1 f(x), . . . , Dn f(x)),
which we shall call the gradient of f at x and which is denoted
by ∇f(x). The gradient vector ∇f(x) is just a list of all the
partial derivatives of f at x. With this notation, equation (4.6)
becomes

    ∆f ≈ ⟨∇f(x), ∆x⟩.    (4.7)

Equivalently:

    f(x + ∆x) ≈ f(x) + ⟨∇f(x), ∆x⟩.    (4.8)

This is Newton’s approximation in the case that f : Rn → R.
(Although we took n = 2 for simplicity, a similar derivation works
for any value of n.)
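Newton’s approximation (4.8) is easy to test numerically. Here is a sketch in Python with a made-up function of two variables (not from the text), whose gradient we compute by hand:

```python
def f(x1, x2):
    return x1 ** 2 + 3 * x1 * x2

def grad_f(x1, x2):
    # Partial derivatives of f, computed by hand:
    return (2 * x1 + 3 * x2, 3 * x1)

x = (1.0, 2.0)
dx = (0.01, -0.02)

actual = f(x[0] + dx[0], x[1] + dx[1]) - f(*x)
g = grad_f(*x)
predicted = g[0] * dx[0] + g[1] * dx[1]  # the dot product <grad f(x), dx>
print(actual, predicted)  # both close to 0.02
```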

4.5. A formula for directional derivatives


We can easily discover a formula for directional derivatives
by using Newton’s approximation with ∆x = tu:

    Du f(x) = lim_{t→0} (f(x + tu) − f(x)) / t
            = lim_{t→0} (f(x) + ⟨∇f(x), tu⟩ − f(x)) / t
            = lim_{t→0} t⟨∇f(x), u⟩ / t
            = ⟨∇f(x), u⟩.

According to this formula, to compute the directional derivative
Du f(x) we can just take the dot product of ∇f(x) with u:

    Du f(x) = ⟨∇f(x), u⟩.

4.6. The direction of steepest ascent


For which direction u is the directional derivative Du f(x) as
large as possible? Notice that if u is a unit vector then

    Du f(x) = ⟨∇f(x), u⟩ = ‖∇f(x)‖‖u‖ cos(θ) = ‖∇f(x)‖ cos(θ),

where θ is the angle between ∇f(x) and u. The term cos(θ)
is always between −1 and 1, so the largest possible value that
Du f(x) can have is ‖∇f(x)‖ · 1. This occurs when θ = 0, which
means that u points in the same direction as ∇f(x). Thus:

    The gradient vector points in the direction of steepest ascent.

Moreover, the magnitude of the gradient has a meaning also: if
u is a unit vector that points in this direction of steepest ascent,
then Du f(x) = ‖∇f(x)‖.
So the gradient vector tells you two things: the direction of
steepest ascent, and the rate of change of f in that direction.
With this visual interpretation, the gradient vector has sprung
to life as a geometric object.
Similarly, −∇f (x) points in the direction of steepest descent.
If f (x) is the temperature at the point x, and you are a mosquito
who likes cool temperatures, you will want to fly in the direction
−∇f (x).
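We can see the direction of steepest ascent numerically by evaluating Du f(x) = ⟨∇f(x), u⟩ for a few unit directions. Here is a sketch in Python with a made-up gradient vector:

```python
def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

grad = (3.0, 4.0)  # a made-up gradient; its norm is 5
directions = [(1.0, 0.0), (0.0, 1.0), (0.6, 0.8), (-0.6, -0.8)]
for u in directions:
    print(u, dot(grad, u))
# The largest value, 5.0 = ||grad||, occurs for u = (0.6, 0.8),
# which is grad divided by its norm. The most negative value
# occurs for the opposite direction, (-0.6, -0.8).
```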
CHAPTER 5

The Jacobian matrix

Often we encounter functions that take a list of numbers as


input and return a list of numbers as output. For example, the
input could be an image of a handwritten digit (stored as a list
of pixel intensity values), and the output could be a list of ten
probabilities: the probability that the digit is a 0, the probability
that it is a 1, etc. Our goal now is to discover a version of
Newton’s approximation for such a function f : Rn → Rm . As
always, the purpose of Newton’s approximation is to estimate
the value of f (x + ∆x).
To be concrete, suppose that f : R3 → R3. If x is a point in
R3, then f(x) is a list of three numbers, which could be called
f1(x), f2(x), and f3(x). The functions fi : R3 → R defined in
this way are called the component functions of f :

    f(x) = (f1(x), f2(x), f3(x)) for all points x in R3.

In other words, fi (x) is by definition the ith component of the


point f (x).
If ∆x = (∆x1 , ∆x2 , ∆x3 ) is a small vector in R3 , then New-
ton’s approximation (4.8) for the function f1 tells us that

f1 (x + ∆x) ≈ f1 (x) + D1 f1 (x)∆x1 + D2 f1 (x)∆x2 + D3 f1 (x)∆x3 .

We have similar approximations for f2 (x + ∆x) and f3 (x + ∆x).


Using these approximations for the component functions of f
allows us to approximate f (x + ∆x). At this point, our equations
will start to look nicer if we sometimes write vectors vertically

instead of horizontally:

    f(x + ∆x) = [ f1(x + ∆x) ]
                [ f2(x + ∆x) ]
                [ f3(x + ∆x) ]

              ≈ [ f1(x) + D1f1(x)∆x1 + D2f1(x)∆x2 + D3f1(x)∆x3 ]
                [ f2(x) + D1f2(x)∆x1 + D2f2(x)∆x2 + D3f2(x)∆x3 ]
                [ f3(x) + D1f3(x)∆x1 + D2f3(x)∆x2 + D3f3(x)∆x3 ]

              = [ f1(x) ]   [ D1f1(x)∆x1 + D2f1(x)∆x2 + D3f1(x)∆x3 ]
                [ f2(x) ] + [ D1f2(x)∆x1 + D2f2(x)∆x2 + D3f2(x)∆x3 ]
                [ f3(x) ]   [ D1f3(x)∆x1 + D2f3(x)∆x2 + D3f3(x)∆x3 ]

The first array on the right is f(x); we need concise notation for
the second.

The messy-looking expression on the right can be written more


concisely if we introduce some new notation. There is something
wasteful about the expression on the right, because the symbols
∆x1 , ∆x2 , and ∆x3 are written repeatedly, once on each row.
We have wasted ink. The same information can be conveyed
with less writing if we only write down the arrays of numbers
   
    [ D1f1(x)  D2f1(x)  D3f1(x) ]        [ ∆x1 ]
    [ D1f2(x)  D2f2(x)  D3f2(x) ]  and   [ ∆x2 ]
    [ D1f3(x)  D2f3(x)  D3f3(x) ]        [ ∆x3 ]

and place them side by side. In other words, we will now declare
that the expression

    [ D1f1(x)  D2f1(x)  D3f1(x) ] [ ∆x1 ]
    [ D1f2(x)  D2f2(x)  D3f2(x) ] [ ∆x2 ]    (5.1)
    [ D1f3(x)  D2f3(x)  D3f3(x) ] [ ∆x3 ]

means the same thing as

    [ D1f1(x)∆x1 + D2f1(x)∆x2 + D3f1(x)∆x3 ]
    [ D1f2(x)∆x1 + D2f2(x)∆x2 + D3f2(x)∆x3 ]    (5.2)
    [ D1f3(x)∆x1 + D2f3(x)∆x2 + D3f3(x)∆x3 ]
A 3 × 3 array of numbers such as the one on the left in (5.1) is
called a matrix. With this new “matrix notation”, Newton’s
approximation for our function f : R3 → R3 can now be written
as

    [ f1(x + ∆x) ]   [ f1(x) ]   [ D1f1(x)  D2f1(x)  D3f1(x) ] [ ∆x1 ]
    [ f2(x + ∆x) ] ≈ [ f2(x) ] + [ D1f2(x)  D2f2(x)  D3f2(x) ] [ ∆x2 ]
    [ f3(x + ∆x) ]   [ f3(x) ]   [ D1f3(x)  D2f3(x)  D3f3(x) ] [ ∆x3 ]

Comparing this with our familiar way of writing Newton’s


approximation,
    f(x + ∆x) ≈ f(x) + f′(x)∆x,

suggests that the matrix above should in fact be denoted as f′(x)
and should be called the derivative of f at x. In summary, when
f : R3 → R3, we define f′(x) to be the 3 × 3 matrix:

    f′(x) = [ D1f1(x)  D2f1(x)  D3f1(x) ]
            [ D1f2(x)  D2f2(x)  D3f2(x) ]
            [ D1f3(x)  D2f3(x)  D3f3(x) ]

Similarly, when f : Rn → Rm, we define f′(x) to be the following
rectangular array of numbers:

    f′(x) = [ D1f1(x)  · · ·  Dnf1(x) ]
            [    ···    ···     ···   ]
            [ D1fm(x)  · · ·  Dnfm(x) ]
A rectangular array of numbers with m rows and n columns is
called an m × n matrix. The immediate triumph of our new
matrix notation is that Newton’s approximation for a function
f : Rn → Rm looks identical to Newton’s approximation for a
function f : R → R:
    f(x + ∆x) ≈ f(x) + f′(x)∆x.    (5.3)

Here x and ∆x are vectors in Rn, f(x + ∆x) and f(x) are vectors
in Rm, and f′(x) is an m × n matrix.
We emphasize the following fact:

    If f : Rn → Rm, then f′(x) is an m × n matrix.

This matrix f′(x) is often called the “Jacobian” or the “Jacobian
matrix” of f at x. However, I think a much better name for f′(x)
is simply “the derivative of f at x”.
Example 5.1. If f : Rn → R, then f′(x) is a 1 × n matrix.
In detail,

    f′(x) = [ D1 f(x)  D2 f(x)  · · ·  Dn f(x) ].

A matrix with just one row is also called a “row vector”.


Example 5.2. Let f : Rn → R be the function defined by

    f(x) = ½‖x‖² = ½(x1² + x2² + · · · + xn²).

(The numbers x1, . . . , xn are the components of the vector x.)
Thinking of f as a function of x1 alone, with x2, . . . , xn held
fixed, we see that

    D1 f(x) = x1.

Likewise, the ith partial derivative of f is Di f(x) = xi (for
i = 1, . . . , n). Thus,

    f′(x) = [ x1  x2  · · ·  xn ].
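The Jacobian matrix can be approximated numerically by perturbing one input coordinate at a time, one column per coordinate. Here is a sketch in Python with a made-up map from R2 to R2 (not from the text):

```python
import math

def f(x):
    """An example map from R^2 to R^2."""
    x1, x2 = x
    return (x1 * x2, math.sin(x1))

def jacobian(f, x, h=1e-6):
    """Approximate f'(x) as an m x n matrix (a list of rows) whose
    (i, j) entry is a finite-difference estimate of D_j f_i(x)."""
    fx = f(x)
    m, n = len(fx), len(x)
    J = []
    for i in range(m):
        row = []
        for j in range(n):
            shifted = list(x)
            shifted[j] += h
            row.append((f(shifted)[i] - fx[i]) / h)
        J.append(row)
    return J

x = (1.0, 2.0)
J = jacobian(f, x)
# The exact Jacobian here is [[x2, x1], [cos(x1), 0]] = [[2, 1], [cos(1), 0]].
print(J)
```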
CHAPTER 6

Matrix multiplication

6.1. A matrix wants to operate on a vector


Our “matrix notation” has already saved writing, but matri-
ces come to life if we take the viewpoint that the matrix in (5.1)
is performing an operation on the vector ∆x, resulting in the new
vector (5.2). In this viewpoint, a matrix is not an inert object.
It has a mission in life: to operate on a vector. We shall call this
operation “multiplication” (reusing an old word), and we shall
say that in the expression (5.1) we are “multiplying” the matrix
on the left by the vector ∆x.
An m × n matrix is, by definition, a rectangular array of
numbers with m rows and n columns. The set of all possible
m × n matrices is denoted Rm×n . If a matrix
 
    A = [ a11  a12  · · ·  a1n ]
        [ a21  a22  · · ·  a2n ]
        [ ···  ···   ···   ··· ]
        [ am1  am2  · · ·  amn ]

is “multiplied” by a vector

    x = [ x1 ]
        [ x2 ]  ∈ Rn
        [ ·· ]
        [ xn ]

then the result is the vector Ax ∈ Rm defined by

    [ a11  a12  · · ·  a1n ] [ x1 ]   [ a11x1 + a12x2 + · · · + a1nxn ]
    [ a21  a22  · · ·  a2n ] [ x2 ]   [ a21x1 + a22x2 + · · · + a2nxn ]
    [ ···  ···   ···   ··· ] [ ·· ] = [              ···              ]
    [ am1  am2  · · ·  amn ] [ xn ]   [ am1x1 + am2x2 + · · · + amnxn ]

This is the definition of multiplying a matrix by a vector, an


operation that we invented when we declared that expression (5.1)
means the same thing as expression (5.2). Notice the following
pattern:

The ith entry of the vector Ax is the dot product of the


ith row of A with the vector x.
Throughout math, calculations which can be expressed as
multiplying a matrix by a vector are ubiquitous. This is true in
part because Newton’s approximation is so ubiquitous.
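The row-by-row pattern above translates directly into code. Here is a minimal sketch in Python, representing a matrix as a list of rows:

```python
def matvec(A, x):
    """Multiply an m x n matrix A (a list of rows) by a vector x.
    The i-th entry of Ax is the dot product of the i-th row with x."""
    return [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) for row in A]

A = [[1, 2, 3],
     [4, 5, 6]]
x = [1, 0, -1]
print(matvec(A, x))  # [-2, -2]
```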

6.2. Some useful rules of arithmetic


It’s not hard to check that the following basic arithmetic
rules are satisfied.

6.2.1. Distributive rule. If x and y are vectors in Rn then


A(x + y) = Ax + Ay. (6.1)
This rule is called the “distributive” property of matrix-vector
multiplication.
We have a similar distributive rule for multiplying A by a
sum of any finite number of vectors. For example, if x, y, z ∈ Rn
then
A(x + y + z) = Ax + Ay + Az.
(Proof: Just use rule (6.1) twice, to obtain A(x + y + z) =
A(x + y) + Az = Ax + Ay + Az.)

6.2.2. Multiplying by a scalar. If we define the product


of a scalar c with a matrix A in the obvious way, so that cA is
the matrix obtained by multiplying each entry of A by c, then
we have
A(cx) = c(Ax) = (cA)x.
It follows from this rule that the expression cAx is unambiguous,
because either interpretation (cA)x or c(Ax) yields the same
result.
Exercise 6.1. Suppose that x1 , x2 , x3 are vectors in Rn and
c1 , c2 , and c3 are scalars (that is, real numbers). Explain why
A(c1 x1 + c2 x2 + c3 x3 ) = c1 Ax1 + c2 Ax2 + c3 Ax3 .

6.3. Another perspective on matrix-vector


multiplication
There is a different, more visual way to think about multiplying
a matrix by a vector that turns out to be very useful. Notice
that

    Ax = [ a11x1 + a12x2 + · · · + a1nxn ]
         [ a21x1 + a22x2 + · · · + a2nxn ]
         [              ···              ]
         [ am1x1 + am2x2 + · · · + amnxn ]

       = x1 [ a11 ]      [ a12 ]              [ a1n ]
            [ a21 ] + x2 [ a22 ] + · · · + xn [ a2n ]    (6.2)
            [ ··· ]      [ ··· ]              [ ··· ]
            [ am1 ]      [ am2 ]              [ amn ]
Thus,

Ax is a linear combination of the columns of A.


This is, in fact, my favorite way to think about multiplying a
matrix by a vector. I find that it is usually the most illuminating
viewpoint. I will call this the “visual interpretation” of matrix-
vector multiplication.
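The column viewpoint gives a second, equally valid way to compute Ax: accumulate x1 times the first column, plus x2 times the second column, and so on. Here is a sketch in Python:

```python
def matvec_by_columns(A, x):
    """Compute Ax as the linear combination x1*a1 + ... + xn*an
    of the columns of A."""
    m, n = len(A), len(A[0])
    result = [0.0] * m
    for j in range(n):                   # for each column a_j ...
        for i in range(m):
            result[i] += x[j] * A[i][j]  # ... add x_j times a_j
    return result

A = [[1, 2, 3],
     [4, 5, 6]]
x = [1, 0, -1]
print(matvec_by_columns(A, x))  # [-2.0, -2.0], same as the row-wise answer
```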

We can state equation (6.2) more concisely if we let aj denote
the jth column of A, i.e.,

    aj = [ a1j ]
         [ a2j ]   for j = 1, . . . , n.
         [ ··· ]
         [ amj ]

Then we can write A using “block notation” as

    A = [ a1  a2  · · ·  an ]

and equation (6.2) becomes

    Ax = x1 a1 + x2 a2 + · · · + xn an.    (6.3)

Exercise 6.2. Suppose that M1 and M2 are m × n matrices


and M1 x = M2 x for all vectors x ∈ Rn . Show that M1 = M2 .

Solution: Take x = (1, 0, . . . , 0). Then from equation (6.3) we


see that M1 x is equal to the first column of M1 and M2 x is the
first column of M2 . So the first column of M1 is equal to the
first column of M2 . A similar argument shows that the second
column of M1 is equal to the second column of M2 , and so on.
Thus, M1 = M2 .

6.4. Multiplying a matrix by a matrix


Now suppose that B ∈ Rk×m. If x ∈ Rn, then

    B(Ax) = B(x1 a1 + x2 a2 + · · · + xn an)
          = x1 Ba1 + x2 Ba2 + · · · + xn Ban    (6.4)
          = M x    (6.5)

where M is the matrix whose jth column is Baj . (In going


from (6.4) to (6.5), we have used the “visual interpretation” of
matrix-vector multiplication.)
6.5. WHEN MULTIPLYING MATRICES, ORDER MATTERS 57

In other words, multiplying first by A and then by B yields


the same result as simply multiplying by M :

B(Ax) = M x

for all x ∈ Rn . This matrix M is called the “product” of B and


A and is denoted as BA. We again reuse the word “multiply”,
and say that M is the result of “multiplying” B by A. Using
block notation, the definition of BA is

    BA = [ Ba1  Ba2  · · ·  Ban ].
We can only multiply a matrix B by a matrix A if their
shapes are compatible, meaning that the number of columns of
B is equal to the number of rows of A. Otherwise, BA is not
defined.
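The column-by-column definition of BA is easy to implement and test. Here is a sketch in Python, reusing a matrix-vector routine, with small made-up matrices:

```python
def matvec(B, x):
    return [sum(a * b for a, b in zip(row, x)) for row in B]

def matmat(B, A):
    """Compute BA column by column: the j-th column of BA is B
    times the j-th column of A."""
    n = len(A[0])
    cols = [matvec(B, [row[j] for row in A]) for j in range(n)]
    # Reassemble the columns into rows.
    return [[col[i] for col in cols] for i in range(len(cols[0]))]

B = [[1, 0],
     [2, 1]]     # 2 x 2
A = [[3, 1, 0],
     [0, 1, 2]]  # 2 x 3
x = [1, 1, 1]
# The defining property: B(Ax) = (BA)x for every x.
print(matvec(B, matvec(A, x)) == matvec(matmat(B, A), x))  # True
```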

6.5. When multiplying matrices, order matters


Warning: The order in which we multiply matrices matters.
It is usually not true that BA = AB. In fact, the product AB
might not even be defined (even if BA is defined).
However, it is always true that

C(BA) = (CB)A,

provided that A, B, and C are matrices with compatible shapes.


To see this, just check that multiplying a vector x by the matrix
on the left always yields the same result as multiplying x by
the matrix on the right. The result of multiplying x by the
matrix C(BA) is
(C(BA)) x = C ((BA)x)
= C (B(Ax)) .
Meanwhile, the result of multiplying x by the matrix (CB)A is
((CB)A) x = (CB)(Ax)
= C(B(Ax)).
So we get the same result either way.

6.6. Conventions about column vectors and row


vectors
A “column vector” is a matrix with just one column. From
now on, we shall declare that the elements of Rn are column
vectors. With this convention, matrix-vector multiplication is a
special case of matrix-matrix multiplication. If x ∈ Rn , then x is
an n × 1 matrix (that is, column vector), and when we compute
Ax we are multiplying an m × n matrix A by an n × 1 matrix x.
A “row vector” is a matrix with just one row. Notice that
if u and v are vectors in Rn (so u and v are column vectors),
we can compute their dot product by first flipping u sideways,
turning it into a row vector, and then multiplying the resulting
row vector by v:
 
v1
   v2 

u1 u2 · · · un  .  = u1 v1 + u2 v2 + · · · + un vn = hu, vi.

 .. 
vn
On the left we are multiplying a 1 × n matrix by an n × 1 matrix.
The row vector obtained by flipping u sideways is called the
“transpose” of u and is denoted uT . With this notation, the above
equation tells us that
⟨u, v⟩ = uT v. (6.6)
Notice that uT v = vT u, because ⟨u, v⟩ = ⟨v, u⟩.
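In NumPy terms, this looks as follows. (A sketch; representing column vectors as n × 1 arrays is our choice for the illustration.)

```python
import numpy as np

u = np.array([[1.0], [2.0], [3.0]])  # column vector: a 3 x 1 matrix
v = np.array([[4.0], [5.0], [6.0]])

# u^T v is a 1 x 1 matrix whose single entry is the dot product <u, v>.
uTv = u.T @ v
assert uTv.shape == (1, 1)
assert uTv[0, 0] == 1.0 * 4.0 + 2.0 * 5.0 + 3.0 * 6.0  # = 32

# <u, v> = <v, u>, so u^T v = v^T u.
assert uTv[0, 0] == (v.T @ u)[0, 0]
```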
Suppose that f : Rn → R and x ∈ Rn . Because we have
declared that the elements of Rn are column vectors, the row
vector f 0 (x) = [ D1 f (x) · · · Dn f (x) ] is not an element of Rn .
We shall adopt the convention that the gradient of f at x is a
column vector:

           [ D1 f (x) ]
           [ D2 f (x) ]
∇f (x)  =  [    ..    ]  =  f 0 (x)T .
           [ Dn f (x) ]

With this convention, Newton’s approximation f (x + ∆x) ≈
f (x) + f 0 (x)∆x can be written equivalently as
f (x + ∆x) ≈ f (x) + ∇f (x)T ∆x.
Defining the gradient to be a column vector is convenient
because it will allow us to make statements such as the following:
if we are located at x and we move a short distance in the direction
of steepest descent for f , then our new location is x − t∇f (x) (for
some scalar t). Repeatedly moving in the direction of steepest
descent is a good strategy for finding a point x∗ at which f has
a minimum value. This strategy is called “gradient descent” and
is fundamental in machine learning.
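Here is a minimal sketch of gradient descent on a simple bowl-shaped function. The function, step size, and iteration count are illustrative choices, not prescriptions:

```python
import numpy as np

def f(x):
    return (x[0] - 1.0) ** 2 + (x[1] + 2.0) ** 2

def grad_f(x):
    return np.array([2.0 * (x[0] - 1.0), 2.0 * (x[1] + 2.0)])

x = np.zeros(2)
t = 0.1                    # step size (the "learning rate")
for _ in range(200):
    x = x - t * grad_f(x)  # move in the direction of steepest descent

# x should now be close to the minimizer (1, -2), where the gradient is 0.
assert np.allclose(x, [1.0, -2.0], atol=1e-6)
```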
Exercise 6.3. Suppose that u, v ∈ Rn and c ∈ R. Explain
why
(u + v)T = uT + v T and (cu)T = cuT .

Solution: What difference does it make if we flip then add or add


then flip? Likewise, what difference does it make if we flip then
scale by c or if we scale by c and then flip? We get the same
result either way.
Exercise 6.4. Suppose that u1 , u2 , u3 ∈ Rn and c1 , c2 , c3 ∈
R. Explain why
(c1 u1 + c2 u2 + c3 u3 )T = c1 u1T + c2 u2T + c3 u3T .
Solution: As with the previous exercise, this fact might seem

obvious, because what difference does it make if our column


vectors are tipped sideways before or after we combine them?
Another way to think about it is to use the results of the previous
exercise repeatedly:
(c1 u1 + c2 u2 + c3 u3 )T = (c1 u1 + c2 u2 )T + (c3 u3 )T
                          = (c1 u1 )T + (c2 u2 )T + (c3 u3 )T
                          = c1 u1T + c2 u2T + c3 u3T .

6.7. Transposing matrices


In the previous section we transposed a column vector to
obtain a row vector. Now we will ask what is the transpose of
Ax, where A is an m × n matrix and x ∈ Rn . This question
leads us to discover the “transpose” of a matrix A.
6.7.1. Visual interpretation of z T A. The “visual inter-
pretation” of matrix-vector multiplication tells us that Ax is a
linear combination of the columns of A. (See section 6.3.) If
z ∈ Rm , there is a similar “visual interpretation” of the product
z T A:
z T A is a linear combination of the rows of A. (6.7)
To see this concretely, let’s look at the special case where m = 3
and n = 2. Then z and A can be written in detail as

     [ z1 ]         [ a11  a12 ]
z =  [ z2 ] ,   A = [ a21  a22 ]
     [ z3 ]         [ a31  a32 ]

and

                        [ a11  a12 ]
z T A = [ z1  z2  z3 ]  [ a21  a22 ]
                        [ a31  a32 ]

      = [ z1 a11 + z2 a21 + z3 a31   z1 a12 + z2 a22 + z3 a32 ]

      = z1 [ a11  a12 ] + z2 [ a21  a22 ] + z3 [ a31  a32 ] ,
           (first row of A)  (second row of A)  (third row of A)
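This row interpretation of z T A is easy to verify numerically. A sketch with arbitrary values:

```python
import numpy as np

z = np.array([2.0, -1.0, 3.0])  # z in R^3
A = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])      # 3 x 2

# z^T A is the linear combination z1*(row 1) + z2*(row 2) + z3*(row 3).
combo = z[0] * A[0, :] + z[1] * A[1, :] + z[2] * A[2, :]
assert np.allclose(z @ A, combo)
```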

6.7.2. The transpose of a matrix. Sometimes we might


need to compute the transpose of the column vector Ax. Let aj
be the jth column of an m × n matrix A. Then
(Ax)T = (x1 a1 + x2 a2 + · · · + xn an )T
      = x1 a1T + x2 a2T + · · · + xn anT .
But here we have a linear combination of row vectors. According
to the “visual interpretation” above, this linear combination of

row vectors is equal to xT M , where M is the matrix whose rows
are a1T , a2T , . . . , anT . This matrix M is called the “transpose” of
A, and is denoted AT . With this notation, we have

(Ax)T = xT AT .

Notice that the first column of A is the first row of AT , the


second column of A is the second row of AT , and so on. For
example,

[ a  b ]T    [ a  c ]
[ c  d ]   = [ b  d ] .
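A quick NumPy illustration of the transpose and of the identity (Ax)T = xT AT (the values are arbitrary):

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0]])
x = np.array([[5.0],
              [6.0]])  # column vector (2 x 1)

# Column j of A becomes row j of A^T.
assert np.array_equal(A.T, np.array([[1.0, 3.0],
                                     [2.0, 4.0]]))

# (Ax)^T = x^T A^T
assert np.allclose((A @ x).T, x.T @ A.T)
```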

Exercise 6.5. Suppose that A is an m × n matrix, and that


x ∈ Rn and y ∈ Rm . Show that

⟨Ax, y⟩ = ⟨x, AT y⟩.

(As you go deeper in math, this turns out to be in some sense


the most essential fact about the transpose of a matrix. When I
think about AT , I usually think about this equation.)

Solution: Using equation (6.6), we have

⟨Ax, y⟩ = (Ax)T y
        = (xT AT )y
        = xT (AT y)
        = ⟨x, AT y⟩.

Exercise 6.6. Suppose that A is an m × n matrix and B is


a k × m matrix. Show that

(BA)T = AT B T .

Solution: If x ∈ Rn and y ∈ Rk , then


⟨(BA)x, y⟩ = ⟨B(Ax), y⟩
           = ⟨Ax, B T y⟩
           = ⟨x, AT B T y⟩.
On the other hand,
⟨(BA)x, y⟩ = ⟨x, (BA)T y⟩.
So we have discovered that
⟨x, (BA)T y⟩ = ⟨x, (AT B T )y⟩ (6.8)
for all x ∈ Rn , y ∈ Rk . This suggests that (BA)T = AT B T .
To finish off the argument, take
y = (1, 0, . . . , 0)T ∈ Rk  and  x = (1, 0, . . . , 0)T ∈ Rn .
Then (BA)T y is the first column of (BA)T , and ⟨x, (BA)T y⟩ is
the first entry of the first column of (BA)T . Likewise, ⟨x, (AT B T )y⟩
is the upper left entry of AT B T . Thus, (BA)T and AT B T have
the same upper left entry. A similar argument shows that all of
the corresponding entries of (BA)T and AT B T are equal.
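Both the reversal rule (BA)T = AT B T and the defining property ⟨Ax, y⟩ = ⟨x, AT y⟩ can be spot-checked numerically. A sketch with random matrices of compatible shapes:

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((4, 3))  # m x n with m = 4, n = 3
B = rng.standard_normal((2, 4))  # k x m with k = 2

# (BA)^T = A^T B^T  -- note the reversal of the order.
assert np.allclose((B @ A).T, A.T @ B.T)

# The defining property of the transpose: <Ax, y> = <x, A^T y>.
x = rng.standard_normal(3)
y = rng.standard_normal(4)
assert np.isclose(np.dot(A @ x, y), np.dot(x, A.T @ y))
```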

6.8. Matrix addition


Having discussed matrix multiplication, we should also men-
tion the simpler operation of matrix addition. Suppose A and B
are m × n matrices. We define A + B to be the m × n matrix
obtained by adding together the corresponding entries of A and
B. For example, if m = n = 2 then

[ a11  a12 ]   [ b11  b12 ]   [ a11 + b11   a12 + b12 ]
[ a21  a22 ] + [ b21  b22 ] = [ a21 + b21   a22 + b22 ] .

Note that we can only add together matrices which have the
same shape. We could not, for example, add a row vector to a
column vector.
Exercise 6.7. Suppose that x ∈ Rn . Explain why
(A + B)x = Ax + Bx.

Solution: It’s probably more clear to check this for yourself than
to read my explanation. Nevertheless, let aiT and biT be the ith
rows of A and B, respectively. Then the ith row of A + B is
aiT + biT , and the ith entry of (A + B)x is (aiT + biT )x = aiT x + biT x.
But this is the sum of the ith entries of Ax and Bx.
Exercise 6.8. Suppose C is an n × k matrix. Explain why
(A + B)C = AC + BC.

Solution: Let ci be the ith column of C. The ith column of


(A + B)C is (A + B)ci = Aci + Bci . But this is the same as the
ith column of AC + BC.
Exercise 6.9. Suppose C is a k × m matrix. Explain why
C(A + B) = CA + CB.

Solution: Let aj and bj be the jth columns of A and B, respec-


tively. The jth column of C(A + B) is C(aj + bj ) = Caj + Cbj .
But this is the same as the jth column of CA + CB.

6.9. Additional exercises


Exercise 6.10. The arithmetic rules given in section 6.2 can
be summarized as stating that if A ∈ Rm×n , then the function
L : Rn → Rm defined by L(x) = Ax is a “linear transformation”,
which means that
(1) L(x + y) = L(x) + L(y) for all vectors x, y ∈ Rn .
(2) L(cx) = cL(x) for all scalars c ∈ R and vectors x ∈ Rn .
Suppose that L : Rn → Rm is a linear transformation. Show
that there exists a matrix A such that L(x) = Ax for all x ∈ Rn .

Solution: Let ej be the jth standard basis vector for Rn , so that


ej has a 1 in the jth position and zeros elsewhere. If x ∈ Rn ,
then
x = (x1 , x2 , . . . , xn )T  can be written as  x = x1 e1 + x2 e2 + · · · + xn en .
It follows that
L(x) = L(x1 e1 + x2 e2 + · · · + xn en )
= x1 L(e1 ) + x2 L(e2 ) + · · · + xn L(en )
= Ax,
where A is the matrix whose jth column is L(ej ).
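The construction in this solution can be carried out directly in code. Here is a sketch using a hypothetical linear transformation (rotation of the plane by 90 degrees, our choice for illustration):

```python
import numpy as np

# A hypothetical linear transformation L : R^2 -> R^2 (rotation by 90 degrees).
def L(x):
    return np.array([-x[1], x[0]])

# Build the matrix whose j-th column is L(e_j).
n = 2
A = np.column_stack([L(np.eye(n)[:, j]) for j in range(n)])

# Now L(x) = Ax for every x.
x = np.array([3.0, 7.0])
assert np.allclose(L(x), A @ x)
```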
CHAPTER 7

The chain rule

We will now discover one of the most useful rules for comput-
ing derivatives in multivariable calculus. Suppose that h : Rn →
Rm and g : Rm → Rk , and suppose that
f (x) = g(h(x))
for all x ∈ Rn . Notice that f takes as input a point in Rn and
returns as output a point in Rk . Let ∆x be a small vector in Rn .


We will approximate f (x + ∆x) by using Newton’s approximation
twice, first with h and then with g:
f (x + ∆x) = g(h(x + ∆x))
≈ g(h(x) + h0 (x)∆x)
≈ g(h(x)) + g 0 (h(x))h0 (x)∆x.
In the final step, we used the approximation g(z + ∆z) ≈ g(z) +
g 0 (z)∆z with z = h(x) and ∆z = h0 (x)∆x. Comparing the
approximation
f (x + ∆x) ≈ g(h(x)) + g 0 (h(x))h0 (x)∆x
(in which the first term, g(h(x)), is just f (x))

with Newton’s approximation


f (x + ∆x) ≈ f (x) + f 0 (x)∆x
reveals (or at least suggests) that

f 0 (x) = g 0 (h(x))h0 (x). (7.1)

This fact is known as the chain rule.



It is a triumph of matrix notation that the multivariable


chain rule can be discovered so easily and that it looks identical
to the chain rule in single-variable calculus. Notice that in equa-
tion (7.1) f 0 (x) is a matrix, and on the right we are multiplying
two matrices:
f 0 (x) = g 0 (h(x)) h0 (x) ,
where f 0 (x) is a k × n matrix, g 0 (h(x)) is a k × m matrix, and
h0 (x) is an m × n matrix.

Exercise 7.1. Suppose that a function f : Rn → R tells us


the temperature at each point in Rn . A mosquito is located at a
point x at time t = 0 and is moving with constant velocity vector
v ∈ Rn . So the mosquito’s position at time t is x + tv, and the
mosquito’s temperature at time t is
F (t) = f (x + tv).
a) How rapidly is the mosquito’s temperature changing at
time t = 0?
b) Suppose the temperature is as cold as possible at the
point x (so x is a minimizer for the function f ). What
can we conclude about the value of ∇f (x)?

Solution: Notice that F (t) = f (h(t)) where h : R → Rn is the


function defined by
h(t) = x + tv.
It is straightforward to check that h0 (t) = v. By the chain rule,
F 0 (t) = f 0 (x + tv)h0 (t) = f 0 (x + tv)v
(a 1 × n matrix times an n × 1 vector), so the rate of change of
the mosquito’s temperature at time t = 0 is
F 0 (0) = f 0 (x)v = ⟨∇f (x), v⟩.

This formula ⟨∇f (x), v⟩ is the same directional derivative formula
that we derived in section 4.5.
If x is a minimizer for f , then the function F has a minimum
value at t = 0. So, from single-variable calculus, we know that
F 0 (0) = ⟨∇f (x), v⟩ = 0. But notice that v could be any vector
in Rn . The only way that ⟨∇f (x), v⟩ can be 0 for every vector
v ∈ Rn is if ∇f (x) = 0. So we can conclude that ∇f (x) = 0.
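A numerical spot-check of F 0 (0) = ⟨∇f (x), v⟩, using a hypothetical temperature function and a central finite difference (all of the particular choices here are illustrative):

```python
import numpy as np

# A hypothetical temperature function f : R^2 -> R and its gradient.
def f(x):
    return x[0] ** 2 + 3.0 * x[1] ** 2

def grad_f(x):
    return np.array([2.0 * x[0], 6.0 * x[1]])

x = np.array([1.0, 2.0])
v = np.array([0.5, -1.0])  # the mosquito's velocity

# F(t) = f(x + t v); compare F'(0) = <grad f(x), v> with a finite difference.
h = 1e-6
fd = (f(x + h * v) - f(x - h * v)) / (2.0 * h)
assert np.isclose(fd, np.dot(grad_f(x), v), atol=1e-4)
```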
CHAPTER 8

Minimizing a function

Suppose again that a function f : Rn → R tells us the


temperature at each point x ∈ Rn . (I am visualizing the case
where n = 3.) Imagine that a mosquito who likes cool weather
has found a point x∗ in the shade where the temperature is as
low as possible. This point x∗ is a minimizer for the function
f . If the mosquito were to move a slight bit in the direction of
any given unit vector u, the value of f could not decrease. Thus,
Du f (x∗ ) ≥ 0 for every unit vector u ∈ Rn . Moreover, if u ∈ Rn ,
it is impossible that Du f (x∗ ) > 0, because then the directional
derivative of f in the opposite direction −u would be negative.
Therefore,
Du f (x∗ ) = ⟨∇f (x∗ ), u⟩ = 0 for all u ∈ Rn .
This is only possible if ∇f (x∗ ) = 0. So, in conclusion:

If x∗ ∈ Rn is a minimizer for a function f : Rn → R then


∇f (x∗ ) = 0.

See exercise (7.1) for a slightly different derivation of this fact.


This conclusion holds regardless of whether x∗ is a “global” min-
imizer for f (which means that f (x) ≥ f (x∗ ) for all x ∈ Rn ) or
just a “local” minimizer for f (which means that f (x) ≥ f (x∗ )
for all x in the near vicinity of x∗ ).
This suggests the following strategy for finding a minimizer
of f : we compute the gradient of f , then set the gradient of
f equal to 0 and solve the resulting system of equations for x.
However, we must be careful, because not every point x which

Figure 15. If f (x) = x3 , then f 0 (0) = 0, but 0 is
neither a minimizer nor a maximizer for f .

satisfies ∇f (x) = 0 is a minimizer for f . Indeed, a similar


argument to the one given above shows that if x is a maximizer
for f then ∇f (x) = 0. It is also possible for a point x which
satisfies ∇f (x) = 0 to be neither a maximizer nor a minimizer
for f . For example, if n = 1 and f (x) = x3 , then f 0 (0) = 0 but 0
is neither a minimizer nor a maximizer for f . This is illustrated
in figure 15.
Another example is provided by the function f : R2 → R
defined by f (x1 , x2 ) = x1 x2 . If x1 and x2 are both positive, then
f (x1 , x2 ) > 0. On the other hand, if x1 is positive and x2 is
negative, then f (x1 , x2 ) < 0. Thus, any ball centered at the
origin contains points where f is positive and also points where f
is negative. So the origin is neither a minimizer nor a maximizer
for f .
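The saddle behavior of f (x1 , x2 ) = x1 x2 can be seen numerically; a small sketch:

```python
import numpy as np

# f(x1, x2) = x1 * x2 has gradient (x2, x1), which is zero at the origin...
def f(x):
    return x[0] * x[1]

# ...but the origin is neither a minimizer nor a maximizer: arbitrarily
# close to 0 we find both positive and negative values of f.
eps = 1e-3
assert f(np.array([eps, eps])) > 0.0    # positive nearby
assert f(np.array([eps, -eps])) < 0.0   # negative nearby
assert f(np.zeros(2)) == 0.0
```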
Nevertheless, i
APPENDIX A

Algebra review

Calculus is much easier to learn if computations using high


school algebra are effortless. (If not, that’s ok, calculus provides
a good chance to practice algebra and you can always backtrack
to fill in any gaps in knowledge.) We review a few formulas from
algebra below.

A.1. FOIL
The FOIL (“first-outer-inner-last”) formula states that if
x, y, z and w are numbers then

(x + y)(z + w) = xz + xw + yz + yw.

This rule can be derived by repeated use of the distributive rule


a(b + c) = ab + ac.
In detail, we derive FOIL as follows:
(x + y)(z + w) = (x + y)z + (x + y)w
= xz + yz + xw + yw.
We can use FOIL to expand expressions such as (x + y)2 . If
x and y are numbers, then
(x + y)2 = (x + y)(x + y)
= x2 + xy + yx + y 2
= x2 + 2xy + y 2 .


We can also expand (x − y)2 :


(x − y)2 = (x − y)(x − y)
= x2 − xy − yx + y 2
= x2 − 2xy + y 2 .
Let’s expand (x + y)3 :
(x + y)3 = (x + y)(x + y)2
= (x + y)(x2 + 2xy + y 2 )
= x(x2 + 2xy + y 2 ) + y(x2 + 2xy + y 2 )
= x3 + 2x2 y + xy 2 + yx2 + 2xy 2 + y 3
= x3 + 3x2 y + 3xy 2 + y 3 .

A.2. Difference of squares


The “difference of squares” formula states that if x and y
are numbers then
x2 − y 2 = (x − y)(x + y).
This rule can be proved by applying FOIL to simplify the expression
on the right:
(x − y)(x + y) = x2 + xy − xy − y 2
= x2 − y 2 .
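These identities can be spot-checked with exact integer arithmetic; the ranges below are arbitrary:

```python
# Spot-check FOIL, the square and cube expansions, and the
# difference-of-squares formula on small integers.
for x in range(-3, 4):
    for y in range(-3, 4):
        assert (x + y) ** 2 == x ** 2 + 2 * x * y + y ** 2
        assert (x - y) ** 2 == x ** 2 - 2 * x * y + y ** 2
        assert (x + y) ** 3 == x ** 3 + 3 * x ** 2 * y + 3 * x * y ** 2 + y ** 3
        assert x ** 2 - y ** 2 == (x - y) * (x + y)
```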
APPENDIX B

The equation of a line

Figure 16. Here is a line with slope m that passes
through the point (x0 , y0 ). If (x, y) is a point on
this line, then m = rise/run = (y − y0 )/(x − x0 ).
In this example, (x0 , y0 ) = (2, 2) and m = 2/5.
