calcForMachineLearning
Daniel O’Connor
Contents
Preface
Part 1. Calculus
Part 1
Calculus
CHAPTER 1
Rate of change
that $f(30) = 120$, and suppose also that two seconds later the
odometer reports a total distance traveled of 160 meters, so
that $f(32) = 160$. Then the car has traveled $f(32) - f(30) =
160 - 120 = 40$ meters during the two-second time interval from
time $t_0 = 30$ to time $t = 32$. The average velocity of the car
during this time interval is $(160 - 120)/(32 - 30) = 40/2 = 20$
meters per second.
Now imagine that the time interval from $t_0$ to $t$ is very short,
so that time $t$ is only a split second later than time $t_0$. Then
the average velocity during the time interval from $t_0$ to $t$ is a
good approximation to the instantaneous velocity of the car at
time $t_0$. Even better approximations can be obtained by taking
$t$ to be closer and closer to $t_0$. In fact, we can approximate the
instantaneous velocity as closely as we like by taking $t$ to be
sufficiently close to $t_0$. We express this fact concisely by saying
that the instantaneous velocity at time $t_0$ is equal to the limit
as $t$ approaches $t_0$ of the average velocity $(f(t) - f(t_0))/(t - t_0)$.
To save writing, the instantaneous velocity at time $t_0$ is denoted
$f'(t_0)$. In summary:
\[
f'(t_0) = \lim_{t \to t_0} \frac{f(t) - f(t_0)}{t - t_0}. \tag{1.1}
\]
\[
f'(t_0) \approx \frac{\Delta f}{\Delta t},
\]
$t$      $f(t)$      $\frac{f(t) - f(t_0)}{t - t_0}$
\[
f'(x_0) = \lim_{x \to x_0} \frac{f(x) - f(x_0)}{x - x_0}. \tag{1.2}
\]
$x \neq x_0$ then
\begin{align*}
\frac{f(x) - f(x_0)}{x - x_0} &= \frac{x^2 - x_0^2}{x - x_0} \\
&= \frac{(x - x_0)(x + x_0)}{x - x_0} \\
&= x + x_0.
\end{align*}
Clearly, $x + x_0$ approaches $x_0 + x_0 = 2x_0$ as $x$ approaches $x_0$.
This shows that
\[
f'(x_0) = 2x_0. \tag{1.3}
\]
For example, $f'(2) = 2 \cdot 2 = 4$. This confirms the result that we
observed in table 1.
One of the main goals of a calculus course is to derive a large
number of rules like this for computing the derivatives of specific
functions. We will derive more such rules in section 2.
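The limit in (1.2) can also be watched numerically. The following is a minimal sketch in Python, assuming $f(x) = x^2$ and $x_0 = 2$ as in the computation above; the helper name `difference_quotient` is ours and does not come from the text.

```python
def f(x):
    # Assumed example function, matching the derivation above: f(x) = x^2.
    return x ** 2

def difference_quotient(f, x0, x):
    """Slope of the secant line through (x0, f(x0)) and (x, f(x))."""
    return (f(x) - f(x0)) / (x - x0)

x0 = 2.0
for dx in [0.1, 0.01, 0.001, 0.0001]:
    q = difference_quotient(f, x0, x0 + dx)
    print(f"x = {x0 + dx:.4f}   quotient = {q:.6f}")
```

For this particular $f$ the quotient equals $x + x_0$, so the printed values should drift toward $2x_0 = 4$ as $x$ gets closer to $x_0$, up to floating-point rounding.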
Exercise 1.2. Derive formulas for the derivatives of the
functions $x$, $x^3$, and $1/x$.
The quantity on the right is none other than the derivative $f'(x_0)$.
We have found our geometric interpretation of the derivative:
approaches 0:
\[
\lim_{x \nearrow x_0} \frac{f(x) - f(x_0)}{x - x_0} = 0.
\]
The fact that the slope of the secant line approaches different
values depending on whether $x$ approaches $x_0$ from the right or
from the left means that in this example $\frac{f(x) - f(x_0)}{x - x_0}$
simply does not have a unique limit as $x$ approaches $x_0$:
\[
\lim_{x \to x_0} \frac{f(x) - f(x_0)}{x - x_0} \text{ does not exist.}
\]
The function $f$ does not have a derivative at 0.
Imagine that a car is driving along at 10 meters per second,
and then at time t = 30 the car’s velocity magically jumps to 20
meters per second (perhaps due to a glitch in the matrix). What
is the car’s instantaneous velocity at time t = 30? There is no
correct answer. Both values 10 meters per second and 20 meters
per second would be equally valid. The car simply does not have
a velocity at that instant in time.
When we make the statement
\[
\lim_{x \to x_0} \frac{f(x) - f(x_0)}{x - x_0} = L,
\]
we insist that $\frac{f(x) - f(x_0)}{x - x_0}$ approaches the same limiting
value $L$ no matter what path $x$ follows as $x$ approaches $x_0$. If that
is not the case, then the statement is not true, and $f'(x_0)$ simply does
not exist.
Before we compute the derivative of a function, we should be
careful to first check that the derivative exists. But don’t worry,
most functions we encounter have perfectly smooth graphs, with
no sharp corners where the tangent line is not well defined.
Here is a bit more terminology. The process of computing
the derivative of a function is called “differentiation”. If f has
a derivative at x0 , then f is said to be “differentiable” at x0 .
We have seen in this section that the ramp function is not
differentiable at 0. However, the ramp function is differentiable
everywhere else.
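The two one-sided limits can be compared numerically. This is only a sketch: the definition of the ramp function is not shown in this passage, so the code assumes the usual one, equal to 0 for negative inputs and to $x$ otherwise, and the helper names are ours.

```python
def ramp(x):
    # Assumed definition of the ramp function: 0 for x < 0, x for x >= 0.
    return max(0.0, x)

def difference_quotient(f, x0, x):
    """Slope of the secant line through (x0, f(x0)) and (x, f(x))."""
    return (f(x) - f(x0)) / (x - x0)

x0 = 0.0
for h in [0.1, 0.01, 0.001]:
    from_right = difference_quotient(ramp, x0, x0 + h)
    from_left = difference_quotient(ramp, x0, x0 - h)
    # Python may display the left-hand slope as -0.0, which is equal to 0.
    print(f"h = {h}: slope from the right = {from_right}, from the left = {from_left}")
```

Under that assumption the slopes from the right stay at 1 while the slopes from the left stay at 0, so no single number can serve as the limit at 0.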
1.3. SOMETIMES THE DERIVATIVE DOES NOT EXIST
In words:
Example 2.1. If
\[
f(x) = \underbrace{1}_{g(x)} + \underbrace{x^2}_{h(x)}
\]
then
\[
f'(x_0) = \underbrace{0}_{g'(x_0)} + \underbrace{2x_0}_{h'(x_0)} = 2x_0.
\]
It follows that
\[
f'(x_0) = c\,g'(x_0).
\]
Example 2.2. If
\[
f(x) = \overbrace{5}^{c}\,\underbrace{x^2}_{g(x)},
\]
then
\[
f'(x_0) = \overbrace{5}^{c} \cdot 2x_0 = 10x_0.
\]
It follows that
\[
\frac{g(x)h(x) - g(x_0)h(x_0)}{x - x_0}
\approx g'(x_0)h(x_0) + g(x_0)h'(x_0)
+ \underbrace{g'(x_0)h'(x_0)(x - x_0)}_{\text{approaches } 0}.
\]
Exercise 2.4. Use the product rule and the result of the
previous exercise to compute the derivative of the function $f(x) =
x^4$. Conjecture a formula for the derivative of $f(x) = x^n$, where
$n$ is a nonnegative integer.
\[
f'(x_0) = \frac{0 \cdot x_0^m - 1 \cdot (m x_0^{m-1})}{\underbrace{(x_0^m)^2}_{h(x_0)^2}}
= -m x_0^{-m-1}.
\]
This shows that the formula (2.2) holds also when n is negative.
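The negative-exponent case can be sanity-checked numerically. The sketch below assumes $f(x) = 1/x^m$ with the illustrative choices $m = 3$ and $x_0 = 2$ (these values are ours), and it uses a symmetric difference quotient rather than the one-sided quotient from chapter 1, purely for better numerical accuracy.

```python
def numerical_derivative(f, x0, h=1e-6):
    """Symmetric difference quotient, an approximation to f'(x0)."""
    return (f(x0 + h) - f(x0 - h)) / (2 * h)

m = 3        # illustrative exponent, not taken from the text
x0 = 2.0     # illustrative point, not taken from the text

def f(x):
    return 1 / x ** m            # f(x) = x^(-m)

approx = numerical_derivative(f, x0)
exact = -m * x0 ** (-m - 1)      # the formula -m * x0^(-m-1)
print(approx, exact)
```

The two printed numbers should agree to many decimal places, which is consistent with the power rule holding for negative exponents.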
2. FORMULAS FOR THE DERIVATIVE
2.6. Derivative of $b^x$
Suppose that $f(x) = b^x$, where $b > 1$ is a number. If $x \neq x_0$,
then
\[
f'(x_0) = \lim_{x \to x_0} \frac{f(x) - f(x_0)}{x - x_0}
= b^{x_0} \underbrace{\lim_{x \to x_0} \frac{b^{x - x_0} - 1}{x - x_0}}_{\text{annoying number}}.
\]
$\Delta x$      $(1 + \Delta x)^{1/\Delta x}$
0.1       2.5937424
0.01      2.7048138
0.001     2.7169239
0.0001    2.7181459
Table 1. As $\Delta x$ approaches 0, the quantity
$(1 + \Delta x)^{1/\Delta x}$ approaches $e \approx 2.718$.
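The entries of this table are easy to reproduce. Here is a short Python sketch that evaluates $(1 + \Delta x)^{1/\Delta x}$ for the same four values of $\Delta x$; the variable names are ours.

```python
import math

# Each entry pairs a step size dx with the value (1 + dx)^(1/dx).
table = [(dx, (1 + dx) ** (1 / dx)) for dx in [0.1, 0.01, 0.001, 0.0001]]

for dx, value in table:
    print(f"{dx:<8} {value:.7f}")

print("e =", math.e)
```

Trying still smaller values of $\Delta x$ is a natural experiment, although for extremely small values floating-point rounding eventually spoils the result.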
\[
f'(x_0) = e^{x_0}.
\]
\[
S'(x_0) = \frac{e^{x_0}}{1 + e^{x_0}} \cdot \frac{1}{1 + e^{x_0}} = S(x_0)(1 - S(x_0)).
\]
This formula is convenient because it shows that if we have
already evaluated $S(x_0)$, then hardly any additional arithmetic
operations are required to evaluate $S'(x_0)$. We can compute
$S'(x_0)$ very efficiently.
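A small Python sketch makes the point about reuse concrete. It assumes $S(x) = e^x/(1 + e^x)$, which is what the factors in the formula above suggest; the definition of $S$ itself is not shown in this passage, and the test point $x_0 = 0.5$ is our choice.

```python
import math

def S(x):
    # Assumed definition, consistent with the formula above: e^x / (1 + e^x).
    return math.exp(x) / (1 + math.exp(x))

def S_prime(x):
    s = S(x)              # evaluate S once ...
    return s * (1 - s)    # ... then one subtraction and one multiplication

x0 = 0.5                  # illustrative point
h = 1e-6
numeric = (S(x0 + h) - S(x0 - h)) / (2 * h)   # symmetric difference quotient
print(S_prime(x0), numeric)
```

The two printed values should agree closely, and `S_prime` never calls `math.exp` beyond the single evaluation of `S`.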
\begin{align*}
& e^{f(x)} f'(x) = 1 \\
\implies{} & x f'(x) = 1 \\
\implies{} & f'(x) = \frac{1}{x}.
\end{align*}
2.8. THE POWER RULE
Simplifying, we obtain
\[
\sqrt{50} \approx 7 + \frac{1}{14} \approx 7.07142.
\]
The true value of $\sqrt{50}$ is $7.07106\ldots$, so the approximation is not
bad.
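The arithmetic is quick to verify. The sketch below assumes the estimate comes from the linear approximation $f(x) \approx f(x_0) + f'(x_0)(x - x_0)$ with $f(x) = \sqrt{x}$ and $x_0 = 49$, which is the reading suggested by the numbers $7$ and $1/14$; the set-up itself is not shown in this passage.

```python
import math

x0, x = 49.0, 50.0

# Linear approximation: sqrt(x0) + (x - x0) / (2 * sqrt(x0)), that is 7 + 1/14.
approx = math.sqrt(x0) + (x - x0) / (2 * math.sqrt(x0))
exact = math.sqrt(x)

print(approx, exact, approx - exact)
```

The difference between the two values is a few ten-thousandths, in line with the comparison of 7.07142 and 7.07106 above.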
Part 2
Richard Feynman
follows:
\[
\langle x, y \rangle = x_1 y_1 + x_2 y_2 + \cdots + x_n y_n. \tag{3.1}
\]
\[
x \text{ is orthogonal to } y \iff \langle x, y \rangle = 0.
\]
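Formula (3.1) and the orthogonality test translate directly into code. This is a minimal Python sketch using plain lists; the function name `dot` and the example vectors are ours.

```python
def dot(x, y):
    """Dot product <x, y> = x1*y1 + ... + xn*yn of two equal-length lists."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    return sum(xi * yi for xi, yi in zip(x, y))

print(dot([1, 2, 3], [4, 5, 6]))   # 1*4 + 2*5 + 3*6 = 32
print(dot([1, 2], [-2, 1]) == 0)   # True: these two vectors are orthogonal
```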
to be as small as possible.
This example shows how a goal of making accurate predic-
tions naturally leads us to the goal of minimizing an error function
such as f that takes a list of numbers as input. Many of the most
popular machine learning algorithms, such as neural networks,
are variations of this simple, classic idea.
the temperature is
\[
\frac{\text{change in temperature}}{\text{distance traveled}}
= \frac{f(x + tu) - f(x)}{t} \ \frac{\text{degrees}}{\text{meter}}.
\]
If we select values of $t$ that are shorter and shorter, and approaching
0, this average rate of change approaches the instantaneous
rate of change of $f$ in the direction $u$. This instantaneous rate of
change is denoted $D_u f(x)$:
\[
D_u f(x) = \lim_{t \to 0} \frac{f(x + tu) - f(x)}{t}. \tag{4.2}
\]
The number $D_u f(x)$ is called the “directional derivative” of $f$ at
$x$ in the direction $u$.
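Exercise 4.2. Let $f : \mathbb{R}^2 \to \mathbb{R}$ be defined by
\[
f(x_1, x_2) = x_1 x_2.
\]
Compute $D_u f(3, 2)$, where $u$ is the unit vector $(1/\sqrt{2}, 1/\sqrt{2})$.
Solution:
\begin{align*}
D_u f(3, 2) &= \lim_{t \to 0} \frac{(3 + t/\sqrt{2})(2 + t/\sqrt{2}) - 6}{t} \\
&= \lim_{t \to 0} \frac{6 + 5t/\sqrt{2} + t^2/2 - 6}{t} \\
&= \lim_{t \to 0} \left( 5/\sqrt{2} + t/2 \right) \\
&= 5/\sqrt{2}.
\end{align*}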
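The solution can be cross-checked numerically by evaluating the quotient in (4.2) for a few small values of $t$. This Python sketch uses the function and direction from the exercise; the helper name `directional_quotient` is ours.

```python
import math

def f(x1, x2):
    return x1 * x2

def directional_quotient(f, x, u, t):
    """Average rate of change of f from x to x + t*u (two-variable case)."""
    return (f(x[0] + t * u[0], x[1] + t * u[1]) - f(*x)) / t

x = (3.0, 2.0)
u = (1 / math.sqrt(2), 1 / math.sqrt(2))

for t in [0.1, 0.01, 0.001]:
    print(t, directional_quotient(f, x, u, t))

print("5/sqrt(2) =", 5 / math.sqrt(2))
```

By the algebra in the solution, each quotient equals $5/\sqrt{2} + t/2$, so the printed values should close in on $5/\sqrt{2}$ as $t$ shrinks.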
and
\[
D_u f(x) = \lim_{t \to 0} \frac{f(x_1 + t, x_2, \ldots, x_n) - f(x_1, x_2, \ldots, x_n)}{t}. \tag{4.3}
\]
The quantity on the right is the derivative of the function
$g : \mathbb{R} \to \mathbb{R}$ defined by
\[
g(x_1) = f(x_1, x_2, \ldots, x_n). \tag{4.4}
\]
(When defining $g$, we are thinking of the numbers $x_2, \ldots, x_n$ as
being held fixed, whereas $x_1$ can be any real number.) In other
words,
\[
D_u f(x) = g'(x_1).
\]
This is nice because $g$ is a function of a single variable, which
means that $g'(x_1)$ can be computed using the arsenal of formulas
for derivatives that we derived in chapter 2.
When $u = (1, 0, \ldots, 0)$, an alternative notation for the
directional derivative $D_u f(x)$ is $D_1 f(x)$. Likewise, in the special
case where the vector $u$ has a 1 in the $i$th position and zeros
elsewhere, an alternative notation for the directional derivative
$D_u f(x)$ is $D_i f(x)$. These numbers $D_i f(x)$ (for $i = 1, \ldots, n$) are
called the partial derivatives of $f$ at $x$. Computing $D_i f(x)$ is
easy for the same reason that computing $D_1 f(x)$ is easy:
Note that another very common notation for $D_i f(x)$ that you
will see in other books is $\frac{\partial f(x)}{\partial x_i}$.
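The idea of holding every coordinate fixed except one can be sketched in Python. The example function below is hypothetical (it does not come from the text), the index `i` is 0-based as is usual in Python whereas the text counts from 1, and a symmetric difference quotient stands in for the exact limit.

```python
def partial(f, x, i, h=1e-6):
    """Approximate D_i f(x) by varying only coordinate i (0-based index)."""
    x_plus = list(x)
    x_minus = list(x)
    x_plus[i] += h
    x_minus[i] -= h
    return (f(x_plus) - f(x_minus)) / (2 * h)

def f(x):
    # Hypothetical example function: f(x1, x2) = x1^2 * x2 + x2.
    return x[0] ** 2 * x[1] + x[1]

x = [3.0, 2.0]
print(partial(f, x, 0))   # close to 2 * x1 * x2 = 12
print(partial(f, x, 1))   # close to x1^2 + 1 = 10
```

The exact values in the comments come from differentiating with respect to one variable while treating the other as a constant, using the single-variable rules from chapter 2.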
where $\Delta f_1$ is the amount that $f$ changes when its input changes
from point $A$ to point $B = (x_1 + \Delta x_1, x_2)$, and $\Delta f_2$ is the amount
that $f$ changes when its input changes from point $B$ to point $C$.
This is illustrated in figure 14.
According to equation (4.5), if $\Delta x_1$ and $\Delta x_2$ are small
numbers then
\[
\Delta f_1 \approx D_1 f(x_1, x_2) \Delta x_1
\]
and
\begin{align*}
\Delta f_2 &\approx D_2 f(x_1 + \Delta x_1, x_2) \Delta x_2 \\
&\approx D_2 f(x_1, x_2) \Delta x_2.
\end{align*}
Putting these pieces together, we find that
\[
\Delta f \approx D_1 f(x_1, x_2) \Delta x_1 + D_2 f(x_1, x_2) \Delta x_2. \tag{4.6}
\]
In words, as you move from $A$ to $B$ to $C$, the value of $f$ changes
first by $D_1 f(x) \Delta x_1$ and then by $D_2 f(x) \Delta x_2$.
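Approximation (4.6) can be tested on a concrete case. The Python sketch below assumes the example function $f(x_1, x_2) = x_1 x_2$, for which $D_1 f = x_2$ and $D_2 f = x_1$; the point and the step sizes are illustrative choices of ours.

```python
def f(x1, x2):
    return x1 * x2

x1, x2 = 3.0, 2.0        # illustrative point
dx1, dx2 = 0.01, -0.02   # illustrative small changes in the input

D1 = x2   # D1 f(x1, x2) for f = x1 * x2
D2 = x1   # D2 f(x1, x2) for f = x1 * x2

actual = f(x1 + dx1, x2 + dx2) - f(x1, x2)
estimate = D1 * dx1 + D2 * dx2
print(actual, estimate, actual - estimate)
```

For this $f$ the gap between the true change and the estimate works out to exactly $\Delta x_1 \Delta x_2$, a product of two small numbers, which is why the estimate improves as the steps shrink.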
The expression on the right in (4.6) looks like the dot product
of two vectors. This suggests that equation (4.6) can be written
4.6. THE DIRECTION OF STEEPEST ASCENT
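instead of horizontally:
\begin{align*}
f(x + \Delta x)
&= \begin{bmatrix} f_1(x + \Delta x) \\ f_2(x + \Delta x) \\ f_3(x + \Delta x) \end{bmatrix} \\
&\approx \begin{bmatrix}
f_1(x) + D_1 f_1(x) \Delta x_1 + D_2 f_1(x) \Delta x_2 + D_3 f_1(x) \Delta x_3 \\
f_2(x) + D_1 f_2(x) \Delta x_1 + D_2 f_2(x) \Delta x_2 + D_3 f_2(x) \Delta x_3 \\
f_3(x) + D_1 f_3(x) \Delta x_1 + D_2 f_3(x) \Delta x_2 + D_3 f_3(x) \Delta x_3
\end{bmatrix} \\
&= \underbrace{\begin{bmatrix} f_1(x) \\ f_2(x) \\ f_3(x) \end{bmatrix}}_{f(x)}
+ \underbrace{\begin{bmatrix}
D_1 f_1(x) \Delta x_1 + D_2 f_1(x) \Delta x_2 + D_3 f_1(x) \Delta x_3 \\
D_1 f_2(x) \Delta x_1 + D_2 f_2(x) \Delta x_2 + D_3 f_2(x) \Delta x_3 \\
D_1 f_3(x) \Delta x_1 + D_2 f_3(x) \Delta x_2 + D_3 f_3(x) \Delta x_3
\end{bmatrix}}_{\text{Need concise notation for this.}}.
\end{align*}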
Matrix multiplication
is “multiplied” by a vector
\[
x = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix} \in \mathbb{R}^n
\]
\[
Ax = x_1 a_1 + x_2 a_2 + \cdots + x_n a_n. \tag{6.3}
\]
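Formula (6.3) describes $Ax$ as a combination of the columns of $A$. The Python sketch below compares that column view with the entry-by-entry row view, storing a matrix as a list of rows; the function names and the example matrix are ours.

```python
def matvec_by_rows(A, x):
    """The ith entry of Ax is the dot product of row i of A with x."""
    return [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) for row in A]

def matvec_by_columns(A, x):
    """Ax = x1*a1 + ... + xn*an, where a_j is the jth column of A."""
    m, n = len(A), len(x)
    result = [0] * m
    for j in range(n):
        column_j = [A[i][j] for i in range(m)]
        result = [r + x[j] * c for r, c in zip(result, column_j)]
    return result

A = [[1, 2],
     [3, 4],
     [5, 6]]
x = [10, -1]

print(matvec_by_rows(A, x))
print(matvec_by_columns(A, x))
```

Both calls return the same vector, here $10 \cdot (1, 3, 5) - 1 \cdot (2, 4, 6)$.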
\[
B(Ax) = Mx
\]
\[
C(BA) = (CB)A,
\]
\[
(Ax)^T = x^T A^T.
\]
\begin{align*}
\langle Ax, y \rangle &= (Ax)^T y \\
&= (x^T A^T) y \\
&= x^T (A^T y) \\
&= \langle x, A^T y \rangle.
\end{align*}
\[
(BA)^T = A^T B^T.
\]
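The transpose identities above can be spot-checked on small integer matrices. This is an illustrative Python sketch with matrices stored as lists of rows; the helper names and the example data are ours, and a check on one example is of course not a proof.

```python
def transpose(A):
    return [list(col) for col in zip(*A)]

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def matvec(A, x):
    return [dot(row, x) for row in A]

def matmul(A, B):
    """Product AB of an m x n matrix A and an n x k matrix B."""
    Bt = transpose(B)
    return [[dot(row, col) for col in Bt] for row in A]

A = [[1, 2], [3, 4], [5, 6]]      # 3 x 2
B = [[1, 0, 2], [-1, 3, 1]]       # 2 x 3
x = [1, -2]                       # a vector in R^2
y = [3, 0, 1]                     # a vector in R^3

# <Ax, y> = <x, A^T y>
print(dot(matvec(A, x), y) == dot(x, matvec(transpose(A), y)))        # True
# (BA)^T = A^T B^T
print(transpose(matmul(B, A)) == matmul(transpose(A), transpose(B)))  # True
```

Integer entries are used so that the comparisons are exact and not affected by floating-point rounding.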
Note that we can only add together matrices which have the
same shape. We could not, for example, add a row vector to a
column vector.
Exercise 6.7. Suppose that $x \in \mathbb{R}^n$. Explain why
\[
(A + B)x = Ax + Bx.
\]
Solution: It’s probably more clear to check this for yourself than
to read my explanation. Nevertheless, let $a_i^T$ and $b_i^T$ be the $i$th
rows of $A$ and $B$, respectively. Then the $i$th row of $A + B$ is
$a_i^T + b_i^T$, and the $i$th entry of $(A + B)x$ is
$(a_i^T + b_i^T)x = a_i^T x + b_i^T x$.
But this is the sum of the $i$th entries of $Ax$ and $Bx$.
Exercise 6.8. Suppose $C$ is an $n \times k$ matrix. Explain why
\[
(A + B)C = AC + BC.
\]
We will now discover one of the most useful rules for computing
derivatives in multivariable calculus. Suppose that
$h : \mathbb{R}^n \to \mathbb{R}^m$ and $g : \mathbb{R}^m \to \mathbb{R}^k$, and suppose that
\[
f(x) = g(h(x))
\]
for all $x \in \mathbb{R}^n$. Notice that $f$ takes as input a point in $\mathbb{R}^n$ and
Minimizing a function
Algebra review
A.1. FOIL
The FOIL (“first-outer-inner-last”) formula states that if
x, y, z and w are numbers then
(x + y)(z + w) = xz + xw + yz + yw.