
AMCS 215

Mathematical Foundations of Machine Learning

Linear Algebra Foundations


Vectors, Matrices, Least Squares
Vectors

a vector is an ordered list of numbers


if a is a vector, ai is its ith entry
notation warning: ai can also refer to ith vector in a list of vectors
block vector
stacked concatenation of vectors

special vectors: 0, 1, ei
Examples

features: xi is the value of ith feature or attribute of an entity


customer purchase: xi is the total $ purchase of product i by a customer
over some period
portfolio: entries give shares (or $ value or fraction) held in each of n
assets, with negative meaning short positions
word count: xi is the number of times word i appears in a document
Word count vectors

a short document

a small dictionary (left) and word count vector (right)

dictionaries used in practice are much larger


Operations on vectors

addition: commutative, associative


subtraction
scalar-vector multiplication: associative, left distributive, right distributive
linear combinations
Inner product

inner product (or dot product) of two vectors a and b is

aT b = a1 b1 + a2 b2 + · · · + an bn

other notation used: < a, b >, a · b


properties
aT b = bT a
(αa)T b = α(aT b)
(a + b)T c = aT c + bT c
general examples
eTi a = ai , picks out ith entry
1T a, sum of entries
aT a, sum of squares of entries
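
As a quick numerical check of these identities in numpy (the vectors below are arbitrary examples, not from the slides):

import numpy as np

a = np.array([1.0, 2.0, -1.0])
b = np.array([3.0, 0.0, 4.0])

print(a @ b)              # inner product a^T b = 3 + 0 - 4 = -1
print(a @ b == b @ a)     # symmetry: a^T b = b^T a
print(np.ones(3) @ a)     # 1^T a, sum of entries
print(a @ a)              # a^T a, sum of squares of entries
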
Complexity

computers store (real) numbers in floating-point format


basic arithmetic operations (addition, multiplication, . . . ) are called
floating point operations or flops
complexity of an algorithm or operation: total number of flops needed, as
function of the input dimension(s)
this can be very grossly approximated
crude approximation of time to execute: (flops needed)/(computer speed)
current computers are around 10 Gflop/sec (1 Gflop/sec = 10^9 flop/sec)
but this can vary by factor of 100
Linear functions

f : Rn → R means a function mapping vectors to scalars


a function is called linear if it satisfies the superposition property

f (αx + βy) = αf (x) + βf (y) ∀α, β, x, y

the inner product function f (x) = aT x is linear


... and all linear functions are inner products
a function that is linear plus a constant is called affine
in linear regression, the affine function (of x)

ŷ = w^T x + b = θ^T x̃     (with θ = (b, w) and x̃ = (1, x))

is used to produce the scalar prediction ŷ of some actual outcome


Norm

the Euclidean norm (or just norm) of a vector x ∈ Rn is

∥x∥ = (x1^2 + · · · + xn^2)^{1/2} = (x^T x)^{1/2}

used to measure the size of a vector


properties
homogeneity: ∥αx∥ = |α|∥x∥
triangle inequality: ∥x + y∥ ≤ ∥x∥ + ∥y∥
nonnegativity: ∥x∥ ≥ 0
definiteness: ∥x∥ = 0 only if x = 0
norm of block vectors

∥(a, b, c)∥ = ∥(∥a∥, ∥b∥, ∥c∥)∥


RMS value

mean-square value of a vector x is

(x1^2 + · · · + xn^2)/n = ∥x∥^2/n

root-mean-square value (RMS value) is

rms(x) = ((x1^2 + · · · + xn^2)/n)^{1/2} = ∥x∥/√n

rms(x) gives “typical” value of |xi |


rms(1) = 1 (independent of n)
RMS value useful for comparing sizes of vectors of different lengths
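
A small numpy sketch of the norm and RMS value (the example vector is made up):

import numpy as np

x = np.array([1.0, -2.0, 2.0, 4.0])
norm = np.linalg.norm(x)              # Euclidean norm, here (1 + 4 + 4 + 16)^{1/2} = 5
rms = norm / np.sqrt(len(x))          # RMS value, norm/sqrt(n) = 2.5
print(norm, rms)
print(np.linalg.norm(np.ones(10)) / np.sqrt(10))   # rms(1) = 1, for any n
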
Chebyshev inequality

suppose that k of the numbers |x1 |, . . . , |xn | are ≥ a


then k of the numbers x1^2, . . . , xn^2 are ≥ a^2
so ∥x∥^2 ≥ ka^2
so we have k ≤ ∥x∥^2/a^2
number of xi with |xi| ≥ a is no more than ∥x∥^2/a^2
this is the Chebyshev inequality
in terms of RMS value:

fraction of entries with |xi| ≥ a is no more than (rms(x)/a)^2

example: no more than 4% of entries can satisfy |xi| ≥ 5 rms(x)
Triangle inequality

(Euclidean) distance between two vectors; dist(a, b) = ∥a − b∥


triangle with vertices at a, b, c
edge lengths are ∥a − b∥, ∥b − c∥, ∥a − c∥
by triangle inequality

∥a − c∥ = ∥(a − b) + (b − c)∥ ≤ ∥a − b∥ + ∥b − c∥

i.e., third edge length is no longer than sum of other two


Feature distance and nearest neighbors

if x and y are feature vectors, ∥x − y∥ is the feature distance


if z1 , . . . , zm is a list of vectors, zj is the nearest neighbor of x if

∥x − zj ∥ ≤ ∥x − zi ∥, i = 1, . . . , m

these simple ideas are very widely used

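
A minimal nearest-neighbor sketch in numpy (the vectors z1, . . . , zm and x are made-up examples):

import numpy as np

z = np.array([[2.0, 1.0], [7.0, 2.0], [5.5, 4.0], [4.0, 8.0]])  # the list z_1, ..., z_m as rows
x = np.array([5.0, 6.0])

dists = np.linalg.norm(z - x, axis=1)   # ||x - z_i|| for each i
j = np.argmin(dists)                    # index of the nearest neighbor of x
print(j, z[j], dists[j])
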

Document dissimilarity

5 Wikipedia articles: ‘Veterans Day’, ‘Memorial Day’, ‘Academy Awards’,


‘Golden Globe Awards’, ‘Super Bowl’
word count histograms, dictionary of 4423 words
pairwise distances shown below
Standard deviation

avg(x) = 1T x/n
de-meaned vector is x̃ = x − avg(x)1 (so avg(x̃) = 0)
standard deviation of x
std(x) = rms(x̃) = ∥x − (1^T x/n)1∥/√n

std(x) gives “typical amount” xi vary from avg(x)


std(x) = 0 only if x = α1 for some α
a basic formula

rms(x)^2 = avg(x)^2 + std(x)^2

standardization (µ = 0, σ = 1, z-scores)

z = (1/std(x)) (x − avg(x)1)
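
In numpy, de-meaning and standardization are short one-liners; a sketch with an arbitrary vector (numpy's default std matches the definition above, i.e., the rms of the de-meaned vector):

import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0])
avg = x.mean()                 # avg(x)
std = x.std()                  # std(x) = rms of the de-meaned vector
z = (x - avg) / std            # standardized (z-scored) vector
print(avg, std)
print(z.mean(), z.std())       # approximately 0 and 1
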
Chebyshev inequality for standard deviation

rough idea: most entries of x are not too far from the mean
by Chebyshev inequality, fraction of entries of x with

|xi − avg(x)| ≥ α std(x)

is no more than 1/α^2 (for α > 1)


consider a time series that represents investment returns over different
periods, with mean 8% and standard deviation 3%
loss (xi ≤ 0) can occur in no more than (3/8)^2 ≈ 14.1% of periods
Cauchy–Schwarz inequality

for two vectors a and b, |aT b| ≤ ∥a∥∥b∥


written out,

|a1 b1 + · · · + an bn| ≤ (a1^2 + · · · + an^2)^{1/2} (b1^2 + · · · + bn^2)^{1/2}

now we can show the triangle inequality:

∥a + b∥^2 = ∥a∥^2 + 2a^T b + ∥b∥^2

          ≤ ∥a∥^2 + 2∥a∥∥b∥ + ∥b∥^2
          = (∥a∥ + ∥b∥)^2
Derivation of Cauchy–Schwarz inequality

clearly true if either a or b is zero


so assume α = ∥a∥ and β = ∥b∥ are nonzero
we have

0 ≤ ∥βa − αb∥^2
  = ∥βa∥^2 − 2(βa)^T (αb) + ∥αb∥^2
  = β^2 ∥a∥^2 − 2βα(a^T b) + α^2 ∥b∥^2
  = 2∥a∥^2 ∥b∥^2 − 2∥a∥∥b∥(a^T b)

divide by 2∥a∥∥b∥ to get aT b ≤ ∥a∥∥b∥


apply to −a, b to get other half of Cauchy–Schwarz inequality
Angle

angle between two nonzero vectors a, b defined as


∠(a, b) = arccos( a^T b / (∥a∥∥b∥) )

∠(a, b) is the number in [0, π] that satisfies


aT b = ∥a∥∥b∥ cos(∠(a, b))

coincides with ordinary angle between vectors in 2-D and 3-D; measures
distance along sphere
Correlation coefficient

vectors a and b, and de-meaned vectors ã and b̃


correlation coefficient between a and b

ρ = ã^T b̃ / (∥ã∥∥b̃∥)

ρ = cos(∠(ã, b̃))
ρ = 0: uncorrelated
ρ > 0.8 (or so): highly correlated
ρ < −0.8 (or so): highly anti-correlated

very roughly: highly correlated means ai and bi are typically both above
(below) their means together
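
A short sketch of the definition in numpy, checked against numpy's built-in (the vectors and the helper name corr are illustrative):

import numpy as np

def corr(a, b):
    # correlation coefficient, as defined above
    at = a - a.mean()          # de-meaned vectors
    bt = b - b.mean()
    return (at @ bt) / (np.linalg.norm(at) * np.linalg.norm(bt))

a = np.array([1.0, 2.0, 3.0, 4.0])
b = np.array([2.0, 1.0, 4.0, 5.0])
print(corr(a, b))                  # about 0.85: highly correlated
print(np.corrcoef(a, b)[0, 1])     # numpy's built-in gives the same value
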
Examples

highly correlated vectors


rainfall time series at nearby locations
daily returns of similar companies in same industry
word count vectors of closely related documents (e.g., same author, topic,
etc.)
sales of shoes and socks (at different locations or periods)
approximately uncorrelated vectors
unrelated vectors
audio signals
Clustering application

given N vectors, x1 , · · · , xN
goal: partition (cluster) into k groups
want vectors in the same group to be close to each other
Example settings

topic discovery and document classification


xi is word count histogram for document i
patient clustering
xi are patient attributes, test results, symptoms
customer market segmentation
xi is purchase history and other attributes of customer i
financial sectors
xi are vectors of financial attributes of company i
Clustering objective

Gj ⊂ {1, · · · , N } is group j, for j = 1, · · · , k


ci is group that xi is in: i ∈ Gci
group representatives: vectors z1 , · · · , zk
clustering objective is

J = (1/N) Σ_{i=1}^N ∥xi − z_{ci}∥^2

mean square distance from vectors to associated representative


J small means good clustering
goal: choose clustering ci and representatives zj to minimize J
Partitioning the vectors given the representatives

suppose representatives z1 , . . . , zk are given


how do we assign the vectors to groups, i.e., choose c1 , · · · , cN

ci only appears in term ∥xi − zci ∥2 in J


to minimize over ci , choose ci so ∥xi − zci ∥2 = minj ∥xi − zj ∥2
i.e., assign each vector to its nearest representative
Choosing representatives given the partition

given the partition G1 , · · · , Gk , how do we choose representatives


z1 , · · · , zk to minimize J?
J splits into a sum of k sums, one for each zj :

J = J1 + · · · + Jk,    Jj = (1/N) Σ_{i∈Gj} ∥xi − zj∥^2

so we choose zj to minimize mean square distance to the points in its


partition
this is the mean (or average or centroid) of the points in the partition:

zj = (1/|Gj|) Σ_{i∈Gj} xi
k-means algorithm

alternate between updating the partition, then the representatives


a famous algorithm called k-means
objective J decreases in each step

given x1, . . . , xN and z1, . . . , zk
repeat
    update partition: assign i to Gj, j = argmin_{j′} ∥xi − z_{j′}∥^2
    update centroids: zj = (1/|Gj|) Σ_{i∈Gj} xi
until z1, . . . , zk stop changing
import numpy as np
import numpy.linalg as la

def kmeans(x, k, maxiters=30, tol=1e-4):
    N, d = x.shape
    distances = np.zeros(N)                          # distance of each x[i] to its nearest representative
    initial = np.random.choice(N, k, replace=False)  # indices of the initial group representatives
    reps = x[initial, :]                             # representatives z_1, ..., z_k
    assignment = np.zeros(N, dtype=int)              # assignment of vectors to groups
    Jprev = np.inf
    for _ in range(maxiters):
        # for each x[i], find the nearest representative and its group index
        for i in range(N):
            ci = np.argmin([la.norm(x[i] - reps[j]) for j in range(k)])
            assignment[i] = ci
            distances[i] = la.norm(x[i] - reps[ci])
        # representative of cluster j is the average of the points assigned to it
        for j in range(k):
            group = [i for i in range(N) if assignment[i] == j]
            if group:                                # leave reps[j] unchanged if its group is empty
                reps[j] = np.sum(x[group], axis=0) / len(group)
        # clustering objective: mean square distance to the representatives
        J = la.norm(distances)**2 / N
        # stop when the objective stops decreasing (relative tolerance)
        if np.abs(J - Jprev) < tol * J:
            break
        Jprev = J
    return assignment, reps
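
A minimal usage sketch of the routine above on synthetic 2-D data (the data and the choice k = 3 are made up for illustration):

import numpy as np

np.random.seed(0)
# three synthetic clusters of 100 points each in R^2
x = np.vstack([np.random.randn(100, 2) + c for c in ([0, 0], [5, 5], [0, 5])])
assignment, reps = kmeans(x, k=3)
print(reps)                       # group representatives (centroids)
print(np.bincount(assignment))    # number of points assigned to each group
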
Convergence of k-means algorithm

J goes down in each step, until the zj ’s stop changing


but (in general) the k-means algorithm does not find the partition that
minimizes J
k-means is a heuristic: it is not guaranteed to find the smallest possible
value of J
the final partition (and its value of J) can depend on the initial
representatives
common approach:
run k-means 10 times, with different (often random) initial representatives
take as final partition the one with the smallest value of J
Example
Iteration 1
Iteration 2
Iteration 3
Iteration 10
Final clustering
Convergence
Handwritten digit image set

MNIST images of handwritten digits


N = 60,000 images, each 28 × 28, represented as 784-vectors xi
25 examples shown below
k-means image clustering

k = 20, run 20 times with different initial assignments


convergence shown below (including best and worst)
Group representatives, best clustering
Topic discovery

N = 500 Wikipedia articles, word count histograms with D = 4423


k = 9, run 20 times with different initial assignments
convergence shown below (including best and worst)
Topics discovered (clusters 1–3)

words with largest representative coefficients

titles of articles closest to cluster representative


Applications

core routine in unsupervised learning and exploratory data analysis


classification
recommendation engine
suppose xi gives the number of times user i has listened to or streamed each song
from a library of d songs
clustering the vectors reveals groups of users of similar musical taste
group representatives have a nice interpretation: (zj)i is the average number of
times users in group j listened to song i
suggest to user i songs they haven’t listened to but others in their group
have listened to most often
for example, to recommend 5 songs to user i, we identify the cluster j they
belong to and then find the indices l with (xi )l = 0, with the 5 largest
values of (zj )l

guessing missing entries


Linear independence

a set of vectors {a1 , . . . , ak } is linearly dependent if

β1 a1 + · · · + βk ak = 0

holds for some β1 , . . . , βk that are not all zero


equivalent to: at least one ai is a linear combination of the others
a set of vectors {a1 , . . . , ak } is linearly independent if

β1 a1 + · · · + βk ak = 0

holds only when β1 = · · · = βk = 0


example: the unit vectors e1 , . . . , en are linearly independent
Linear combinations of linearly independent vectors

suppose x is a linear combination of linearly independent vectors


a1 , . . . , ak :
x = β1 a1 + · · · + βk ak

the coefficients β1 , . . . , βk are unique, i.e., if

x = γ1 a1 + · · · + γk ak

then βi = γi for all i


Basis

a linearly independent set of n-vectors can have at most n elements;


put another way: any set of n + 1 or more n-vectors is linearly dependent
a set of n linearly independent vectors in Rn is called a basis
any vector b in Rn can be expressed as a linear combination of them
and these coefficients are unique
example: e1 , . . . , en is a basis, expansion of b is

b = b1 e1 + · · · + bn en
Orthonormal vectors

a set of vectors a1 , . . . , ak is mutually orthogonal if ai ⊥ aj for i ̸= j


they are normalized if ∥ai ∥ = 1 for i = 1, . . . , k
they are orthonormal if both hold
can be expressed using inner products as

ai^T aj = 1 if i = j,   0 if i ̸= j

orthonormal sets of vectors are linearly independent


by independence-dimension inequality, must have k ≤ n
when k = n, a1 , . . . , an are an orthonormal basis
Orthonormal expansion

If a1 , . . . , an is an orthonormal basis, we have for any vector x

x = (a1^T x)a1 + · · · + (an^T x)an

called orthonormal expansion of x (in the orthonormal basis)


to verify formula, take inner product of both sides with ai
Gram–Schmidt (orthogonalization) algorithm

an algorithm to check if a1 , . . . , ak are linearly independent


appears in core routines of other algorithms

given vectors a1, . . . , ak
for i = 1, . . . , k
    1. orthogonalization: q̃i = ai − (q1^T ai)q1 − · · · − (q_{i−1}^T ai)q_{i−1}
    2. test for linear dependence: if q̃i = 0, quit
    3. normalization: qi = q̃i/∥q̃i∥

if G–S does not stop early (in step 2), a1 , . . . , ak are linearly independent
if G–S stops early in iteration i = j, then aj is a linear combination of
a1 , . . . , aj−1 (so a1 , . . . , ak are linearly dependent)
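
A direct numpy transcription of the algorithm above (a sketch; the function name and tolerance are illustrative, and this is the classical variant, not the numerically safer modified Gram–Schmidt mentioned later):

import numpy as np

def gram_schmidt(a, tol=1e-10):
    # a: list of n-vectors; returns orthonormal q_1, ..., q_k,
    # or None if the vectors are (numerically) linearly dependent
    q = []
    for ai in a:
        qt = ai - sum((qj @ ai) * qj for qj in q)   # orthogonalization
        if np.linalg.norm(qt) < tol:                # test for linear dependence
            return None
        q.append(qt / np.linalg.norm(qt))           # normalization
    return q

a = [np.array([1.0, 1.0, 0.0]), np.array([1.0, 0.0, 1.0])]
q = gram_schmidt(a)
print(np.round([qi @ qj for qi in q for qj in q], 6))   # inner products 1, 0, 0, 1
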
Example
Analysis of Gram–Schmidt

we can show by induction that q1 , . . . , qi are orthonormal


ai is a linear combination of q1, . . . , qi:

ai = ∥q̃i∥qi + (q1^T ai)q1 + · · · + (q_{i−1}^T ai)q_{i−1}

qi is a linear combination of a1, . . . , ai: by induction on i,

qi = (1/∥q̃i∥) ( ai − (q1^T ai)q1 − · · · − (q_{i−1}^T ai)q_{i−1} )

and (by induction assumption) each q1 , . . . , qi−1 is a linear combination


of a1 , . . . , ai−1
if G–S terminates in step j, aj is a linear combination of a1, . . . , a_{j−1}

aj = (q1^T aj)q1 + · · · + (q_{j−1}^T aj)q_{j−1}
Complexity of Gram–Schmidt

step 1 of iteration i requires i − 1 inner products,

q1^T ai, . . . , q_{i−1}^T ai

which costs (i − 1)(2n − 1) flops


2n(i − 1) flops to compute q̃i
3n flops to compute ∥q̃i ∥ and qi
total is

Σ_{i=1}^k ((4n − 1)(i − 1)) + 3nk = (4n − 1) k(k − 1)/2 + 3nk ≈ 2nk^2

using Σ_{i=1}^k (i − 1) = k(k − 1)/2
in practice, we use a more numerically-stable variant, called modified
Gram-Schmidt
Matrices

an m × n matrix A is an array of numbers with m rows and n columns


Aij , Ap:q,r:s notation for accessing entries and blocks
block matrices
A may be written as a block matrix of its (column) vectors a1, . . . , an

A = [ a1  a2  · · ·  an ]

or as a block matrix with its row vectors b1, . . . , bm stacked:

A = [ b1 ]
    [ b2 ]
    [ ⋮  ]
    [ bm ]

special matrices: zero, identity, diagonal, triangular, square, sparse


Matrix uses

store tabular data:


image: Xij is pixel value in grayscale image
feature matrix: Xij is value of feature j for entity i
become tensors when every entity is stored as a matrix

represent operators f : Rn → Rm
map vectors in Rn to vectors in Rm
via matrix-vector multiplication operation
Operations

transpose
addition, subtraction, and scalar multiplication
matrix-vector multiplication u = Av
row interpretation: ui = bi^T v
example: A1 is the vector of row sums
column interpretation: u = Av = v1 a1 + v2 a2 + · · · + vn an
a linear combination of the columns of A
example: Aej = aj
columns of A are linearly independent if Av = 0 implies v = 0
arithmetic complexity:
2mn flops
for sparse matrices, 2 nnz(A), twice the number of non-zeros in A
Selectors

an m × n selector matrix: each row is a unit vector transposed


A = [ e_{k1}^T ]
    [    ⋮     ]
    [ e_{km}^T ]

multiplying by A selects entries of v:


Av = (v_{k1}, v_{k2}, . . . , v_{km})
example: the m × 2m matrix

A = [ 1 0 0 0 · · · 0 0 ]
    [ 0 0 1 0 · · · 0 0 ]
    [ ⋮            ⋱   ]
    [ 0 0 0 0 · · · 1 0 ]

down-samples by 2: if v is a 2m-vector, then Av = (v1, v3, . . . , v_{2m−1})
other examples: permutation, image cropping
Norm

the Frobenius norm is defined as

∥A∥_F = ( Σ_{i=1}^m Σ_{j=1}^n Aij^2 )^{1/2}

agrees with vector norm when n = 1


satisfies norm properties

∥αA∥ = |α|∥A∥, ∥A + B∥ ≤ ∥A∥ + ∥B∥, ∥A∥ ≥ 0, ∥A∥ = 0 only if A = 0

distance between two matrices: ∥A − B∥


other matrix norms:
∥A∥2 , measures a matrix by its largest magnifying effect on vectors
Linear functions

f : Rn → Rm is linear if it satisfies superposition

f (αx + βy) = αf (x) + βf (y)

the matrix-vector multiplication function is linear

f (αx + βy) = A(αx + βy)


= A(αx) + A(βy)
= α(Ax) + β(Ay) = αf (x) + βf (y)

conversely, if f is linear, then

f (x) = f (x1 e1 + x2 e2 + · · · + xn en )
= x1 f (e1 ) + x2 f (e2 ) + · · · + xn f (en )
= Ax
with A = [ f(e1)  f(e2)  · · ·  f(en) ]
Examples

reversal: f (x) = (xn , xn−1 , . . . , x1 )


 
A = [ 0 · · · 0 1 ]
    [ 0 · · · 1 0 ]
    [    ⋰        ]
    [ 1 · · · 0 0 ]

running sum: f (x) = (x1 , x1 + x2 , x1 + x2 + x3 , . . . , x1 + · · · + xn )


 
A = [ 1 0 · · · 0 0 ]
    [ 1 1 · · · 0 0 ]
    [ ⋮ ⋮   ⋱  ⋮ ⋮ ]
    [ 1 1 · · · 1 0 ]
    [ 1 1 · · · 1 1 ]
Affine functions

f : Rn → Rm is affine if it is a linear function plus a constant

f (x) = Ax + b

can find A and b from f


A = [ f(e1) − f(0)   f(e2) − f(0)   · · ·   f(en) − f(0) ],    b = f(0)

affine functions are also called “linear”


in many applications, relations between vectors in Rn and vectors in Rm are
approximated as linear or affine
Matrix multiplication

can multiply m × p matrix A and p × n matrix B to get C = AB


inner product interpretation
with ai^T the rows of A, bj the columns of B

AB = [ a1^T b1   a1^T b2   · · ·   a1^T bn ]
     [ a2^T b1   a2^T b2   · · ·   a2^T bn ]
     [    ⋮         ⋮                ⋮    ]
     [ am^T b1   am^T b2   · · ·   am^T bn ]

outer product interpretation


with ai the columns of A, bj^T the rows of B,

AB = a1 b1^T + · · · + ap bp^T
Matrix multiplication

Column interpretation
denote columns of B by bi
B = [ b1  b2  · · ·  bn ]

then we have

AB = A [ b1  b2  · · ·  bn ] = [ Ab1  Ab2  · · ·  Abn ]

so AB is batch multiply of A times columns of B


some properties
(AB)C = A(BC)
A(B + C) = AB + AC
(AB)T = B T AT
AB = BA does not hold in general
Complexity

each entry Cij = (AB)ij is an inner product of vectors of size p


so total required flops is (mn)(2p) = 2mnp flops
multiplying two 1000 × 1000 matrices requires 2 billion flops
and can be done in a small fraction of a second on current computers
Gram–Schmidt in matrix notation

run Gram–Schmidt on columns a1 , . . . , an of m × n matrix A


if columns are linearly independent, get orthonormal q1 , . . . , qn
define m × n matrix Q with columns q1 , . . . , qn
QT Q = I
from Gram–Schmidt algorithm

ai = (q1^T ai)q1 + · · · + (q_{i−1}^T ai)q_{i−1} + ∥q̃i∥qi
   = R1i q1 + · · · + Rii qi

with Rij = qiT aj for i < j and Rii = ∥q̃i ∥


defining Rij = 0 for i > j, we have A = QR
R is upper triangular, with positive diagonal entries
QR factorization

A = QR is called the QR factorization of A


factors satisfy QT Q = I, R upper triangular with positive diagonal entries
numpy.linalg.qr(A)
one of the four important matrix decompositions
the others being LU , QΛQ−1 , U ΣV T
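
A quick sketch with numpy's QR routine on a random tall matrix (note that numpy does not force the diagonal of R to be positive, unlike the convention above):

import numpy as np

A = np.random.randn(6, 4)                # tall, columns independent with probability one
Q, R = np.linalg.qr(A)                   # Q is 6 x 4, R is 4 x 4
print(np.allclose(Q.T @ Q, np.eye(4)))   # Q^T Q = I
print(np.allclose(R, np.triu(R)))        # R is upper triangular
print(np.allclose(Q @ R, A))             # A = QR
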
Matrix inverse

if A is square and invertible, then its inverse is the matrix A−1 such that

AA−1 = A−1 A = I

some properties
(AB)−1 = B −1 A−1
(AT )−1 = (A−1 )T , sometimes denoted as A−T

more on pseudo-inverses and inverses of rectangular matrices later


Regression model setup

basic regression model: ŷ = fˆ(x) = b + wT x


x is a vector of features, ŷ is our prediction of the actual outcome y
by absorbing the bias term into the parameter vector θ = (b, w), and
redefining the feature vectors as (1, x1 , . . . , xn ), we can simplify notation
and write the regression model as ŷ = fˆ(x) = θT x

more generally, we can write

fˆ(x) = θ1 f1 (x) + · · · + θp fp (x)

where fi : Rn → R are feature mappings we choose


note that if we define f1 (x) = 1, f2 (x) = x1 , · · · , fp (x) = xn , we get back
the basic model above
“feature engineering” is the task of identifying appropriate feature maps

these are linear models in the parameters θ


Regression model setup

now suppose we have N examples x1 , . . . , xN and associated responses


y1 , . . . , y N
associated predictions are ŷi = Σ_{j=1}^p θj fj(xi)
write as ŷ = Aθ
Aij = fj (xi ) is the jth (mapped) feature of the ith training vector
the rows of the N × p matrix A are the (mapped) features of the training vectors xi
for the basic model, this would be Xθ where X is the matrix with rows
x1 , . . . , xN , the training vectors
ŷ is a vector of N predictions

prediction error vector is e = y − ŷ


least squares data fitting
choose model parameters θ to minimize rms(e)
square loss
Least squares problem

choose θ to minimize the mean square loss (1/N)∥Aθ − y∥^2, equivalently

θ̂ = argmin_θ ∥Aθ − y∥^2

θ̂ is a solution of the least squares problem if

∥Aθ̂ − y∥^2 ≤ ∥Aθ − y∥^2

for any vector θ


idea: θ̂ makes residual as small as possible
Solution of least squares problem

we make one assumption: A has linearly independent columns


this makes AT A non-singular
unique solution of least squares problem is

θ̂ = (AT A)−1 AT y

(AT A)−1 AT is called the pseudo-inverse of A


Derivation via calculus

define

f(θ) = ∥Aθ − y∥^2 = Σ_{i=1}^m ( Σ_{j=1}^n Aij θj − yi )^2

solution θ̂ satisfies ∇f(θ̂) = 0


∂f/∂θk (θ̂) = ∇f(θ̂)_k = 0,   k = 1, . . . , n

taking partial derivatives we get


 
∇f(θ)_k = ∂f/∂θk (θ) = Σ_{i=1}^m 2 ( Σ_{j=1}^n Aij θj − yi ) Aik
        = Σ_{i=1}^m 2 (A^T)_{ki} (Aθ − y)_i
        = ( 2A^T (Aθ − y) )_k
Derivation (ctd)

∇f (θ̂) = 2AT (Aθ̂ − y) = 0


so θ̂ satisfies normal equations (AT A)θ̂ = AT y
and therefore θ̂ = (AT A)−1 AT y

We can also get to the normal equations using matrix notation:

∥Aθ − y∥2 = (Aθ − y)T (Aθ − y)


= θT AT Aθ − y T Aθ − θT AT y + y T y
= θT AT Aθ − 2θT AT y + y T y

the gradient is then 2AT Aθ − 2AT y


setting it to zero gives the normal equations A^T A θ̂ = A^T y, and hence θ̂
Direct verification

let θ̂ = (AT A)−1 AT y, so AT (Aθ̂ − y) = 0


for any vector θ we have

∥Aθ − y∥^2 = ∥(Aθ − Aθ̂) + (Aθ̂ − y)∥^2


= ∥A(θ − θ̂)∥2 + ∥Aθ̂ − y∥2 + 2(A(θ − θ̂))T (Aθ̂ − y)
= ∥A(θ − θ̂)∥2 + ∥Aθ̂ − y∥2 + 2(θ − θ̂)T AT (Aθ̂ − y)
= ∥A(θ − θ̂)∥2 + ∥Aθ̂ − y∥2

so for any θ, ∥Aθ − y∥^2 ≥ ∥Aθ̂ − y∥^2


if equality holds, A(θ − θ̂) = 0, which implies θ = θ̂ since columns of A
are linearly independent
Computing least squares approximate solutions

compute the QR factorization of A: A = QR


solution can be written as

θ̂ = (AT A)−1 AT y
= (RT QT QR)−1 RT QT y
= R−1 R−T RT QT y
= R−1 QT y

therefore to compute θ̂
form Q^T y
compute θ̂ = R^{−1}(Q^T y) via back substitution
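
A sketch of this recipe on random data, compared against numpy's own least squares solver (np.linalg.solve performs a general solve here; a dedicated triangular solver would exploit the structure of R):

import numpy as np

np.random.seed(1)
A = np.random.randn(100, 5)                       # tall, independent columns
y = np.random.randn(100)

Q, R = np.linalg.qr(A)
theta = np.linalg.solve(R, Q.T @ y)               # solve R theta = Q^T y

theta_np = np.linalg.lstsq(A, y, rcond=None)[0]   # reference solution
print(np.allclose(theta, theta_np))               # True
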
Complexity of least squares solution

QR factorization (2mn^2 flops)


form Q^T y (2mn flops)
back substitution (n^2 flops)
total complexity 2mn^2, i.e., O(mn^2) flops
main cost is the factorization
cost of multiple right hand sides is essentially the cost of a single one
Example: Polynomial fit

fi(x) = x^{i−1}, i = 1, . . . , p
model is a polynomial of degree less than p

fˆ(x) = θ1 + θ2 x + · · · + θp x^{p−1}

(x^i means the scalar x raised to the power i)


A is the matrix

A = [ 1  x1  · · ·  x1^{p−1} ]
    [ 1  x2  · · ·  x2^{p−1} ]
    [ ⋮   ⋮           ⋮     ]
    [ 1  xN  · · ·  xN^{p−1} ]
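
A sketch of a polynomial fit on synthetic data (the data-generating polynomial and noise level are made up for illustration):

import numpy as np

np.random.seed(2)
x = np.linspace(-1, 1, 50)
y = 1 - 2 * x + 0.5 * x**3 + 0.1 * np.random.randn(50)   # synthetic data

p = 4                                             # polynomial of degree p - 1 = 3
A = np.column_stack([x**i for i in range(p)])     # columns 1, x, x^2, x^3
theta = np.linalg.lstsq(A, y, rcond=None)[0]      # least squares coefficients
print(np.round(theta, 2))                         # roughly (1, -2, 0, 0.5)
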
Generalization

basic idea:
goal of model is not to predict outcome for the given data
instead it is to predict the outcome on new, unseen data

a model that makes reasonable predictions on new, unseen data has


generalization ability, or generalizes
a model that makes poor predictions on new, unseen data is said to suffer
from over-fit
Out-of-sample validation

a simple and effective method to guess if a model will generalize


split original data into a training set and a test set
typical splits: 80%/20%, 90%/10%
train model on training data set
then check the model’s predictions on the test data set
can also compare RMS prediction error on train and test data
if they are similar, we can guess the model will generalize
Out-of-sample validation

can be used to choose among different candidate models, e.g.


polynomials of different degrees
regression models with different sets of features
we would use one with low, or lowest, test error
Example
models fit using training set of 100 points; plots show test set of 100 points
RMS on test data

degree 4, 5, or 6 are reasonable choices here


Cross validation

extension of out-of-sample validation to get more confidence in the


generalization ability of a model
to carry out cross validation:
divide data into 5 (or 10) folds
for i = 1, . . . , 5 build (train) model using all folds except i
test model on data in fold i
interpreting cross validation results:
if test RMS errors are much larger than train RMS errors, model is
over-fit
if test and train RMS errors are similar and consistent, we can guess the
model will have a similar RMS error on future data
RMS cross-validation error is ( Σ_i ϵi^2 / nfolds )^{1/2}, where ϵi is the RMS
prediction error on fold i


Example

house price, regression fit with x = (area/1000 ft2 , bedrooms)


774 sales, divided into 5 folds of 155 sales each
fit 5 regression models, removing each fold
Feature engineering

start with original or base feature vector x


choose mapping functions f1 , . . . , fp to create “mapped” feature vector

f1 (x), . . . , fp (x)

now fit linear in parameters model with mapped features

ŷ = θ1 f1 (x) + · · · + θp fp (x)

check the model using validation


Transforming features

standardizing features: replace jth feature xj with

(xj − bj )/aj

bj , mean value of the feature across the data


aj , standard deviation of the feature across the data

log transform: if xj is nonnegative and spans a wide range, replace it with

log(1 + xj )

clipping values
Creating new features

expanding categoricals
features that take on only few values, e.g., booleans, Likert scale, day of
week
one-hot encoding: expand a categorical feature with l values into l − 1
features that encode whether the feature has one of the (non-default) values
(a small sketch follows this list)
example: bedrooms in the house price prediction problem

hi and lo features: create new features given by

max{xj − b, 0}, min{xj + a, 0}

amount by which feature xj is below −a, or above b


products and interactions
from the original features we can add xj xk for j, k = 1, . . . , n, j ≤ k
products are used to model interactions among the features
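
A minimal sketch of one-hot expansion (the helper one_hot and the choice of default level are illustrative, not from the slides):

import numpy as np

def one_hot(x, values):
    # expand categorical N-vector x into len(values) - 1 boolean columns;
    # values[0] is treated as the default level and gets no column
    x = np.asarray(x)
    return np.column_stack([(x == v).astype(float) for v in values[1:]])

day = ["Mon", "Tue", "Tue", "Sun", "Mon"]
F = one_hot(day, ["Mon", "Tue", "Sun"])   # 5 x 2 matrix of 0/1 features
print(F)
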
More feature generation methods

custom mappings, often involving domain-specific information, e.g., P/E


ratio in financial application, TFIDF in document analysis, etc.
prediction from other models
distance to cluster representatives
random features
generate a K × n matrix R and generate new features as (Rx)+ , or other
nonlinear functions on elements of Rx
a bit counterintuitive but can be very effective in some applications
neural network features
can find good feature mappings directly from the data
useful when lots of data is available, more on this soon

but beware: adding new features can easily lead to over-fit. Keep the
model simple. Validate the model.
Example

house price prediction


start with base features
x1 is area of house (in 1000ft2 )
x2 is number of bedrooms
x3 is 1 for condo, 0 for house
x4 is zip code of address (62 values)
use 8 mapped features
f1 (x) = 1, f2 (x) = x1 , f3 (x) = max{x1 − 1.5, 0}
f4 (x) = x2 , f5 (x) = x3
f6 (x), f7 (x), f8 (x) are Boolean functions of x4 which encode 4 groups of
nearby zip codes (i.e., neighborhood)

5-fold model validation


Example
Application to classification

data fitting with outcome that takes on (non-numerical) values, labels


true or false
spam or not spam
dog, horse, or mouse

classifier has the form ŷ = fˆ(x), f : Rn → {−1, +1} for Boolean or


2-way classification
can be done via least squares
very simple methods
but we will soon look at logistic regression, support vector machines, and
other more sophisticated methods
Prediction errors

data point (x, y), predicted outcome ŷ = fˆ(x)


only four possibilities
True positive, y = +1 and ŷ = +1
True negative, y = −1 and ŷ = −1
(in these two cases, the prediction is correct)
False positive, y = −1 and ŷ = +1
False negative, y = +1 and ŷ = −1
Confusion matrix

given data set x1 , . . . , xN , y1 , . . . , yN , and classifier fˆ


count each of the four outcomes

off-diagonal terms are prediction errors


many error rates and accuracy measures are used
error rate (Nfp + Nfn)/N
true positive (or recall) rate, Ntp/Np
false positive (or false alarm) rate, Nfp/Nn
or equivalently, true negative rate (or specificity), Ntn/Nn
precision, Ntp/(Ntp + Nfp)

out-of-sample validation and cross validation


Example

spam filter performance on a test set

error (19 + 32)/1266 = 4.03%


false positive rate is 19/1139 = 1.67%
Least squares classification

fit model f˜ to the encoded (±1) values yi using standard least squares data
fitting
f˜ should be near +1 when y = +1, and near −1 when y = −1
f˜ is a number
use model fˆ(xi ) = sign(f˜(xi ))
(size of f˜(xi ) is related to the “confidence” in the prediction)
Handwritten digits example

MNIST data set of 70,000 images of digits 0, . . . , 9, each 28 × 28


divided into training set (60000) and test set (10000)
every training image (xi ) is a vector of 494 pixel values (ignore pixels
that are zeros in all images)

y = +1 if digit is 0; −1 otherwise
Least squares classifier results

training set results (error rate 1.6%)

test set results (error rate 1.6%)

we can likely achieve 1.6% error rate on unseen images


Distribution of least squares fit

distribution of values of f˜(xi ) over training set


Coefficients in least squares classifier
Skewed decision threshold

use predictor fˆ(x) = sign(f˜(x) − α), i.e.,


fˆ(x) = +1 if f˜(x) ≥ α,   −1 if f˜(x) < α

α is the decision threshold


for positive α, false positive rate is lower but so is true positive rate
for negative α, false positive rate is higher but so is true positive rate
trade off curve of true positive versus false positive rates is called receiver
operating characteristic (ROC)
Example
ROC curve
Multi-class classifiers

we have K > 2 possible labels, with label set {1, . . . , K}


predictor is fˆ : Rn → {1, . . . , K}
for given predictor and data set, confusion matrix is K × K
some off-diagonal entries may be much worse than others
Examples

handwritten digit classification, guess the digit written, from the pixel
values
marketing demographic classification, guess the demographic group, from
purchase history
disease diagnosis, guess diagnosis from among a set of candidates, from
test results, patient features
translation word choice, choose how to translate a word into several
choices, given context features
document topic prediction, guess topic from word count histogram
Least squares multi-class classifier

create a least squares classifier for each label versus the others
take as classifier
fˆ(x) = argmax_{ℓ ∈ {1,...,K}} f˜ℓ(x)

(i.e., choose ℓ with largest value of f˜ℓ (x))


for example, with

f˜1 (x) = −0.7, f˜2 (x) = +0.2, f˜3 (x) = +0.8

we choose fˆ(x) = 3
Handwritten digit classification

confusion matrix, test set

error rate is around 14% (same as for training set)


Adding new features

let’s add 5000 random features (!), max{(Rx)j , 0}


R is 5000 × 494 matrix with entries ±1, chosen randomly
now use least squares classification with 5494 feature vector
results: training set error 1.5%, test set error 2.6%
can do better with a little more thought in generating new features
Results with new features

confusion matrix, test set


Convolution

for a ∈ Rn and b ∈ Rm, the convolution c = a ∗ b is the (n + m − 1)-vector with entries

ck = Σ_{i+j=k+1} ai bj,   k = 1, . . . , n + m − 1

for example with n = 4, m = 3, we have

c1 = a1 b1
c2 = a1 b2 + a2 b1
c3 = a1 b3 + a2 b2 + a3 b1
c4 = a2 b3 + a3 b2 + a4 b1
c5 = a3 b3 + a4 b2
c6 = a4 b3

(1, 0, −1) ∗ (2, 1, −1) = (2, 1, −3, −1, 1)


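
The same example, checked with numpy's convolution routine:

import numpy as np

a = np.array([1, 0, -1])
b = np.array([2, 1, -1])
print(np.convolve(a, b))   # [ 2  1 -3 -1  1], matching the example above
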
Polynomial multiplication

a and b are coefficients of two polynomials

p(x) = a1 + a2 x + · · · + an x^{n−1},   q(x) = b1 + b2 x + · · · + bm x^{m−1}

convolution c = a ∗ b gives the coefficients of the product p(x)q(x)

p(x)q(x) = c1 + c2 x + · · · + c_{n+m−1} x^{n+m−2}

this gives simple proofs of many properties of convolution; for example,


a∗b=b∗a
(a ∗ b) ∗ c = a ∗ (b ∗ c)
a ∗ b = 0 only if a = 0 or b = 0
Toeplitz matrices

can express c = a ∗ b using matrix-vector multiplication as c = T (b)a,


with  
b1 0 0 0
b2 b1 0 0
 
b3 b2 b1 0 
T (b) = 
 0 b3

 b2 b1 
0 0 b3 b2 
0 0 0 b3

T (b) is a Toeplitz matrix (values on diagonals are equal)


Moving average of time series

vector x represents a time series


convolution y = a ∗ x with a = (1/3, 1/3, 1/3) is 3-period moving average

yk = (1/3)(xk + xk−1 + xk−2),   k = 1, 2, . . . , n + 2

(with xk interpreted as zero for k < 1 and k > n)


Matrix inverse

if A is square and invertible, then its inverse is the matrix A−1 such that

AA−1 = A−1 A = I

some properties
(AB)−1 = B −1 A−1
(AT )−1 = (A−1 )T , sometimes denoted as A−T
Left inverse

a matrix X that satisfies XA = I is called a left inverse of A


if a left inverse exists we say that A is left-invertible
example: the matrix

A = [ −3  −4 ]
    [  4   6 ]
    [  1   1 ]

has two different left inverses:

B = (1/9) [ −11  −10   16 ]       C = (1/2) [ 0  −1   6 ]
          [   7    8  −11 ]                 [ 0   1  −4 ]
Left inverse and column independence

if A has a left inverse C, then the columns of A are linearly independent


to see this: if Ax = 0 and CA = I, then

0 = C0 = C(Ax) = (CA)x = Ix = x

converse is also true


a matrix is left-invertible if and only if its columns are linearly independent

matrix generalization of: a number is invertible if and only if it is nonzero


so left-invertible matrices are tall or square
Right inverse

similarly, right inverses can be defined for wide or square matrices


a matrix A is right-invertible if and only if AT is left-invertible
a matrix is right-invertible if and only if its rows are linearly independent
Linear equations

Ax = b

for relatively small matrices, the methods of choice are factor-solve


methods
a matrix factorization is first computed, independently of the rhs b
and then used to compute x

we write the solution as x = A−1 b, but we almost never generate an


inverse
np.linalg.solve(A, b)
Linear equations

triangular decomposition for general square matrices A = LU


  
L = [ 1                        ]
    [ l21   1                  ]
    [ l31   l32   1            ]
    [  ⋮     ⋮          ⋱      ]
    [ ln1   ln2   ln3  · · ·  1 ]

U = [ u11  u12  u13  · · ·  u1n ]
    [      u22  u23  · · ·  u2n ]
    [           u33  · · ·  u3n ]
    [                  ⋱    ⋮   ]
    [                       unn ]

(2/3)n^3 flops
for SPD matrices, A = LLT (or RT R)
the Cholesky decomposition
(1/3)n^3 flops
available as np.linalg.cholesky(A)
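
A small sketch of the factor-solve routines mentioned above, on random matrices:

import numpy as np

np.random.seed(3)
A = np.random.randn(4, 4)
b = np.random.randn(4)
x = np.linalg.solve(A, b)            # factor-solve; never forms A^{-1} explicitly
print(np.allclose(A @ x, b))         # True

B = A @ A.T + 4 * np.eye(4)          # a symmetric positive definite matrix
L = np.linalg.cholesky(B)            # lower triangular factor with B = L L^T
print(np.allclose(L @ L.T, B))       # True
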
Forward/backward substitutions for triangular systems

(LU)x = b solved in two steps


Ly = b
Ux = y
linear systems with triangular coefficient matrices
n^2 flops
Multiple right hand sides

AX = B

let’s solve Axi = bi , i = 1, . . . , k, with A invertible


carry out LU factorization once ((2/3)n^3 flops)
for i = 1, . . . , k, solve LU x = bi via forward/back substitutions (2kn^2 flops)
total is (2/3)n^3 + 2kn^2
if k is small compared to n, same cost as solving one set of equations!
Singular value decomposition (SVD)

For tall matrices,

A_{m×n} = U_{m×n} Σ_{n×n} V_{n×n}^T

U and V have orthonormal columns


Σ is a diagonal matrix with positive decreasing values, called singular
values
truncated SVD allows us to generate low rank approximations of data
matrices

Âk = U_{1:k} Σ_{1:k,1:k} V_{1:k}^T = Σ_{i=1}^k σi Ui Vi^T

∥A − Âk∥_F ≤ ( Σ_{i=k+1}^n σi^2 )^{1/2}

∥A − Âk∥_2 ≤ σ_{k+1}
best low rank approximation
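
A sketch of the truncated SVD as a low rank approximation, on a random matrix:

import numpy as np

np.random.seed(4)
A = np.random.randn(8, 5)
U, s, Vt = np.linalg.svd(A, full_matrices=False)    # thin SVD: A = U diag(s) Vt

k = 2
Ak = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]          # rank-k approximation
print(np.linalg.norm(A - Ak, 2), s[k])              # spectral error and sigma_{k+1}
print(np.linalg.norm(A - Ak, 'fro'),
      np.sqrt(np.sum(s[k:]**2)))                    # Frobenius error and its bound
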
Singular value decomposition (SVD)
Eigenvalue decomposition (EVD)

For real symmetric matrices,

A = QΛQ^T = Σ_{i=1}^n λi qi qi^T

Q is orthonormal
Λ is a diagonal matrix of eigenvalues
positive definite (SPD) matrices have all positive eigenvalues,
xT Ax > 0 ∀x ̸= 0

A−1 = QΛ−1 QT
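
A sketch of the eigenvalue decomposition of a random symmetric matrix with numpy (eigh returns eigenvalues in ascending order; the inverse check assumes A is invertible, which holds with probability one here):

import numpy as np

np.random.seed(5)
B = np.random.randn(4, 4)
A = B + B.T                          # a real symmetric matrix

lam, Q = np.linalg.eigh(A)           # eigenvalues and orthonormal eigenvectors (columns of Q)
print(np.allclose(Q @ np.diag(lam) @ Q.T, A))                      # A = Q Lambda Q^T
print(np.allclose(Q @ np.diag(1 / lam) @ Q.T, np.linalg.inv(A)))   # A^{-1} = Q Lambda^{-1} Q^T
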
