
AMCS 215

Mathematical Foundations of Machine Learning

Linear Algebra Foundations


Vectors, Matrices, Least Squares
Vectors

a vector is an ordered list of numbers


if a is a vector, ai is its ith entry
notation warning: ai can also refer to ith vector in a list of vectors
block vector
stacked concatenation of vectors

special vectors: 0, 1, ei
Examples

features: xi is the value of ith feature or attribute of an entity


customer purchase: xi is the total $ purchase of product i by a customer
over some period
portfolio: entries give shares (or $ value or fraction) held in each of n
assets, with negative meaning short positions
word count: xi is the number of times word i appears in a document
Word count vectors

a short document

a small dictionary (left) and word count vector (right)

dictionaries used in practice are much larger


Operations on vectors

addition: commutative, associative


subtraction
scalar-vector multiplication: associative, left distributive, right distributive
linear combinations
Inner product

inner product (or dot product) of two vectors a and b is

aT b = a1 b1 + a2 b2 + · · · + an bn

other notation used: < a, b >, a · b


properties
aT b = bT a
(αa)T b = α(aT b)
(a + b)T c = aT c + bT c
general examples
eTi a = ai , picks out ith entry
1T a, sum of entries
aT a, sum of squares of entries
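
As a quick numerical check of these identities in numpy (the vectors below are arbitrary examples, not from the slides):

import numpy as np

a = np.array([1.0, 2.0, -1.0])
b = np.array([3.0, 0.0, 4.0])

print(a @ b)              # inner product a^T b = 3 + 0 - 4 = -1
print(a @ b == b @ a)     # symmetry: a^T b = b^T a
print(np.ones(3) @ a)     # 1^T a, sum of entries
print(a @ a)              # a^T a, sum of squares of entries
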
Complexity

computers store (real) numbers in floating-point format


basic arithmetic operations (addition, multiplication, . . . ) are called
floating point operations or flops
complexity of an algorithm or operation: total number of flops needed, as
function of the input dimension(s)
this can be very grossly approximated
crude approximation of time to execute: (flops needed)/(computer speed)
current computers are around 10 Gflop/sec (1 Gflop/sec = 10^9 flop/sec)
but this can vary by factor of 100
Linear functions

f : Rn → R means a function mapping vectors to scalars


a function is called linear if it satisfies the superposition property

f (αx + βy) = αf (x) + βf (y) ∀α, β, x, y

the inner product function f (x) = aT x is linear


... and all linear functions are inner products
a function that is linear plus a constant is called affine
in linear regression, the affine function (of x)

ŷ = w^T x + b = θ^T x̃     (with θ = (b, w) and x̃ = (1, x))

is used to produce the scalar prediction ŷ of some actual outcome


Norm

the Euclidean norm (or just norm) of a vector x ∈ Rn is

∥x∥ = (x1^2 + · · · + xn^2)^{1/2} = (x^T x)^{1/2}

used to measure the size of a vector


properties
homogeneity: ∥αx∥ = |α|∥x∥
triangle inequality: ∥x + y∥ ≤ ∥x∥ + ∥y∥
nonnegativity: ∥x∥ ≥ 0
definiteness: ∥x∥ = 0 only if x = 0
norm of block vectors

∥(a, b, c)∥ = ∥(∥a∥, ∥b∥, ∥c∥)∥


RMS value

mean-square value of a vector x is

(x1^2 + · · · + xn^2)/n = ∥x∥^2/n

root-mean-square value (RMS value) is

rms(x) = ((x1^2 + · · · + xn^2)/n)^{1/2} = ∥x∥/√n

rms(x) gives “typical” value of |xi |


rms(1) = 1 (independent of n)
RMS value useful for comparing sizes of vectors of different lengths
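
A small numpy sketch of the norm and RMS value (the example vector is made up):

import numpy as np

x = np.array([1.0, -2.0, 2.0, 4.0])
norm = np.linalg.norm(x)              # Euclidean norm, here (1 + 4 + 4 + 16)^{1/2} = 5
rms = norm / np.sqrt(len(x))          # RMS value, norm/sqrt(n) = 2.5
print(norm, rms)
print(np.linalg.norm(np.ones(10)) / np.sqrt(10))   # rms(1) = 1, for any n
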
Chebyshev inequality

suppose that k of the numbers |x1 |, . . . , |xn | are ≥ a


then k of the numbers x1^2, . . . , xn^2 are ≥ a^2
so ∥x∥^2 ≥ ka^2
so we have k ≤ ∥x∥^2/a^2
number of xi with |xi| ≥ a is no more than ∥x∥^2/a^2
this is the Chebyshev inequality
in terms of RMS value:

fraction of entries with |xi| ≥ a is no more than (rms(x)/a)^2

example: no more than 4% of entries can satisfy |xi| ≥ 5 rms(x)
Triangle inequality

(Euclidean) distance between two vectors; dist(a, b) = ∥a − b∥


triangle with vertices at a, b, c
edge lengths are ∥a − b∥, ∥b − c∥, ∥a − c∥
by triangle inequality

∥a − c∥ = ∥(a − b) + (b − c)∥ ≤ ∥a − b∥ + ∥b − c∥

i.e., third edge length is no longer than sum of other two


Feature distance and nearest neighbors

if x and y are feature vectors, ∥x − y∥ is the feature distance


if z1 , . . . , zm is a list of vectors, zj is the nearest neighbor of x if

∥x − zj ∥ ≤ ∥x − zi ∥, i = 1, . . . , m

these simple ideas are very widely used

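
A minimal nearest-neighbor sketch in numpy (the vectors z1, . . . , zm and x are made-up examples):

import numpy as np

z = np.array([[2.0, 1.0], [7.0, 2.0], [5.5, 4.0], [4.0, 8.0]])  # the list z_1, ..., z_m as rows
x = np.array([5.0, 6.0])

dists = np.linalg.norm(z - x, axis=1)   # ||x - z_i|| for each i
j = np.argmin(dists)                    # index of the nearest neighbor of x
print(j, z[j], dists[j])
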

Document dissimilarity

5 Wikipedia articles: ‘Veterans Day’, ‘Memorial Day’, ‘Academy Awards’,


‘Golden Globe Awards’, ‘Super Bowl’
word count histograms, dictionary of 4423 words
pairwise distances shown below
Standard deviation

avg(x) = 1T x/n
de-meaned vector is x̃ = x − avg(x)1 (so avg(x̃) = 0)
standard deviation of x
std(x) = rms(x̃) = ∥x − (1^T x/n)1∥/√n

std(x) gives “typical amount” xi vary from avg(x)


std(x) = 0 only if x = α1 for some α
a basic formula

rms(x)^2 = avg(x)^2 + std(x)^2

standardization (µ = 0, σ = 1, z-scores)

z = (1/std(x)) (x − avg(x)1)
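
In numpy, de-meaning and standardization are short one-liners; a sketch with an arbitrary vector (numpy's default std matches the definition above, i.e., the rms of the de-meaned vector):

import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0])
avg = x.mean()                 # avg(x)
std = x.std()                  # std(x) = rms of the de-meaned vector
z = (x - avg) / std            # standardized (z-scored) vector
print(avg, std)
print(z.mean(), z.std())       # approximately 0 and 1
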
Chebyshev inequality for standard deviation

rough idea: most entries of x are not too far from the mean
by Chebyshev inequality, fraction of entries of x with

|xi − avg(x)| ≥ α std(x)

is no more than 1/α^2 (for α > 1)


consider a time series that represents investment returns over different
periods, with mean 8% and standard deviation 3%
loss (xi ≤ 0) can occur in no more than (3/8)^2 ≈ 14.1% of periods
Cauchy–Schwarz inequality

for two vectors a and b, |aT b| ≤ ∥a∥∥b∥


written out,

|a1 b1 + · · · + an bn| ≤ (a1^2 + · · · + an^2)^{1/2} (b1^2 + · · · + bn^2)^{1/2}

now we can show the triangle inequality:

∥a + b∥^2 = ∥a∥^2 + 2a^T b + ∥b∥^2

          ≤ ∥a∥^2 + 2∥a∥∥b∥ + ∥b∥^2
          = (∥a∥ + ∥b∥)^2
Derivation of Cauchy–Schwarz inequality

clearly true if either a or b is zero


so assume α = ∥a∥ and β = ∥b∥ are nonzero
we have

0 ≤ ∥βa − αb∥^2
  = ∥βa∥^2 − 2(βa)^T (αb) + ∥αb∥^2
  = β^2 ∥a∥^2 − 2βα(a^T b) + α^2 ∥b∥^2
  = 2∥a∥^2 ∥b∥^2 − 2∥a∥∥b∥(a^T b)

divide by 2∥a∥∥b∥ to get aT b ≤ ∥a∥∥b∥


apply to −a, b to get other half of Cauchy–Schwarz inequality
Angle

angle between two nonzero vectors a, b defined as


∠(a, b) = arccos( a^T b / (∥a∥∥b∥) )

∠(a, b) is the number in [0, π] that satisfies


aT b = ∥a∥∥b∥ cos(∠(a, b))

coincides with ordinary angle between vectors in 2-D and 3-D; measures
distance along sphere
Correlation coefficient

vectors a and b, and de-meaned vectors ã and b̃


correlation coefficient between a and b

ρ = ã^T b̃ / (∥ã∥∥b̃∥)

ρ = cos(∠(ã, b̃))
ρ = 0: uncorrelated
ρ > 0.8 (or so): highly correlated
ρ < −0.8 (or so): highly anti-correlated

very roughly: highly correlated means ai and bi are typically both above
(below) their means together
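
A short sketch of the definition in numpy, checked against numpy's built-in (the vectors and the helper name corr are illustrative):

import numpy as np

def corr(a, b):
    # correlation coefficient, as defined above
    at = a - a.mean()          # de-meaned vectors
    bt = b - b.mean()
    return (at @ bt) / (np.linalg.norm(at) * np.linalg.norm(bt))

a = np.array([1.0, 2.0, 3.0, 4.0])
b = np.array([2.0, 1.0, 4.0, 5.0])
print(corr(a, b))                  # about 0.85: highly correlated
print(np.corrcoef(a, b)[0, 1])     # numpy's built-in gives the same value
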
Examples

highly correlated vectors


rainfall time series at nearby locations
daily returns of similar companies in same industry
word count vectors of closely related documents (e.g., same author, topic,
etc.)
sales of shoes and socks (at different locations or periods)
approximately uncorrelated vectors
unrelated vectors
audio signals
Clustering application

given N vectors, x1 , · · · , xN
goal: partition (cluster) into k groups
want vectors in the same group to be close to each other
Example settings

topic discovery and document classification


xi is word count histogram for document i
patient clustering
xi are patient attributes, test results, symptoms
customer market segmentation
xi is purchase history and other attributes of customer i
financial sectors
xi are vectors of financial attributes of company i
Clustering objective

Gj ⊂ {1, · · · , N } is group j, for j = 1, · · · , k


ci is group that xi is in: i ∈ Gci
group representatives: vectors z1 , · · · , zk
clustering objective is

J = (1/N) Σ_{i=1}^N ∥xi − z_{ci}∥^2

mean square distance from vectors to associated representative


J small means good clustering
goal: choose clustering ci and representatives zj to minimize J
Partitioning the vectors given the representatives

suppose representatives z1 , . . . , zk are given


how do we assign the vectors to groups, i.e., choose c1 , · · · , cN

ci only appears in term ∥xi − zci ∥2 in J


to minimize over ci , choose ci so ∥xi − zci ∥2 = minj ∥xi − zj ∥2
i.e., assign each vector to its nearest representative
Choosing representatives given the partition

given the partition G1 , · · · , Gk , how do we choose representatives


z1 , · · · , zk to minimize J?
J splits into a sum of k sums, one for each zj :

J = J1 + · · · + Jk,    Jj = (1/N) Σ_{i∈Gj} ∥xi − zj∥^2

so we choose zj to minimize mean square distance to the points in its


partition
this is the mean (or average or centroid) of the points in the partition:

zj = (1/|Gj|) Σ_{i∈Gj} xi
k-means algorithm

alternate between updating the partition, then the representatives


a famous algorithm called k-means
objective J decreases in each step

given x1, . . . , xN and z1, . . . , zk
repeat
    update partition: assign i to Gj, j = argmin_{j′} ∥xi − z_{j′}∥^2
    update centroids: zj = (1/|Gj|) Σ_{i∈Gj} xi
until z1, . . . , zk stop changing
import numpy as np
import numpy.linalg as la

def kmeans(x, k, maxiters=30, tol=1e-4):
    N, d = x.shape
    distances = np.zeros(N)                          # distance of each x[i] to its nearest representative
    initial = np.random.choice(N, k, replace=False)  # indices of the initial group representatives
    reps = x[initial, :]                             # representatives z_1, ..., z_k
    assignment = np.zeros(N, dtype=int)              # assignment of vectors to groups
    Jprev = np.inf
    for _ in range(maxiters):
        # for each x[i], find the nearest representative and its group index
        for i in range(N):
            ci = np.argmin([la.norm(x[i] - reps[j]) for j in range(k)])
            assignment[i] = ci
            distances[i] = la.norm(x[i] - reps[ci])
        # representative of cluster j is the average of the points assigned to it
        for j in range(k):
            group = [i for i in range(N) if assignment[i] == j]
            if group:                                # leave reps[j] unchanged if its group is empty
                reps[j] = np.sum(x[group], axis=0) / len(group)
        # clustering objective: mean square distance to the representatives
        J = la.norm(distances)**2 / N
        # stop when the objective stops decreasing (relative tolerance)
        if np.abs(J - Jprev) < tol * J:
            break
        Jprev = J
    return assignment, reps
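
A minimal usage sketch of the routine above on synthetic 2-D data (the data and the choice k = 3 are made up for illustration):

import numpy as np

np.random.seed(0)
# three synthetic clusters of 100 points each in R^2
x = np.vstack([np.random.randn(100, 2) + c for c in ([0, 0], [5, 5], [0, 5])])
assignment, reps = kmeans(x, k=3)
print(reps)                       # group representatives (centroids)
print(np.bincount(assignment))    # number of points assigned to each group
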
Convergence of k-means algorithm

J goes down in each step, until the zj ’s stop changing


but (in general) the k-means algorithm does not find the partition that
minimizes J
k-means is a heuristic: it is not guaranteed to find the smallest possible
value of J
the final partition (and its value of J) can depend on the initial
representatives
common approach:
run k-means 10 times, with different (often random) initial representatives
take as final partition the one with the smallest value of J
Example
Iteration 1
Iteration 2
Iteration 3
Iteration 10
Final clustering
Convergence
Handwritten digit image set

MNIST images of handwritten digits


N = 60,000 images, each 28 × 28, represented as 784-vectors xi
25 examples shown below
k-means image clustering

k = 20, run 20 times with different initial assignments


convergence shown below (including best and worst)
Group representatives, best clustering
Topic discovery

N = 500 Wikipedia articles, word count histograms with D = 4423


k = 9, run 20 times with different initial assignments
convergence shown below (including best and worst)
Topics discovered (clusters 1–3)

words with largest representative coefficients

titles of articles closest to cluster representative


Applications

core routine in unsupervised learning and exploratory data analysis


classification
recommendation engine
suppose xi gives the number of times user i has listened to or streamed each song
from a library of d songs
clustering the vectors reveals groups of users of similar musical taste
group representatives have a nice interpretation: (zj)i is the average number of
times users in group j listened to song i
suggest to user i songs they haven’t listened to but others in their group
have listened to most often
for example, to recommend 5 songs to user i, we identify the cluster j they
belong to and then find the indices l with (xi )l = 0, with the 5 largest
values of (zj )l

guessing missing entries


Linear independence

a set of vectors {a1 , . . . , ak } is linearly dependent if

β1 a1 + · · · + βk ak = 0

holds for some β1 , . . . , βk that are not all zero


equivalent to: at least one ai is a linear combination of the others
a set of vectors {a1 , . . . , ak } is linearly independent if

β1 a1 + · · · + βk ak = 0

holds only when β1 = · · · = βk = 0


example: the unit vectors e1 , . . . , en are linearly independent
Linear combinations of linearly independent vectors

suppose x is a linear combination of linearly independent vectors


a1 , . . . , ak :
x = β1 a1 + · · · + βk ak

the coefficients β1 , . . . , βk are unique, i.e., if

x = γ1 a1 + · · · + γk ak

then βi = γi for all i


Basis

a linearly independent set of n-vectors can have at most n elements;


put another way: any set of n + 1 or more n-vectors is linearly dependent
a set of n linearly independent vectors in Rn is called a basis
any vector b in Rn can be expressed as a linear combination of them
and these coefficients are unique
example: e1 , . . . , en is a basis, expansion of b is

b = b1 e1 + · · · + bn en
Orthonormal vectors

a set of vectors a1 , . . . , ak is mutually orthogonal if ai ⊥ aj for i ̸= j


they are normalized if ∥ai ∥ = 1 for i = 1, . . . , k
they are orthonormal if both hold
can be expressed using inner products as

ai^T aj = 1 if i = j,   0 if i ̸= j

orthonormal sets of vectors are linearly independent


by independence-dimension inequality, must have k ≤ n
when k = n, a1 , . . . , an are an orthonormal basis
Orthonormal expansion

If a1 , . . . , an is an orthonormal basis, we have for any vector x

x = (a1^T x)a1 + · · · + (an^T x)an

called orthonormal expansion of x (in the orthonormal basis)


to verify formula, take inner product of both sides with ai
Gram–Schmidt (orthogonalization) algorithm

an algorithm to check if a1 , . . . , ak are linearly independent


appears in core routines of other algorithms

given vectors a1, . . . , ak
for i = 1, . . . , k
    1. orthogonalization: q̃i = ai − (q1^T ai)q1 − · · · − (q_{i−1}^T ai)q_{i−1}
    2. test for linear dependence: if q̃i = 0, quit
    3. normalization: qi = q̃i/∥q̃i∥

if G–S does not stop early (in step 2), a1 , . . . , ak are linearly independent
if G–S stops early in iteration i = j, then aj is a linear combination of
a1 , . . . , aj−1 (so a1 , . . . , ak are linearly dependent)
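
A direct numpy transcription of the algorithm above (a sketch; the function name and tolerance are illustrative, and this is the classical variant, not the numerically safer modified Gram–Schmidt mentioned later):

import numpy as np

def gram_schmidt(a, tol=1e-10):
    # a: list of n-vectors; returns orthonormal q_1, ..., q_k,
    # or None if the vectors are (numerically) linearly dependent
    q = []
    for ai in a:
        qt = ai - sum((qj @ ai) * qj for qj in q)   # orthogonalization
        if np.linalg.norm(qt) < tol:                # test for linear dependence
            return None
        q.append(qt / np.linalg.norm(qt))           # normalization
    return q

a = [np.array([1.0, 1.0, 0.0]), np.array([1.0, 0.0, 1.0])]
q = gram_schmidt(a)
print(np.round([qi @ qj for qi in q for qj in q], 6))   # inner products 1, 0, 0, 1
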
Example
Analysis of Gram–Schmidt

we can show by induction that q1 , . . . , qi are orthonormal


ai is a linear combination of q1, . . . , qi:

ai = ∥q̃i∥qi + (q1^T ai)q1 + · · · + (q_{i−1}^T ai)q_{i−1}

qi is a linear combination of a1, . . . , ai: by induction on i,

qi = (1/∥q̃i∥) ( ai − (q1^T ai)q1 − · · · − (q_{i−1}^T ai)q_{i−1} )

and (by induction assumption) each q1 , . . . , qi−1 is a linear combination


of a1 , . . . , ai−1
if G–S terminates in step j, aj is a linear combination of a1, . . . , a_{j−1}

aj = (q1^T aj)q1 + · · · + (q_{j−1}^T aj)q_{j−1}
Complexity of Gram–Schmidt

step 1 of iteration i requires i − 1 inner products,

q1^T ai, . . . , q_{i−1}^T ai

which costs (i − 1)(2n − 1) flops


2n(i − 1) flops to compute q̃i
3n flops to compute ∥q̃i ∥ and qi
total is

Σ_{i=1}^k ((4n − 1)(i − 1)) + 3nk = (4n − 1) k(k − 1)/2 + 3nk ≈ 2nk^2

using Σ_{i=1}^k (i − 1) = k(k − 1)/2
in practice, we use a more numerically-stable variant, called modified
Gram-Schmidt
Matrices

an m × n matrix A is an array of numbers with m rows and n columns


Aij , Ap:q,r:s notation for accessing entries and blocks
block matrices
A may be written as a block matrix of its (column) vectors a1, . . . , an

A = [ a1  a2  · · ·  an ]

or as a block matrix with its row vectors b1, . . . , bm stacked:

A = [ b1 ]
    [ b2 ]
    [ ⋮  ]
    [ bm ]

special matrices: zero, identity, diagonal, triangular, square, sparse


Matrix uses

store tabular data:


image: Xij is pixel value in grayscale image
feature matrix: Xij is value of feature j for entity i
become tensors when every entity is stored as a matrix

represent operators f : Rn → Rm
map vectors in Rn to vectors in Rm
via matrix-vector multiplication operation
Operations

transpose
addition, subtraction, and scalar multiplication
matrix-vector multiplication u = Av
row interpretation: ui = bi^T v
example: A1 is the vector of row sums
column interpretation: u = Av = v1 a1 + v2 a2 + · · · + vn an
a linear combination of the columns of A
example: Aej = aj
columns of A are linearly independent if Av = 0 implies v = 0
arithmetic complexity:
2mn flops
for sparse matrices, 2 nnz(A), twice the number of non-zeros in A
Selectors

an m × n selector matrix: each row is a unit vector transposed


A = [ e_{k1}^T ]
    [    ⋮     ]
    [ e_{km}^T ]

multiplying by A selects entries of v:


Av = (v_{k1}, v_{k2}, . . . , v_{km})
example: the m × 2m matrix

A = [ 1 0 0 0 · · · 0 0 ]
    [ 0 0 1 0 · · · 0 0 ]
    [ ⋮            ⋱   ]
    [ 0 0 0 0 · · · 1 0 ]

down-samples by 2: if v is a 2m-vector, then Av = (v1, v3, . . . , v_{2m−1})
other examples: permutation, image cropping
Norm

the Frobenius norm is defined as

∥A∥_F = ( Σ_{i=1}^m Σ_{j=1}^n Aij^2 )^{1/2}

agrees with vector norm when n = 1


satisfies norm properties

∥αA∥ = |α|∥A∥, ∥A + B∥ ≤ ∥A∥ + ∥B∥, ∥A∥ ≥ 0, ∥A∥ = 0 only if A = 0

distance between two matrices: ∥A − B∥


other matrix norms:
∥A∥2 , measures a matrix by its largest magnifying effect on vectors
Linear functions

f : Rn → Rm is linear if it satisfies superposition

f (αx + βy) = αf (x) + βf (y)

the matrix-vector multiplication function is linear

f (αx + βy) = A(αx + βy)


= A(αx) + A(βy)
= α(Ax) + β(Ay) = αf (x) + βf (y)

conversely, if f is linear, then

f (x) = f (x1 e1 + x2 e2 + · · · + xn en )
= x1 f (e1 ) + x2 f (e2 ) + · · · + xn f (en )
= Ax
with A = [ f(e1)  f(e2)  · · ·  f(en) ]
Examples

reversal: f (x) = (xn , xn−1 , . . . , x1 )


 
A = [ 0 · · · 0 1 ]
    [ 0 · · · 1 0 ]
    [    ⋰        ]
    [ 1 · · · 0 0 ]

running sum: f (x) = (x1 , x1 + x2 , x1 + x2 + x3 , . . . , x1 + · · · + xn )


 
A = [ 1 0 · · · 0 0 ]
    [ 1 1 · · · 0 0 ]
    [ ⋮ ⋮   ⋱  ⋮ ⋮ ]
    [ 1 1 · · · 1 0 ]
    [ 1 1 · · · 1 1 ]
Affine functions

f : Rn → Rm is affine if it is a linear function plus a constant

f (x) = Ax + b

can find A and b from f


A = [ f(e1) − f(0)   f(e2) − f(0)   · · ·   f(en) − f(0) ],    b = f(0)

affine functions are also called “linear”


in many applications, relations between vectors in Rn and vectors in Rm are
approximated as linear or affine
Matrix multiplication

can multiply m × p matrix A and p × n matrix B to get C = AB


inner product interpretation
with ai^T the rows of A, bj the columns of B

AB = [ a1^T b1   a1^T b2   · · ·   a1^T bn ]
     [ a2^T b1   a2^T b2   · · ·   a2^T bn ]
     [    ⋮         ⋮                ⋮    ]
     [ am^T b1   am^T b2   · · ·   am^T bn ]

outer product interpretation


with ai the columns of A, bj^T the rows of B,

AB = a1 b1^T + · · · + ap bp^T
Matrix multiplication

Column interpretation
denote columns of B by bi
B = [ b1  b2  · · ·  bn ]

then we have

AB = A [ b1  b2  · · ·  bn ] = [ Ab1  Ab2  · · ·  Abn ]

so AB is batch multiply of A times columns of B


some properties
(AB)C = A(BC)
A(B + C) = AB + AC
(AB)T = B T AT
AB = BA does not hold in general
Complexity

each entry Cij = (AB)ij is an inner product of vectors of size p


so total required flops is (mn)(2p) = 2mnp flops
multiplying two 1000 × 1000 matrices requires 2 billion flops
and can be done in a small fraction of a second on current computers
Gram–Schmidt in matrix notation

run Gram–Schmidt on columns a1 , . . . , an of m × n matrix A


if columns are linearly independent, get orthonormal q1 , . . . , qn
define m × n matrix Q with columns q1 , . . . , qn
QT Q = I
from Gram–Schmidt algorithm

ai = (q1^T ai)q1 + · · · + (q_{i−1}^T ai)q_{i−1} + ∥q̃i∥qi
   = R1i q1 + · · · + Rii qi

with Rij = qiT aj for i < j and Rii = ∥q̃i ∥


defining Rij = 0 for i > j, we have A = QR
R is upper triangular, with positive diagonal entries
QR factorization

A = QR is called the QR factorization of A


factors satisfy QT Q = I, R upper triangular with positive diagonal entries
numpy.linalg.qr(A)
one of the four important matrix decompositions
the others being LU , QΛQ−1 , U ΣV T
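
A quick sketch with numpy's QR routine on a random tall matrix (note that numpy does not force the diagonal of R to be positive, unlike the convention above):

import numpy as np

A = np.random.randn(6, 4)                # tall, columns independent with probability one
Q, R = np.linalg.qr(A)                   # Q is 6 x 4, R is 4 x 4
print(np.allclose(Q.T @ Q, np.eye(4)))   # Q^T Q = I
print(np.allclose(R, np.triu(R)))        # R is upper triangular
print(np.allclose(Q @ R, A))             # A = QR
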
Matrix inverse

if A is square and invertible, then its inverse is the matrix A−1 such that

AA−1 = A−1 A = I

some properties
(AB)−1 = B −1 A−1
(AT )−1 = (A−1 )T , sometimes denoted as A−T

more on pseudo-inverses and inverses of rectangular matrices later


Regression model setup

basic regression model: ŷ = fˆ(x) = b + wT x


x is a vector of features, ŷ is our prediction of the actual outcome y
by absorbing the bias term into the parameter vector θ = (b, w), and
redefining the feature vectors as (1, x1 , . . . , xn ), we can simplify notation
and write the regression model as ŷ = fˆ(x) = θT x

more generally, we can write

fˆ(x) = θ1 f1 (x) + · · · + θp fp (x)

where fi : Rn → R are feature mappings we choose


note that if we define f1 (x) = 1, f2 (x) = x1 , · · · , fp (x) = xn , we get back
the basic model above
“feature engineering” is the task of identifying appropriate feature maps

these are linear models in the parameters θ


Regression model setup

now suppose we have N examples x1 , . . . , xN and associated responses


y1 , . . . , y N
associated predictions are ŷi = Σ_{j=1}^p θj fj(xi)
write as ŷ = Aθ
Aij = fj (xi ) is the jth (mapped) feature of the ith training vector
the rows of the N × p matrix A are the (mapped) features of the training vectors xi
for the basic model, this would be Xθ where X is the matrix with rows
x1 , . . . , xN , the training vectors
ŷ is a vector of N predictions

prediction error vector is e = y − ŷ


least squares data fitting
choose model parameters θ to minimize rms(e)
square loss
Least squares problem

choose θ to minimize the mean square loss (1/N)∥Aθ − y∥^2, equivalently

θ̂ = argmin_θ ∥Aθ − y∥^2

θ̂ is a solution of the least squares problem if

∥Aθ̂ − y∥^2 ≤ ∥Aθ − y∥^2

for any vector θ


idea: θ̂ makes residual as small as possible
Solution of least squares problem

we make one assumption: A has linearly independent columns


this makes AT A non-singular
unique solution of least squares problem is

θ̂ = (AT A)−1 AT y

(AT A)−1 AT is called the pseudo-inverse of A


Derivation via calculus

define

f(θ) = ∥Aθ − y∥^2 = Σ_{i=1}^m ( Σ_{j=1}^n Aij θj − yi )^2

solution θ̂ satisfies ∇f(θ̂) = 0


∂f/∂θk (θ̂) = ∇f(θ̂)_k = 0,   k = 1, . . . , n

taking partial derivatives we get


 
∇f(θ)_k = ∂f/∂θk (θ) = Σ_{i=1}^m 2 ( Σ_{j=1}^n Aij θj − yi ) Aik
        = Σ_{i=1}^m 2 (A^T)_{ki} (Aθ − y)_i
        = ( 2A^T (Aθ − y) )_k
Derivation (ctd)

∇f (θ̂) = 2AT (Aθ̂ − y) = 0


so θ̂ satisfies normal equations (AT A)θ̂ = AT y
and therefore θ̂ = (AT A)−1 AT y

We can also get to the normal equations using matrix notation:

∥Aθ − y∥2 = (Aθ − y)T (Aθ − y)


= θT AT Aθ − y T Aθ − θT AT y + y T y
= θT AT Aθ − 2θT AT y + y T y

the gradient is then 2AT Aθ − 2AT y


setting it to zero gives the normal equations A^T A θ̂ = A^T y, and hence θ̂
Direct verification

let θ̂ = (AT A)−1 AT y, so AT (Aθ̂ − y) = 0


for any vector θ we have

∥Aθ − y∥^2 = ∥(Aθ − Aθ̂) + (Aθ̂ − y)∥^2


= ∥A(θ − θ̂)∥2 + ∥Aθ̂ − y∥2 + 2(A(θ − θ̂))T (Aθ̂ − y)
= ∥A(θ − θ̂)∥2 + ∥Aθ̂ − y∥2 + 2(θ − θ̂)T AT (Aθ̂ − y)
= ∥A(θ − θ̂)∥2 + ∥Aθ̂ − y∥2

so for any θ, ∥Aθ − y∥^2 ≥ ∥Aθ̂ − y∥^2


if equality holds, A(θ − θ̂) = 0, which implies θ = θ̂ since columns of A
are linearly independent
Computing least squares approximate solutions

compute the QR factorization of A: A = QR


solution can be written as

θ̂ = (AT A)−1 AT y
= (RT QT QR)−1 RT QT y
= R−1 R−T RT QT y
= R−1 QT y

therefore to compute θ̂
form Q^T y
compute θ̂ = R^{−1}(Q^T y) via back substitution
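
A sketch of this recipe on random data, compared against numpy's own least squares solver (np.linalg.solve performs a general solve here; a dedicated triangular solver would exploit the structure of R):

import numpy as np

np.random.seed(1)
A = np.random.randn(100, 5)                       # tall, independent columns
y = np.random.randn(100)

Q, R = np.linalg.qr(A)
theta = np.linalg.solve(R, Q.T @ y)               # solve R theta = Q^T y

theta_np = np.linalg.lstsq(A, y, rcond=None)[0]   # reference solution
print(np.allclose(theta, theta_np))               # True
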
Complexity of least squares solution

QR factorization (2mn^2 flops)


form Q^T y (2mn flops)
back substitution (n^2 flops)
total complexity 2mn^2, i.e., O(mn^2) flops
main cost is the factorization
cost of multiple right hand sides is essentially the cost of a single one
Example: Polynomial fit

fi(x) = x^{i−1}, i = 1, . . . , p
model is a polynomial of degree less than p

fˆ(x) = θ1 + θ2 x + · · · + θp x^{p−1}

(x^i means the scalar x raised to the power i)


A is the matrix

A = [ 1  x1  · · ·  x1^{p−1} ]
    [ 1  x2  · · ·  x2^{p−1} ]
    [ ⋮   ⋮           ⋮     ]
    [ 1  xN  · · ·  xN^{p−1} ]
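
A sketch of a polynomial fit on synthetic data (the data-generating polynomial and noise level are made up for illustration):

import numpy as np

np.random.seed(2)
x = np.linspace(-1, 1, 50)
y = 1 - 2 * x + 0.5 * x**3 + 0.1 * np.random.randn(50)   # synthetic data

p = 4                                             # polynomial of degree p - 1 = 3
A = np.column_stack([x**i for i in range(p)])     # columns 1, x, x^2, x^3
theta = np.linalg.lstsq(A, y, rcond=None)[0]      # least squares coefficients
print(np.round(theta, 2))                         # roughly (1, -2, 0, 0.5)
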
Generalization

basic idea:
goal of model is not to predict outcome for the given data
instead it is to predict the outcome on new, unseen data

a model that makes reasonable predictions on new, unseen data has


generalization ability, or generalizes
a model that makes poor predictions on new, unseen data is said to suffer
from over-fit
Out-of-sample validation

a simple and effective method to guess if a model will generalize


split original data into a training set and a test set
typical splits: 80%/20%, 90%/10%
train model on training data set
then check the model’s predictions on the test data set
can also compare RMS prediction error on train and test data
if they are similar, we can guess the model will generalize
Out-of-sample validation

can be used to choose among different candidate models, e.g.


polynomials of different degrees
regression models with different sets of features
we would use one with low, or lowest, test error
Example
models fit using training set of 100 points; plots show test set of 100 points
RMS on test data

degree 4, 5, or 6 are reasonable choices here


Cross validation

extension of out-of-sample validation to get more confidence in the


generalization ability of a model
to carry out cross validation:
divide data into 5 (or 10) folds
for i = 1, . . . , 5 build (train) model using all folds except i
test model on data in fold i
interpreting cross validation results:
if test RMS errors are much larger than train RMS errors, model is
over-fit
if test and train RMS errors are similar and consistent, we can guess the
model will have a similar RMS error on future data
RMS cross-validation error is ( Σ_i ϵi^2 / nfolds )^{1/2}, where ϵi is the RMS
prediction error on fold i


Example

house price, regression fit with x = (area/1000 ft2 , bedrooms)


774 sales, divided into 5 folds of 155 sales each
fit 5 regression models, removing each fold
Feature engineering

start with original or base feature vector x


choose mapping functions f1 , . . . , fp to create “mapped” feature vector

f1 (x), . . . , fp (x)

now fit linear in parameters model with mapped features

ŷ = θ1 f1 (x) + · · · + θp fp (x)

check the model using validation


Transforming features

standardizing features: replace jth feature xj with

(xj − bj )/aj

bj , mean value of the feature across the data


aj , standard deviation of the feature across the data

log transform: if xj is nonnegative and spans a wide range, replace it with

log(1 + xj )

clipping values
Creating new features

expanding categoricals
features that take on only few values, e.g., booleans, Likert scale, day of
week
one-hot encoding: expand a categorical feature with l values into l − 1
features that encode whether the feature has one of the (non-default) values
(a small sketch follows this list)
example: bedrooms in the house price prediction problem

hi and lo features: create new features given by

max{xj − b, 0}, min{xj + a, 0}

amount by which feature xj is below −a, or above b


products and interactions
from the original features we can add xj xk for j, k = 1, . . . , n, j ≤ k
products are used to model interactions among the features
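
A minimal sketch of one-hot expansion (the helper one_hot and the choice of default level are illustrative, not from the slides):

import numpy as np

def one_hot(x, values):
    # expand categorical N-vector x into len(values) - 1 boolean columns;
    # values[0] is treated as the default level and gets no column
    x = np.asarray(x)
    return np.column_stack([(x == v).astype(float) for v in values[1:]])

day = ["Mon", "Tue", "Tue", "Sun", "Mon"]
F = one_hot(day, ["Mon", "Tue", "Sun"])   # 5 x 2 matrix of 0/1 features
print(F)
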
More feature generation methods

custom mappings, often involving domain-specific information, e.g., P/E


ratio in financial application, TFIDF in document analysis, etc.
prediction from other models
distance to cluster representatives
random features
generate a K × n matrix R and generate new features as (Rx)+ , or other
nonlinear functions on elements of Rx
a bit counterintuitive but can be very effective in some applications
neural network features
can find good feature mappings directly from the data
useful when lots of data is available, more on this soon

but beware: adding new features can easily lead to over-fit. Keep the
model simple. Validate the model.
Example

house price prediction


start with base features
x1 is area of house (in 1000ft2 )
x2 is number of bedrooms
x3 is 1 for condo, 0 for house
x4 is zip code of address (62 values)
use 8 mapped features
f1 (x) = 1, f2 (x) = x1 , f3 (x) = max{x1 − 1.5, 0}
f4 (x) = x2 , f5 (x) = x3
f6 (x), f7 (x), f8 (x) are Boolean functions of x4 which encode 4 groups of
nearby zip codes (i.e., neighborhood)

5-fold model validation


Example
Application to classification

data fitting with outcome that takes on (non-numerical) values, labels


true or false
spam or not spam
dog, horse, or mouse

classifier has the form ŷ = fˆ(x), f : Rn → {−1, +1} for Boolean or


2-way classification
can be done via least squares
very simple methods
but we will soon look at logistic regression, support vector machines, and
other more sophisticated methods
Prediction errors

data point (x, y), predicted outcome ŷ = fˆ(x)


only four possibilities
True positive, y = +1 and ŷ = +1
True negative, y = −1 and ŷ = −1
(in these two cases, the prediction is correct)
False positive, y = −1 and ŷ = +1
False negative, y = +1 and ŷ = −1
Confusion matrix

given data set x1 , . . . , xN , y1 , . . . , yN , and classifier fˆ


count each of the four outcomes

off-diagonal terms are prediction errors


many error rates and accuracy measures are used
error rate (Nfp + Nfn)/N
true positive (or recall) rate, Ntp/Np
false positive (or false alarm) rate, Nfp/Nn
or equivalently, true negative rate (or specificity), Ntn/Nn
precision, Ntp/(Ntp + Nfp)

out-of-sample validation and cross validation


Example

spam filter performance on a test set

error (19 + 32)/1266 = 4.03%


false positive rate is 19/1139 = 1.67%
Least squares classification

fit model f˜ to the encoded (±1) values yi using standard least squares data
fitting
f˜ should be near +1 when y = +1, and near −1 when y = −1
f˜ is a number
use model fˆ(xi ) = sign(f˜(xi ))
(size of f˜(xi ) is related to the “confidence” in the prediction)
Handwritten digits example

MNIST data set of 70,000 images of digits 0, . . . , 9, each 28 × 28


divided into training set (60000) and test set (10000)
every training image (xi ) is a vector of 494 pixel values (ignore pixels
that are zeros in all images)

y = +1 if digit is 0; −1 otherwise
Least squares classifier results

training set results (error rate 1.6%)

test set results (error rate 1.6%)

we can likely achieve 1.6% error rate on unseen images


Distribution of least squares fit

distribution of values of f˜(xi ) over training set


Coefficients in least squares classifier
Skewed decision threshold

use predictor fˆ(x) = sign(f˜(x) − α), i.e.,


fˆ(x) = +1 if f˜(x) ≥ α,   −1 if f˜(x) < α

α is the decision threshold


for positive α, false positive rate is lower but so is true positive rate
for negative α, false positive rate is higher but so is true positive rate
trade off curve of true positive versus false positive rates is called receiver
operating characteristic (ROC)
Example
ROC curve
Multi-class classifiers

we have K > 2 possible labels, with label set {1, . . . , K}


predictor is fˆ : Rn → {1, . . . , K}
for given predictor and data set, confusion matrix is K × K
some off-diagonal entries may be much worse than others
Examples

handwritten digit classification, guess the digit written, from the pixel
values
marketing demographic classification, guess the demographic group, from
purchase history
disease diagnosis, guess diagnosis from among a set of candidates, from
test results, patient features
translation word choice, choose how to translate a word into several
choices, given context features
document topic prediction, guess topic from word count histogram
Least squares multi-class classifier

create a least squares classifier for each label versus the others
take as classifier
fˆ(x) = argmax_{ℓ ∈ {1,...,K}} f˜ℓ(x)

(i.e., choose ℓ with largest value of f˜ℓ (x))


for example, with

f˜1 (x) = −0.7, f˜2 (x) = +0.2, f˜3 (x) = +0.8

we choose fˆ(x) = 3
Handwritten digit classification

confusion matrix, test set

error rate is around 14% (same as for training set)


Adding new features

let’s add 5000 random features (!), max{(Rx)j , 0}


R is 5000 × 494 matrix with entries ±1, chosen randomly
now use least squares classification with 5494 feature vector
results: training set error 1.5%, test set error 2.6%
can do better with a little more thought in generating new features
Results with new features

confusion matrix, test set


Convolution

for a ∈ Rn and b ∈ Rm, the convolution c = a ∗ b is the (n + m − 1)-vector with entries

ck = Σ_{i+j=k+1} ai bj,   k = 1, . . . , n + m − 1

for example with n = 4, m = 3, we have

c1 = a1 b1
c2 = a1 b2 + a2 b1
c3 = a1 b3 + a2 b2 + a3 b1
c4 = a2 b3 + a3 b2 + a4 b1
c5 = a3 b3 + a4 b2
c6 = a4 b3

(1, 0, −1) ∗ (2, 1, −1) = (2, 1, −3, −1, 1)


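
The same example, checked with numpy's convolution routine:

import numpy as np

a = np.array([1, 0, -1])
b = np.array([2, 1, -1])
print(np.convolve(a, b))   # [ 2  1 -3 -1  1], matching the example above
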
Polynomial multiplication

a and b are coefficients of two polynomials

p(x) = a1 + a2 x + · · · + an x^{n−1},   q(x) = b1 + b2 x + · · · + bm x^{m−1}

convolution c = a ∗ b gives the coefficients of the product p(x)q(x)

p(x)q(x) = c1 + c2 x + · · · + c_{n+m−1} x^{n+m−2}

this gives simple proofs of many properties of convolution; for example,


a∗b=b∗a
(a ∗ b) ∗ c = a ∗ (b ∗ c)
a ∗ b = 0 only if a = 0 or b = 0
Toeplitz matrices

can express c = a ∗ b using matrix-vector multiplication as c = T (b)a,


with  
b1 0 0 0
b2 b1 0 0
 
b3 b2 b1 0 
T (b) = 
 0 b3

 b2 b1 
0 0 b3 b2 
0 0 0 b3

T (b) is a Toeplitz matrix (values on diagonals are equal)


Moving average of time series

vector x represents a time series


convolution y = a ∗ x with a = (1/3, 1/3, 1/3) is 3-period moving average

yk = (1/3)(xk + xk−1 + xk−2),   k = 1, 2, . . . , n + 2

(with xk interpreted as zero for k < 1 and k > n)


Matrix inverse

if A is square and invertible, then its inverse is the matrix A−1 such that

AA−1 = A−1 A = I

some properties
(AB)−1 = B −1 A−1
(AT )−1 = (A−1 )T , sometimes denoted as A−T
Left inverse

a matrix X that satisfies XA = I is called a left inverse of A


if a left inverse exists we say that A is left-invertible
example: the matrix

A = [ −3  −4 ]
    [  4   6 ]
    [  1   1 ]

has two different left inverses:

B = (1/9) [ −11  −10   16 ]       C = (1/2) [ 0  −1   6 ]
          [   7    8  −11 ]                 [ 0   1  −4 ]
Left inverse and column independence

if A has a left inverse C, then the columns of A are linearly independent


to see this: if Ax = 0 and CA = I, then

0 = C0 = C(Ax) = (CA)x = Ix = x

converse is also true


a matrix is left-invertible if and only if its columns are linearly independent

matrix generalization of: a number is invertible if and only if it is nonzero


so left-invertible matrices are tall or square
Right inverse

similarly, right inverses can be defined for wide or square matrices


a matrix A is right-invertible if and only if AT is left-invertible
a matrix is right-invertible if and only if its rows are linearly independent
Linear equations

Ax = b

for relatively small matrices, the methods of choice are factor-solve


methods
a matrix factorization is first computed, independently of the rhs b
and then used to compute x

we write the solution as x = A−1 b, but we almost never generate an


inverse
np.linalg.solve(A, b)
Linear equations

triangular decomposition for general square matrices A = LU


  
L = [ 1                        ]
    [ l21   1                  ]
    [ l31   l32   1            ]
    [  ⋮     ⋮          ⋱      ]
    [ ln1   ln2   ln3  · · ·  1 ]

U = [ u11  u12  u13  · · ·  u1n ]
    [      u22  u23  · · ·  u2n ]
    [           u33  · · ·  u3n ]
    [                  ⋱    ⋮   ]
    [                       unn ]

(2/3)n^3 flops
for SPD matrices, A = LLT (or RT R)
the Cholesky decomposition
(1/3)n^3 flops
available as np.linalg.cholesky(A)
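
A small sketch of the factor-solve routines mentioned above, on random matrices:

import numpy as np

np.random.seed(3)
A = np.random.randn(4, 4)
b = np.random.randn(4)
x = np.linalg.solve(A, b)            # factor-solve; never forms A^{-1} explicitly
print(np.allclose(A @ x, b))         # True

B = A @ A.T + 4 * np.eye(4)          # a symmetric positive definite matrix
L = np.linalg.cholesky(B)            # lower triangular factor with B = L L^T
print(np.allclose(L @ L.T, B))       # True
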
Forward/backward substitutions for triangular systems

(LU)x = b solved in two steps


Ly = b
Ux = y
linear systems with triangular coefficient matrices
n^2 flops
Multiple right hand sides

AX = B

let’s solve Axi = bi , i = 1, . . . , k, with A invertible


carry out LU factorization once ((2/3)n^3 flops)
for i = 1, . . . , k, solve LU x = bi via forward/back substitutions (2kn^2 flops)
total is (2/3)n^3 + 2kn^2
if k is small compared to n, same cost as solving one set of equations!
Singular value decomposition (SVD)

For tall matrices,

A_{m×n} = U_{m×n} Σ_{n×n} V_{n×n}^T

U and V have orthonormal columns


Σ is a diagonal matrix with positive decreasing values, called singular
values
truncated SVD allows us to generate low rank approximations of data
matrices

Âk = U_{1:k} Σ_{1:k,1:k} V_{1:k}^T = Σ_{i=1}^k σi Ui Vi^T

∥A − Âk∥_F ≤ ( Σ_{i=k+1}^n σi^2 )^{1/2}

∥A − Âk∥_2 ≤ σ_{k+1}
best low rank approximation
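
A sketch of the truncated SVD as a low rank approximation, on a random matrix:

import numpy as np

np.random.seed(4)
A = np.random.randn(8, 5)
U, s, Vt = np.linalg.svd(A, full_matrices=False)    # thin SVD: A = U diag(s) Vt

k = 2
Ak = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]          # rank-k approximation
print(np.linalg.norm(A - Ak, 2), s[k])              # spectral error and sigma_{k+1}
print(np.linalg.norm(A - Ak, 'fro'),
      np.sqrt(np.sum(s[k:]**2)))                    # Frobenius error and its bound
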
Singular value decomposition (SVD)
Eigenvalue decomposition (EVD)

For real symmetric matrices,

A = QΛQ^T = Σ_{i=1}^n λi qi qi^T

Q is orthonormal
Λ is a diagonal matrix of eigenvalues
positive definite (SPD) matrices have all positive eigenvalues,
xT Ax > 0 ∀x ̸= 0

A−1 = QΛ−1 QT
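
A sketch of the eigenvalue decomposition of a random symmetric matrix with numpy (eigh returns eigenvalues in ascending order; the inverse check assumes A is invertible, which holds with probability one here):

import numpy as np

np.random.seed(5)
B = np.random.randn(4, 4)
A = B + B.T                          # a real symmetric matrix

lam, Q = np.linalg.eigh(A)           # eigenvalues and orthonormal eigenvectors (columns of Q)
print(np.allclose(Q @ np.diag(lam) @ Q.T, A))                      # A = Q Lambda Q^T
print(np.allclose(Q @ np.diag(1 / lam) @ Q.T, np.linalg.inv(A)))   # A^{-1} = Q Lambda^{-1} Q^T
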
