CH 2
special vectors: 0, 1, ei
Examples
a short document
aT b = a1 b1 + a2 b2 + · · · + an bn
ŷ = wT x + b = θT x̃
∥a − c∥ = ∥(a − b) + (b − c)∥ ≤ ∥a − b∥ + ∥b − c∥
∥x − zj ∥ ≤ ∥x − zi ∥, i = 1, . . . , m
avg(x) = 1T x/n
de-meaned vector is x̃ = x − avg(x)1 (so avg(x̃) = 0)
standard deviation of x
std(x) = rms(x̃) = ∥x − (1T x/n)1∥ / √n
standardization (µ = 0, σ = 1, z-scores)
z = (1/std(x)) (x − avg(x)1)
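a minimal numpy sketch (not from the slides) of de-meaning and standardization; the vector x is made up for illustration:

import numpy as np

x = np.array([1.0, 4.0, 2.0, 7.0])        # example vector (made up)
x_demeaned = x - x.mean()                  # x̃ = x - avg(x)1, so avg(x̃) = 0
std = np.sqrt(np.mean(x_demeaned**2))      # std(x) = rms(x̃)
z = x_demeaned / std                       # standardized (z-score) vector
print(z.mean(), np.sqrt(np.mean(z**2)))    # approximately 0 and 1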
Chebyshev inequality for standard deviation
rough idea: most entries of x are not too far from the mean
by the Chebyshev inequality, the fraction of entries of x with |xi − avg(x)| ≥ α std(x) is no more than 1/α2
with α = ∥a∥ and β = ∥b∥:
0 ≤ ∥βa − αb∥2
= ∥βa∥2 − 2(βa)T (αb) + ∥αb∥2
= β2 ∥a∥2 − 2βα(aT b) + α2 ∥b∥2
= 2∥a∥2 ∥b∥2 − 2∥a∥∥b∥(aT b)
coincides with ordinary angle between vectors in 2-D and 3-D; measures
distance along sphere
Correlation coefficient
ρ = ãT b̃ / (∥ã∥∥b̃∥)
ρ = cos(∠(ã, b̃)), the cosine of the angle between the de-meaned vectors
ρ = 0: uncorrelated
ρ > 0.8 (or so): highly correlated
ρ < −0.8 (or so): highly anti-correlated
very roughly: highly correlated means ai and bi are typically both above
(below) their means together
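a small numpy sketch (assumed, not from the slides) that computes ρ from the de-meaned vectors and checks the cosine-of-angle interpretation; the vectors are made up:

import numpy as np

a = np.array([1.0, 3.0, 2.0, 5.0])         # example vectors (made up)
b = np.array([2.0, 5.0, 3.0, 9.0])
at = a - a.mean()                          # de-meaned ã
bt = b - b.mean()                          # de-meaned b̃
rho = at @ bt / (np.linalg.norm(at) * np.linalg.norm(bt))
angle = np.arccos(rho)                     # ∠(ã, b̃); ρ is the cosine of this angle
print(rho, np.cos(angle))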
Examples
given N vectors, x1 , · · · , xN
goal: partition (cluster) into k groups
want vectors in the same group to be close to each other
Example settings
J = (1/N) ∑i=1,…,N ∥xi − zci ∥2
given x1 , . . . , xN and z1 , . . . , zk
repeat
update partition: assign i to Gj , j = argminj ′ ∥xi − zj ′ ∥2
update centroids: zj = (1/|Gj |) ∑i∈Gj xi
until z1 , · · · , zk stop changing
import numpy as np
import numpy.linalg as la

def kmeans(x, k, maxiters=30, tol=1e-4):
    N, d = x.shape
    distances = np.zeros(N)                            # store dists to nearest repr
    initial = np.random.choice(N, k, replace=False)    # initial grp repr
    reps = x[initial, :]                               # store representatives
    assignment = np.zeros(N, dtype=int)                # asst of vectors to grps
    Jprev = np.inf
    for it in range(maxiters):
        # for each x[i], find distance to nearest repr and group index
        for i in range(N):
            ci = np.argmin([la.norm(x[i] - reps[j]) for j in range(k)])
            assignment[i] = ci
            distances[i] = la.norm(x[i] - reps[ci])
        # cluster j representative is average of points in cluster j
        for j in range(k):
            group = [i for i in range(N) if assignment[i] == j]
            reps[j] = np.sum(x[group], axis=0) / len(group)
        # compute clustering objective
        J = la.norm(distances)**2 / N
        # convergence: stop when J has (nearly) stopped decreasing
        if np.abs(J - Jprev) < tol * J:
            break
        Jprev = J
    return assignment, reps
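a possible usage sketch for the function above, on synthetic data; the cluster centers and sizes are made up for illustration:

import numpy as np

np.random.seed(0)
# three synthetic clusters in R^2
x = np.vstack([np.random.randn(100, 2) + c for c in ([0, 0], [5, 5], [0, 5])])
assignment, reps = kmeans(x, k=3)
print(reps)                                # learned representatives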
Convergence of k-means algorithm
linearly dependent: β1 a1 + · · · + βk ak = 0 holds for some β1 , . . . , βk not all zero
linearly independent: β1 a1 + · · · + βk ak = 0 holds only when β1 = · · · = βk = 0
x = γ1 a1 + · · · + γk ak
b = b1 e1 + · · · + bn en
Orthonormal vectors
given vectors a1 , . . . , ak
for i = 1, . . . , k
1. orthogonalization: q̃i = ai − (q1T ai )q1 − · · · − (qi−1T ai )qi−1
2. test for linear dependence: if q̃i = 0, quit
3. normalization: qi = q̃i /∥q̃i ∥
if G–S does not stop early (in step 2), a1 , . . . , ak are linearly independent
if G–S stops early in iteration i = j, then aj is a linear combination of
a1 , . . . , aj−1 (so a1 , . . . , ak are linearly dependent)
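a minimal numpy sketch of the procedure above; the function name gram_schmidt and the tolerance used in place of the exact q̃i = 0 test are my own choices:

import numpy as np

def gram_schmidt(a, tol=1e-10):
    """Orthonormalize the list of vectors a; stop early on linear dependence."""
    q = []
    for i, ai in enumerate(a):
        qt = ai - sum((qj @ ai) * qj for qj in q)   # orthogonalization
        if np.linalg.norm(qt) < tol:                # test for linear dependence
            print(f"a_{i} is a linear combination of the earlier vectors")
            return q
        q.append(qt / np.linalg.norm(qt))           # normalization
    return q

q = gram_schmidt([np.array([1.0, 1.0, 0.0]), np.array([1.0, 0.0, 1.0])])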
Example
Analysis of Gram–Schmidt
q1T ai , . . . , qi−1T ai
represent operators f : Rn → Rm
map vectors in Rn to vectors in Rm
via matrix-vector multiplication operation
Operations
transpose
addition, subtraction, and scalar multiplication
matrix-vector multiplication u = Av
row interpretation: ui = biT v, where biT are the rows of A
example: A1 is vector of row sums
column interpretation, u = Av = v1 a1 + v2 a2 + · · · + vn an
linear combination of columns of A
example: Aej = aj
columns of A are linearly independent if Av = 0 implies v = 0
arithmetic complexity:
2mn flops
for sparse matrices, 2 nnz(A) flops, twice the number of non-zeros in A
Selectors
f (x) = f (x1 e1 + x2 e2 + · · · + xn en )
= x1 f (e1 ) + x2 f (e2 ) + · · · + xn f (en )
= Ax
with A = [ f (e1 ) f (e2 ) · · · f (en ) ]
Examples
f (x) = Ax + b
AB = a1 b1T + · · · + ap bpT
Matrix multiplication
Column interpretation
denote columns of B by bi
B = [ b1 b2 · · · bn ]
then we have
AB = A [ b1 b2 · · · bn ]
= [ Ab1 Ab2 · · · Abn ]
if A is square and invertible, then its inverse is the matrix A−1 such that
AA−1 = A−1 A = I
some properties
(AB)−1 = B −1 A−1
(AT )−1 = (A−1 )T , sometimes denoted as A−T
choose θ to minimize the mean square loss (1/N)∥Aθ − y∥2
θ̂ = argminθ ∥Aθ − y∥2
θ̂ = (AT A)−1 AT y
define f (θ) = ∥Aθ − y∥2 = ∑i=1,…,m (∑j=1,…,n Aij θj − yi )2
the kth component of the gradient is
(∇f (θ))k = ∑i=1,…,m 2(AT )ki (Aθ − y)i = (2AT (Aθ − y))k
Derivation (ctd)
θ̂ = (AT A)−1 AT y
= (RT QT QR)−1 RT QT y
= R−1 R−T RT QT y
= R−1 QT y
therefore, to compute θ̂:
compute the QR factorization A = QR
form QT y
compute θ̂ = R−1 (QT y) via back substitution
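a numpy sketch of this recipe (not from the slides); the data A and y are made up, and scipy's triangular solver stands in for back substitution:

import numpy as np
from scipy.linalg import solve_triangular      # back substitution

np.random.seed(0)
A = np.random.randn(20, 3)                     # made-up tall data matrix
y = np.random.randn(20)
Q, R = np.linalg.qr(A)                         # A = QR, Q has orthonormal columns
theta = solve_triangular(R, Q.T @ y)           # θ̂ = R⁻¹(Qᵀy) via back substitution
print(np.allclose(theta, np.linalg.lstsq(A, y, rcond=None)[0]))   # sanity check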
Complexity of least squares solution
fi (x) = xi−1 , i = 1, . . . , p
model is a polynomial of degree less than p
fˆ(x) = θ1 + θ2 x + · · · + θp xp−1
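a short sketch (my own illustration) of fitting such a polynomial by least squares with numpy; the data and the choice p = 4 are made up:

import numpy as np

np.random.seed(0)
x = np.random.uniform(-1, 1, 50)                   # made-up data
y = 1 + 2*x - 3*x**2 + 0.1*np.random.randn(50)
p = 4                                              # model has p coefficients
A = np.vander(x, p, increasing=True)               # columns are x^0, x^1, ..., x^(p-1)
theta = np.linalg.lstsq(A, y, rcond=None)[0]       # θ̂ minimizing ∥Aθ − y∥²
print(theta)                                       # roughly (1, 2, -3, 0)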
basic idea:
goal of model is not to predict outcome for the given data
instead it is to predict the outcome on new, unseen data
f1 (x), . . . , fp (x)
ŷ = θ1 f1 (x) + · · · + θp fp (x)
(xj − bj )/aj
log(1 + xj )
clipping values
Creating new features
expanding categoricals
features that take on only a few values, e.g., Booleans, Likert scale, day of
week
one-hot encoding: expand a categorical feature with l values into l − 1
features that encode whether the feature has one of the (non-default) values
example: bedrooms in the house price prediction problem
but beware: adding new features can easily lead to over-fitting. Keep the
model simple. Validate the model.
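a minimal sketch of one-hot encoding with plain numpy; the day-of-week feature and the choice of value 0 as the default are my own illustration:

import numpy as np

days = np.array([0, 3, 6, 1, 0])           # categorical feature with 7 values (0 = default)
# expand into 7 - 1 = 6 new features; the default value maps to all zeros
onehot = np.zeros((len(days), 6))
for i, d in enumerate(days):
    if d > 0:
        onehot[i, d - 1] = 1.0
print(onehot)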
Example
fit model f˜ to encoded (±1)yi values using standard least squares data
fitting
f˜ should be near +1 when y = +1, and near −1 when y = −1
f˜(x) is a number
use model fˆ(xi ) = sign(f˜(xi ))
(size of f˜(xi ) is related to the “confidence” in the prediction)
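a sketch of this least squares classifier on made-up data; the feature matrix X and labels y are synthetic:

import numpy as np

np.random.seed(0)
X = np.random.randn(200, 5)                                   # made-up features
y = np.sign(X @ np.array([1., -2., 0., 0.5, 1.]) + 0.3*np.random.randn(200))  # ±1 labels
A = np.column_stack([np.ones(200), X])                        # prepend constant feature
theta = np.linalg.lstsq(A, y, rcond=None)[0]                  # fit f̃ by least squares
yhat = np.sign(A @ theta)                                     # f̂(x) = sign(f̃(x))
print("error rate:", np.mean(yhat != y))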
Handwritten digits example
y = +1 if digit is 0; −1 otherwise
Least squares classifier results
handwritten digit classification, guess the digit written, from the pixel
values
marketing demographic classification, guess the demographic group, from
purchase history
disease diagnosis, guess diagnosis from among a set of candidates, from
test results, patient features
translation word choice, choose how to translate a word from among several
choices, given context features
document topic prediction, guess topic from word count histogram
Least squares multi-class classifier
create a least squares classifier for each label versus the others
take as classifier
fˆ(x) = argmaxℓ∈{1,...,K} f˜ℓ (x)
example: if f˜3 (x) is the largest of f˜1 (x), . . . , f˜K (x), we choose fˆ(x) = 3
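a sketch of the one-versus-others construction; the data below is purely synthetic (random labels, so the error rate is near chance) and only demonstrates the mechanics:

import numpy as np

np.random.seed(0)
K, n, N = 3, 4, 300
X = np.random.randn(N, n)                          # made-up features
y = np.random.randint(K, size=N)                   # made-up labels in {0, ..., K-1}
A = np.column_stack([np.ones(N), X])
Theta = np.zeros((A.shape[1], K))
for k in range(K):
    b = np.where(y == k, 1.0, -1.0)                # +1 for class k, -1 for the others
    Theta[:, k] = np.linalg.lstsq(A, b, rcond=None)[0]
yhat = np.argmax(A @ Theta, axis=1)                # f̂(x) = argmax_ℓ f̃_ℓ(x)
print("error rate:", np.mean(yhat != y))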
Handwritten digit classification
c1 = a1 b1
c2 = a1 b2 + a2 b1
c3 = a1 b3 + a2 b2 + a3 b1
c4 = a2 b3 + a3 b2 + a4 b1
c5 = a3 b3 + a4 b2
c6 = a4 b3
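these are the coefficients of the convolution c = a * b for a of length 4 and b of length 3; a quick numpy check, with made-up arrays:

import numpy as np

a = np.array([1., 2., 3., 4.])
b = np.array([5., 6., 7.])
c = np.convolve(a, b)            # length 4 + 3 - 1 = 6
# c[k] is the sum of a[i]*b[j] over i + j = k, matching the formulas above
print(c)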
if A is square and invertible, then its inverse is the matrix A−1 such that
AA−1 = A−1 A = I
some properties
(AB)−1 = B −1 A−1
(AT )−1 = (A−1 )T , sometimes denoted as A−T
Left inverse
if CA = I and Ax = 0, then 0 = C0 = C(Ax) = (CA)x = Ix = x, so the columns of A are linearly independent
Ax = b
(2/3)n3 flops
for SPD matrices, A = LLT (or RT R)
the Cholesky decomposition
(2/3)n3 flops
available as np.linalg.cholesky(A)
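a sketch of solving an SPD system with the Cholesky factor; scipy's triangular solver is assumed for the forward/backward substitutions, and the matrix is made up:

import numpy as np
from scipy.linalg import solve_triangular

np.random.seed(0)
B = np.random.randn(5, 5)
A = B @ B.T + 5*np.eye(5)                 # made-up SPD matrix
b = np.random.randn(5)
L = np.linalg.cholesky(A)                 # A = L Lᵀ, L lower triangular
w = solve_triangular(L, b, lower=True)    # forward substitution: L w = b
x = solve_triangular(L.T, w, lower=False) # backward substitution: Lᵀ x = w
print(np.allclose(A @ x, b))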
Forward/backward substitutions for triangular systems
AX = B
∥A − Âk ∥F ≤ (∑i=k+1,…,n σi2 )1/2
∥A − Âk ∥2 ≤ σk+1
best low rank approximation
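a numpy sketch of the truncated-SVD (best rank-k) approximation and the two error bounds above; the matrix and the choice k = 2 are made up:

import numpy as np

np.random.seed(0)
A = np.random.randn(8, 6)
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
Ak = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]    # best rank-k approximation
print(np.linalg.norm(A - Ak, 'fro'), np.sqrt(np.sum(s[k:]**2)))  # Frobenius bound (tight)
print(np.linalg.norm(A - Ak, 2), s[k])                           # spectral bound (tight)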
Singular value decomposition (SVD)
Eigenvalue decomposition (EVD)
A = QΛQT = ∑i=1,…,n λi qi qiT
Q is orthogonal (its columns are orthonormal eigenvectors)
Λ is a diagonal matrix of eigenvalues
symmetric positive definite (SPD) matrices, i.e., xT Ax > 0 for all x ̸= 0, have all positive eigenvalues
A−1 = QΛ−1 QT
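a short check of the eigendecomposition and the inverse formula above, using numpy's symmetric eigensolver; the matrix is made up:

import numpy as np

np.random.seed(0)
B = np.random.randn(4, 4)
A = B @ B.T + np.eye(4)              # made-up SPD matrix
lam, Q = np.linalg.eigh(A)           # A = Q Λ Qᵀ, eigenvalues in ascending order
print(np.all(lam > 0))               # SPD: all eigenvalues positive
Ainv = Q @ np.diag(1.0/lam) @ Q.T    # A⁻¹ = Q Λ⁻¹ Qᵀ
print(np.allclose(Ainv, np.linalg.inv(A)))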