MLF Notes - Rishab Dec 24
1 Week 1
1. Supervised Learning: Regression
• Find model f such that f(xᵢ) ≈ yᵢ
• Training Data: (x₁, y₁), (x₂, y₂), ..., (xₙ, yₙ)
• Loss = Σᵢ(f(xᵢ) − yᵢ)² / n
• f(x) = wᵀx + b
3. Validation Data: Choosing the right model among a collection of candidate models is done using validation data.
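A minimal numpy sketch of the pieces above (the toy data, the weight vector w, and the bias b are made-up values for illustration): it evaluates the squared-error loss of a linear model f(x) = wᵀx + b on training data.

```python
import numpy as np

# Toy training data: n examples with d features (values are illustrative).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))             # x_1, ..., x_n
y = X @ np.array([2.0, -1.0, 0.5]) + 3.0  # y_1, ..., y_n

# A candidate linear model f(x) = w^T x + b.
w = np.array([1.5, -0.5, 0.0])
b = 1.0

def loss(X, y, w, b):
    """Loss = sum((f(x_i) - y_i)^2) / n."""
    return np.mean((X @ w + b - y) ** 2)

print(loss(X, y, w, b))
```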
2 Week 2
1. Continuity & Differentiability
• f : R → R is continuous at x∗ if lim_{x→x∗} f(x) = f(x∗)
• Differentiable at x∗ if lim_{x→x∗} (f(x) − f(x∗)) / (x − x∗) = f′(x∗) exists.
• If f is NOT continuous =⇒ NOT differentiable
2. Linear Approximation
• If f is differentiable
• f (x) ≈ f (x∗ ) + f ′ (x∗ )(x − x∗ )
• f (x) ≈ Lx∗ [f ](x)
• Line through points u and u′: {x | x = u + α(u − u′), α ∈ R}
5. Hyper Planes
• Hyper Plane normal to vector w with value b: {x | wᵀx = b}, where x, w ∈ Rᵈ and b ∈ R
6. Partial Derivatives & Gradients
• f : Rᵈ → R
• ∂f/∂x(v) = [∂f/∂x₁(v), ∂f/∂x₂(v), ..., ∂f/∂x_d(v)]
• Gradient ∇f(v) = [∂f/∂x(v)]ᵀ
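A quick numerical check of the gradient and the linear approximation above; the function g and the point v below are arbitrary choices for illustration.

```python
import numpy as np

def g(x):
    # Example f : R^d -> R (chosen arbitrarily).
    return x[0] ** 2 + 3.0 * x[0] * x[1]

def numerical_gradient(f, v, h=1e-6):
    """Approximate [df/dx_1(v), ..., df/dx_d(v)] by central differences."""
    grad = np.zeros_like(v)
    for i in range(v.size):
        e = np.zeros_like(v)
        e[i] = h
        grad[i] = (f(v + e) - f(v - e)) / (2.0 * h)
    return grad

v = np.array([1.0, 2.0])
grad = numerical_gradient(g, v)   # analytically [2x + 3y, 3x] = [8, 3]

# Linear approximation: f(x) ≈ f(v) + ∇f(v)·(x − v) near v
x = v + np.array([0.01, -0.02])
print(g(x), g(v) + grad @ (x - v))
```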
3 Week 3
1. Four Fundamental Sub Spaces
• Column Space C(A)
– C(A) = span(u₁, u₂, ..., uₙ) = all linear combinations of the columns of A
– If Ax = b has a solution, then b ∈ C(A)
– Rank = number of pivot columns = dim(C(A))
• Null Space N (A)
– {x | Ax = 0}
– If A is invertible then N (A) only contains zero, and Ax = b has a unique solution.
– Nullity = number of free variables = dim(N (A))
– If A has n columns, then rank + nullity = n
– Can use Gaussian Elimination to solve for N (A)
• Row Space R(A)
– Column Space of AT
– Column Rank dim(C(A)) = Row Rank dim(R(A))
– R(A) ⊥ N (A)
• Left Null Space N (AT )
– C(A) ⊥ N (AT )
2. Orthogonal and Vector Sub Spaces
• Orthogonal Vectors, x ⊥ y if x · y = xT y = 0
• Orthonormal Vectors, u ⊥ v and ‖u‖ = ‖v‖ = 1
3. Projections
• Projection onto a line
– p = x̂a
– e = b − p = b − x̂a
– e ⊥ a =⇒ x̂ = aᵀb / aᵀa
– Projection matrix P = aaᵀ / aᵀa
– p = Pb
– P is symmetric, P² = P, rank(P) = 1
• Projection onto a subspace
– Projection of b onto C(A), Ax = b
– p = Ax̂, e = b − Ax̂
– e ⊥ every vector in C(A) and N(Aᵀ) ⊥ C(A) =⇒ e ∈ N(Aᵀ)
– Projection Matrix P = A(AT A)−1 AT , p = P b
4. Least Squares
• Suppose we have a vector b which leads to an inconsistent system Ax ̸= b
• The next best thing is to minimize the squared error E² = ‖Ax − b‖²
• ∂E²/∂x = 0 =⇒ (AᵀA)x = Aᵀb
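A small numpy sketch of least squares and projection onto C(A) as described above; the matrix A and vector b are arbitrary examples.

```python
import numpy as np

# An inconsistent system Ax = b (A and b chosen for illustration).
A = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])
b = np.array([6.0, 0.0, 0.0])

# Normal equations: (A^T A) x_hat = A^T b
x_hat = np.linalg.solve(A.T @ A, A.T @ b)

# Projection matrix P = A (A^T A)^{-1} A^T and projection p = P b
P = A @ np.linalg.inv(A.T @ A) @ A.T
p = P @ b

e = b - p                         # error vector
print(np.allclose(A.T @ e, 0))    # e is orthogonal to C(A), i.e. e ∈ N(A^T)
print(np.allclose(p, A @ x_hat))  # p = A x_hat
```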
4 Week 4
1. Linear and Polynomial Regression
• Minimize Loss L(θ) = Σᵢ(xᵢᵀθ − yᵢ)² / 2
• Use the least squares method: (AᵀA)θ = AᵀY
• Polynomial Regression
– Transformed Features: ŷ(x) = θ₀ + θ₁x + θ₂x² + ... + θₘxᵐ = Σⱼ θⱼϕⱼ(x), ϕⱼ(x) = xʲ
– ŷ(x) = θᵀϕ(x), (AᵀA)θ = AᵀY
– Then Proceed as Linear Regression
• Regularized Loss
– L̄(θ) = Σᵢ(xᵢᵀθ − yᵢ)²/2 + λ‖θ‖², Regularized Term = λ‖θ‖²
– (AT A + λI)θreg = AT Y
– Overfitting → Too small λ
– Underfitting → Too large λ
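A sketch of polynomial regression with the regularized normal equations above; the toy data, the degree m, and λ are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 1, 30)
y = np.sin(2 * np.pi * x) + 0.1 * rng.normal(size=x.size)  # toy targets

m = 5                                     # polynomial degree (illustrative)
A = np.vander(x, m + 1, increasing=True)  # features phi_j(x) = x^j

lam = 1e-3                                # regularization strength (illustrative)
# Regularized normal equations: (A^T A + lambda I) theta = A^T y
theta = np.linalg.solve(A.T @ A + lam * np.eye(m + 1), A.T @ y)

y_hat = A @ theta
print(np.mean((y_hat - y) ** 2))          # training loss
```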
2. Eigenvalues and Eigenvectors
• Eigenvalue equation Ax = λx
• du/dt = Au can be solved with solutions of the form u(t) = e^{λt} x if Ax = λx
• (A − λI)x = 0
Characteristic polynomial |A − λI| = 0
Trace of A = Σλ = Sum of diagonal elements of A
|A| = Determinant of A = Πλ
3. Diagonalization of a Matrix
• A matrix A is diagonalizable if there exists an invertible matrix S such that S⁻¹AS = Λ, where Λ is a diagonal matrix
• S = [x₁ x₂ ... xₙ], where x₁, x₂, ..., xₙ are the eigenvectors
• S⁻¹AᵏS = Λᵏ, k ≥ 1
• If A is symmetric, A = QΛQᵀ, where Q = [q₁ q₂ ... qₙ] and qᵢ = xᵢ/‖xᵢ‖ (normalized eigenvectors)
4. Fibonacci Sequence: Fₖ ≈ (1/√5)·((1 + √5)/2)ᵏ
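As a check of Aᵏ = SΛᵏS⁻¹, the sketch below recovers Fibonacci numbers from the eigendecomposition of the 2×2 step matrix [[1, 1], [1, 0]] (a standard construction, used here only as an illustration of diagonalization).

```python
import numpy as np

A = np.array([[1.0, 1.0],
              [1.0, 0.0]])       # [F_{k+1}, F_k] = A [F_k, F_{k-1}]

lam, S = np.linalg.eig(A)        # columns of S are eigenvectors
k = 20
# A^k = S diag(lam^k) S^{-1}
Ak = S @ np.diag(lam ** k) @ np.linalg.inv(S)
F_k = Ak[0, 1]                   # F_20 sits in the (0, 1) entry of A^k

phi = (1 + np.sqrt(5)) / 2
print(round(F_k), round(phi ** k / np.sqrt(5)))  # both give 6765
```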
5 Week 5
1. Complex Matrices
• A* = Conjugate Transpose of A = Āᵀ
2. Hermitian Matrix
• A is unitarily diagonalizable if A = UΛU*
• Any n × n matrix A is similar to an n × n upper triangular matrix, A = U T U ∗
• If U₁ = [w₁ w₂ ...] is the matrix, then take w₁ = X₁ (the first eigenvector), then w₂ = X₂ − ((w₁·X₂)/‖w₁‖²)·w₁ (Gram–Schmidt)
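A minimal real-valued Gram–Schmidt sketch matching the w₂ = X₂ − ((w₁·X₂)/‖w₁‖²)·w₁ step above; the example columns are arbitrary.

```python
import numpy as np

def gram_schmidt(X):
    """Orthonormalize the columns of X (assumed linearly independent)."""
    W = np.zeros_like(X, dtype=float)
    for j in range(X.shape[1]):
        w = X[:, j].copy()
        for i in range(j):
            # Subtract the projection onto each previous (unit) direction.
            w -= (W[:, i] @ X[:, j]) * W[:, i]
        W[:, j] = w / np.linalg.norm(w)
    return W

X = np.array([[1.0, 1.0],
              [0.0, 1.0],
              [1.0, 2.0]])
U = gram_schmidt(X)
print(np.allclose(U.T @ U, np.eye(2)))   # columns are orthonormal
```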
6 Week 6
1. Singular Value Decomposition
• Let A be a real symmetric matrix
Then all eigenvalues of A are real and A is orthogonally diagonalizable
A = QΛQᵀ, QᵀQ = I
• Any real m × n matrix A can be decomposed to SVD form
A(m × n) = Q₁(m × m) Σ(m × n) Q₂ᵀ(n × n), Q₁ᵀQ₁ = I, Q₂ᵀQ₂ = I
Q₁ and Q₂ are orthogonal
• Σ = [ D 0 ; 0 0 ] in block form, where D = diag(σ₁, σ₂, ..., σᵣ)
• σᵢ are called singular values and σᵢ = √λᵢ, where λᵢ are eigenvalues of AᵀA and xᵢ are the corresponding eigenvectors
• Let yᵢ = Axᵢ/σᵢ
• Q₁ = [y₁ y₂ ... yₘ], where yᵢ are eigenvectors of AAᵀ = Q₁ΣΣᵀQ₁ᵀ, and
Q₂ = [x₁ x₂ ... xₙ], where xᵢ are eigenvectors of AᵀA = Q₂ΣᵀΣQ₂ᵀ
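A numpy check of the SVD relations above (A below is an arbitrary 2×3 example; numpy's svd returns A = U·diag(σ)·Vᵀ, which plays the role of Q₁ΣQ₂ᵀ).

```python
import numpy as np

A = np.array([[3.0, 2.0, 2.0],
              [2.0, 3.0, -2.0]])        # arbitrary 2x3 example

U, s, Vt = np.linalg.svd(A)             # A = U diag(s) Vt, s = singular values

# Singular values are the square roots of the eigenvalues of A A^T (and A^T A).
eig_AAt = np.sort(np.linalg.eigvalsh(A @ A.T))[::-1]
print(np.allclose(s, np.sqrt(eig_AAt)))

# Rebuild A from the factors: columns of U are eigenvectors of A A^T,
# columns of V are eigenvectors of A^T A.
Sigma = np.zeros(A.shape)
Sigma[:len(s), :len(s)] = np.diag(s)
print(np.allclose(A, U @ Sigma @ Vt))
```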
2. Positive Definite
• A function f that vanishes at (0, 0) and is strictly positive at other points
• For f(x, y) = ax² + 2bxy + cy² to be positive definite:
a, c > 0 and ac > b²
• If ac = b², then f(x, y) is positive semi-definite (a > 0) or negative semi-definite (a < 0)
• If ac < b², then (0, 0) is a saddle point
• f(x, y) = vᵀAv, where v = [x, y]ᵀ and A = [ a b ; b c ]
|A| < 0 =⇒ saddle point; the eigenvalues of A are positive if f(x, y) is positive definite
|A| = 0 =⇒ semi-definite
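A quick check of the 2×2 tests above: compare the sign/determinant conditions with the eigenvalues of A = [ a b ; b c ] (the values of a, b, c are illustrative).

```python
import numpy as np

a, b, c = 2.0, 1.0, 3.0                  # f(x, y) = ax^2 + 2bxy + cy^2
A = np.array([[a, b],
              [b, c]])

eigvals = np.linalg.eigvalsh(A)
print("positive definite:", a > 0 and a * c > b ** 2)   # test from the notes
print("eigenvalues:", eigvals)                          # both positive here
print("det:", np.linalg.det(A))                         # ac - b^2 > 0
```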
7 Week 7
1. Principal Component Analysis
• Start with as many features as you can collect, and then find a good subset of features. Project the data onto a lower-dimensional subspace such that the reconstruction error is minimized and the variance of the projected data is maximized.
• Actual: xᵢ = Σ_{j=1}^{d} (xᵢᵀuⱼ)uⱼ, Projected: x̃ᵢ = Σ_{j=1}^{m} zᵢⱼuⱼ + Σ_{j=m+1}^{d} βⱼuⱼ
• Loss function J = (1/n) Σ_{i=1}^{n} ‖xᵢ − x̃ᵢ‖²
Differentiating and setting to 0 we get zᵢⱼ = xᵢᵀuⱼ and βⱼ = x̄ᵀuⱼ
• So for a given m-dimensional subspace spanned by B = {u₁, u₂, ..., uₘ}, the projected data is
x̃ᵢ = Σ_{j=1}^{m} (xᵢᵀuⱼ)uⱼ + Σ_{j=m+1}^{d} (x̄ᵀuⱼ)uⱼ
Loss J* = Σ_{j=m+1}^{d} uⱼᵀCuⱼ, where C = (1/n) Σ_{i=1}^{n} (xᵢ − x̄)(xᵢ − x̄)ᵀ is the d × d covariance matrix and {u₁, u₂, ..., u_d} are eigenvectors of C.
For maximizing variance, the maximizer is the eigenvector of C corresponding to the maximum eigenvalue; the maximum variance is also equal to this eigenvalue.
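A compact PCA sketch following the steps above (toy data; m is the number of retained directions, chosen for illustration). The reconstruction error should match the sum of the dropped eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 5))            # n x d toy data matrix
x_bar = X.mean(axis=0)

# Covariance C = (1/n) sum (x_i - x_bar)(x_i - x_bar)^T  (a d x d matrix)
C = (X - x_bar).T @ (X - x_bar) / X.shape[0]

eigvals, eigvecs = np.linalg.eigh(C)     # ascending order
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

m = 2                                    # keep the top-m directions
U = eigvecs[:, :m]                       # u_1, ..., u_m
Z = (X - x_bar) @ U                      # coefficients along the kept directions
X_tilde = Z @ U.T + x_bar                # projected / reconstructed data

print(np.mean(np.sum((X - X_tilde) ** 2, axis=1)))  # loss J*
print(eigvals[m:].sum())                             # sum of dropped eigenvalues
```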
8 Week 8
1. Introduction to Optimization
• Pillars of ML: Linear Algebra, Probability and Optimization.
• We care about finding the "best" classifier, the "least" loss, "maximizing" reward
2. Solving an Unconstrained Optimization Problem
• We want to minimize f (x)
• We start with x₀ (arbitrary choice), then for t = 0, 1, 2, ..., T we update xₜ₊₁ = xₜ + d, where d is the direction.
• d = −αf′(xₜ), α = step size; gradient descent converges to a local minimum
• Convex function: Functions in which local minima ≡ global minima
• Taylor Series: f(x + ηd) = f(x) + ηd·f′(x) + (η²d²/2)·f″(x) + ...
• For higher dimensions the derivative becomes the gradient
• Newton's method update rule: xₙ₊₁ = xₙ − f′(xₙ)/f″(xₙ). For higher dimensions it requires computing the Hessian matrix; if the Hessian is not invertible, the method cannot be applied. The method may not converge and can either enter an infinite cycle or converge to a saddle point instead of a minimum. It takes more time per iteration and is more computationally and memory intensive.
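A minimal gradient descent sketch for the update rule above; the objective, step size α, and iteration count are illustrative.

```python
def f(x):
    return (x - 3.0) ** 2 + 1.0      # convex example; minimum at x = 3

def f_prime(x):
    return 2.0 * (x - 3.0)

x = 0.0                              # arbitrary starting point x_0
alpha = 0.1                          # step size
for t in range(100):
    x = x - alpha * f_prime(x)       # x_{t+1} = x_t - alpha * f'(x_t)

print(x, f(x))                       # converges near the minimizer x = 3
```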
9 Week 9
1. Constrained Optimization
• In the equality-constrained case, the Lagrange multiplier λ can be negative or positive.
2. Convexity
10 Week 10
1. Properties of convex functions