
Rishabh Indoria

Machine Learning Foundations
December 19, 2024

1 Week 1
1. Supervised Learning: Regression
• Find a model f such that f(x_i) ≈ y_i
• Training data: (x_1, y_1), (x_2, y_2), ..., (x_n, y_n)
• Loss = (1/n) Σ_i (f(x_i) − y_i)^2
• f(x) = w^T x + b  (see the sketch below)
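As a quick illustration (not part of the original notes), here is a minimal numpy sketch of the squared-error loss for a linear model; the data values and parameters are made up.

```python
import numpy as np

# Toy training data (x_i, y_i); the feature matrix X has one row per example.
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([2.1, 3.9, 6.2, 8.1])

# A linear model f(x) = w^T x + b with hand-picked parameters.
w, b = np.array([2.0]), 0.0
predictions = X @ w + b

# Average squared-error loss, (1/n) * sum_i (f(x_i) - y_i)^2.
loss = np.mean((predictions - y) ** 2)
print(loss)
```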

2. Supervised Learning: Classification


• y_i ∈ {−1, +1}
• Loss = (1/n) Σ_i 1(f(x_i) ≠ y_i)
• f(x) = sign(w^T x + b)

3. Validation Data: choosing the right collection of models (model selection) is done using held-out validation data.

4. Unsupervised Learning: Dimensionality Reduction


• Data = {x_1, x_2, ..., x_n}
• Compress, explain, and group the data.
• Encoder f : R^d → R^{d'}, Decoder g : R^{d'} → R^d
• Goal: g(f(x_i)) ≈ x_i
• Loss = (1/n) Σ_i ||g(f(x_i)) − x_i||^2

5. Unsupervised Learning: Density Estimation


• Probabilistic model
• P : R^d → R_+ that sums (integrates) to 1
• P(x) is large if x ∈ Data and low otherwise
• Loss = (1/n) Σ_i −log P(x_i)  (see the sketch below)
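A minimal sketch (not from the notes) of this loss, assuming a simple Gaussian density fitted by maximum likelihood to made-up one-dimensional data.

```python
import numpy as np

# Toy one-dimensional data set.
data = np.array([1.2, 0.8, 1.1, 0.9, 1.4, 0.7])

# Fit a Gaussian density P(x) by maximum likelihood (sample mean and std).
mu, sigma = data.mean(), data.std()

# Density values P(x_i) under the fitted model.
p = np.exp(-0.5 * ((data - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Loss = (1/n) * sum_i -log P(x_i): low when the model puts high density on the data.
loss = np.mean(-np.log(p))
print(loss)
```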

2 Week 2
1. Continuity & Differentiability
• f : R → R is continuous if lim_{x→x*} f(x) = f(x*)
• Differentiable if lim_{x→x*} (f(x) − f(x*))/(x − x*) = f'(x*) exists.
• If f is NOT continuous =⇒ NOT differentiable

2. Linear Approximation
• If f is differentiable
• f (x) ≈ f (x∗ ) + f ′ (x∗ )(x − x∗ )
• f (x) ≈ Lx∗ [f ](x)

3. Higher order Approximation


• f(x) ≈ f(x*) + f'(x*)(x − x*) + (f''(x*)/2!)(x − x*)^2 + ...  (see the numerical check below)
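A small numerical check of the linear and second-order approximations, using f(x) = e^x around x* = 0 as an assumed example.

```python
import numpy as np

# Compare f(x) = exp(x) with its first- and second-order Taylor approximations
# around x* = 0, where f(x*) = f'(x*) = f''(x*) = 1.
x_star = 0.0
x = 0.1

f = np.exp(x)
linear = np.exp(x_star) + np.exp(x_star) * (x - x_star)
quadratic = linear + np.exp(x_star) / 2 * (x - x_star) ** 2

print(f, linear, quadratic)  # the quadratic approximation is closer to f(x)
```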
4. Lines
• Line through point u along vector v: {x : x = u + αv}, where u, v, x ∈ R^d and α ∈ R

• Line through points u and u′: {x : x = u + α(u − u′)}
5. Hyperplanes
• Hyperplane normal to vector w with offset b: {x : w^T x = b}, where x, w ∈ R^d and b ∈ R
6. Partial Derivatives & Gradients
• f : Rd → R
• ∂f/∂x(v) = [∂f/∂x_1(v), ∂f/∂x_2(v), ..., ∂f/∂x_d(v)]
• ∇f(v) = [∂f/∂x(v)]^T

7. Multivariate Linear Approximation


• f(x) ≈ f(v) + ∇f(v)^T (x − v) = L_v[f](x)
8. Directional Derivative
• D_u[f](v) = ∇f(v)^T u, the derivative at point v along the direction u
9. Direction of steepest ascent
• Find u ∈ R^d with ||u|| = 1 that maximizes D_u[f](v)
• The maximizer is u = α∇f(v) for a normalizing α > 0, i.e. the direction of steepest ascent is along the gradient

3 Week 3
1. Four Fundamental Subspaces
• Column Space C(A)
– span(u_1, u_2, ..., u_n) = the set of all linear combinations of u_1, ..., u_n
– If Ax = b has a solution, then b ∈ C(A)
– Rank = number of pivot columns = dim(C(A))
• Null Space N (A)
– N(A) = {x : Ax = 0}
– If A is invertible then N (A) only contains zero, and Ax = b has a unique solution.
– Nullity = number of free variables = dim(N (A))
– If A has n columns, then rank + nullity = n
– Can use Gaussian Elimination to solve for N (A)
• Row Space R(A)
– Column Space of AT
– Column Rank dim(C(A)) = Row Rank dim(R(A))
– R(A) ⊥ N (A)
• Left Null Space N (AT )
– C(A) ⊥ N (AT )
2. Orthogonal and Vector Sub Spaces
• Orthogonal vectors: x ⊥ y if x · y = x^T y = 0
• Orthonormal vectors: u ⊥ v and ||u|| = ||v|| = 1
3. Projections
• Projection onto a line
– p = x̂a
– e = b − p = b − x̂a
– e ⊥ a =⇒ x̂ = a^T b / a^T a
– Projection matrix P = a a^T / (a^T a)
– p = P b
– P is symmetric, P^2 = P, rank(P) = 1
• Projection onto a subspace

– Projection of b onto C(A) for the system Ax = b
– p = Ax̂, e = b − Ax̂
– e ⊥ every vector in C(A), and N(A^T) ⊥ C(A) =⇒ e ∈ N(A^T)
– Projection matrix P = A(A^T A)^{-1} A^T, p = P b

4. Least Squares
• Suppose we have a vector b which leads to an inconsistent system, Ax ≠ b for every x
• The next best thing is to minimize the average squared error E^2 = ||Ax − b||^2
• ∂E^2/∂x = 0 =⇒ (A^T A)x̂ = A^T b, the normal equations  (see the sketch below)
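A minimal numpy sketch of the normal equations and the projection interpretation, on a made-up inconsistent system.

```python
import numpy as np

# Overdetermined system Ax = b with no exact solution.
A = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
b = np.array([1.0, 2.0, 2.0])

# Normal equations: (A^T A) x_hat = A^T b.
x_hat = np.linalg.solve(A.T @ A, A.T @ b)

# Projection of b onto C(A): p = A x_hat; the error e = b - p is orthogonal to C(A).
p = A @ x_hat
e = b - p
print(x_hat, A.T @ e)  # A^T e is (numerically) the zero vector
```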

4 Week 4
1. Linear and Polynomial Regression
• Minimize the loss L(θ) = (1/2) Σ_i (x_i^T θ − y_i)^2
• Use the least squares method: (A^T A)θ = A^T Y

• Polynomial Regression
– Transformed features: ŷ(x) = θ_0 + θ_1 x + θ_2 x^2 + ... + θ_m x^m = Σ_j θ_j φ_j(x), with φ_j(x) = x^j
– ŷ(x) = θ^T φ(x), (A^T A)θ = A^T Y
– Then proceed as in linear regression
• Regularized Loss
– L̄(θ) = (1/2) Σ_i (x_i^T θ − y_i)^2 + λ||θ||^2, where λ||θ||^2 is the regularization term
– (A^T A + λI)θ_reg = A^T Y  (see the ridge sketch below)
– Overfitting → λ too small
– Underfitting → λ too large
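A sketch of regularized polynomial regression on assumed toy data; np.vander builds the feature matrix with columns φ_j(x) = x^j.

```python
import numpy as np

# Polynomial ridge regression sketch: features phi_j(x) = x^j for j = 0..m.
x = np.array([0.0, 0.5, 1.0, 1.5, 2.0])
y = np.array([0.1, 0.4, 1.1, 2.2, 4.1])
m = 3
A = np.vander(x, m + 1, increasing=True)  # columns are x^0, x^1, ..., x^m

# Regularized normal equations: (A^T A + lambda * I) theta = A^T y.
lam = 0.1
theta = np.linalg.solve(A.T @ A + lam * np.eye(m + 1), A.T @ y)

# Predictions y_hat(x) = theta^T phi(x) and the unregularized fit error.
y_hat = A @ theta
print(theta, np.mean((y_hat - y) ** 2))
```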
2. Eigenvalues and Eigenvectors
• Eigenvalue equation Ax = λx
• du/dt = Au can be solved with solutions of the form u(t) = e^{λt} x if Ax = λx
• (A − λI)x = 0
Characteristic polynomial |A − λI| = 0
Trace of A = Σλ = Sum of diagonal elements of A
|A| = Determinant of A = Πλ
3. Diagonalization of a Matrix
• A matrix A is diagonalizable if there exists an invertible matrix S such that S^{-1}AS = Λ, where Λ is a diagonal matrix (of eigenvalues)
• S = [x_1 x_2 ... x_n], whose columns x_1, x_2, ..., x_n are the eigenvectors of A
• S^{-1} A^k S = Λ^k, k ≥ 1
• For a symmetric A, A = QΛQ^T with Q = [q_1 q_2 ... q_n], where q_i = x_i / ||x_i|| are orthonormal eigenvectors

4. Fibonacci sequence: F_k ≈ (1/√5) ((1 + √5)/2)^k  (see the sketch below)
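A sketch tying diagonalization to the Fibonacci estimate: powers of the matrix A = [[1, 1], [1, 0]] are computed through S Λ^k S^{-1} and compared with the closed-form approximation (a toy check, not from the notes).

```python
import numpy as np

# Fibonacci recurrence in matrix form; F_k grows like the eigenvalue (1 + sqrt(5)) / 2.
A = np.array([[1.0, 1.0],
              [1.0, 0.0]])
eigvals, S = np.linalg.eig(A)          # A = S Lambda S^{-1}

k = 10
Lambda_k = np.diag(eigvals ** k)       # powers are easy on the diagonal
A_k = S @ Lambda_k @ np.linalg.inv(S)  # A^k = S Lambda^k S^{-1}

# [F_{k+1}, F_k] = A^k [F_1, F_0] with F_0 = 0, F_1 = 1.
F_k = (A_k @ np.array([1.0, 0.0]))[1]
approx = (1 / np.sqrt(5)) * ((1 + np.sqrt(5)) / 2) ** k
print(F_k, approx)  # 55.0 and roughly 55.004
```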

5 Week 5
1. Complex Matrices

• C^n: the complex counterpart of R^n


• Inner product: x · y = x̄^T y
• x̄^T y ≠ ȳ^T x in general
• ||x||^2 = x̄^T x

• A* = conjugate transpose of A = Ā^T
2. Hermitian Matrix

• A* = A, the complex analogue of a symmetric matrix
• All eigenvalues are real, and eigenvectors corresponding to distinct eigenvalues are orthogonal
3. Unitary Matrix
• U ∗U = I
• ||Ux|| = ||x||
• U −1 = U ∗
• |λ| = 1, where λ is any eigenvalue
4. Diagonalization of Hermitian Matrices

• A is unitarily diagonalizable if A = UΛU*
• Any n × n matrix A is unitarily similar to an n × n upper triangular matrix T: A = UTU*
• If U_1 = [w_1 w_2 ...] is this matrix, take w_1 = X_1 (the first eigenvector), then w_2 = X_2 − ((w_1 · X_2)/||w_1||^2) w_1, and so on (Gram–Schmidt)

6 Week 6
1. Singular Value Decomposition
• Let A be a real symmetric matrix.
Then all eigenvalues of A are real and A is orthogonally diagonalizable:
A = QΛQ^T, Q^T Q = I
• Any real m × n matrix A can be decomposed into the SVD form
A (m × n) = Q_1 (m × m) Σ (m × n) Q_2^T (n × n), Q_1^T Q_1 = I, Q_2^T Q_2 = I
Q_1 and Q_2 are orthogonal
• Σ = [D 0; 0 0] in block form, where D = diag(σ_1, σ_2, ..., σ_r)
• The σ_i are called singular values, and σ_i = √λ_i,
where the λ_i are eigenvalues of A^T A and the x_i are the corresponding eigenvectors
• Let y_i = A x_i / σ_i
• Q_1 = [y_1 y_2 ... y_m], where the y_i are eigenvectors of A A^T = Q_1 Σ Σ^T Q_1^T, and
Q_2 = [x_1 x_2 ... x_n], where the x_i are eigenvectors of A^T A = Q_2 Σ^T Σ Q_2^T
(see the numpy sketch below)
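A numpy sketch (my own check, not in the notes) of the SVD factorization and the singular value / eigenvalue relationship.

```python
import numpy as np

# SVD sketch: A = Q1 @ Sigma @ Q2.T for a real m x n matrix.
A = np.array([[3.0, 1.0, 1.0],
              [-1.0, 3.0, 1.0]])

Q1, sigma, Q2T = np.linalg.svd(A)   # sigma holds the singular values
print(sigma)

# Singular values are the square roots of the eigenvalues of A A^T (and A^T A).
eigvals = np.linalg.eigvalsh(A @ A.T)
print(np.sqrt(np.sort(eigvals)[::-1]))

# Reconstruct A from the factors.
Sigma = np.zeros(A.shape)
np.fill_diagonal(Sigma, sigma)
print(np.allclose(A, Q1 @ Sigma @ Q2T))  # True
```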
2. Positive Definite
• A function f that vanishes at (0, 0) and is strictly positive at all other points is positive definite
• For f(x, y) = ax^2 + 2bxy + cy^2 to be positive definite:
a, c > 0 and ac > b^2
• If ac = b^2 then f(x, y) is positive semi-definite (a > 0) or negative semi-definite (a < 0)
• If ac < b^2 then (0, 0) is a saddle point
• f(x, y) = v^T A v, where v = [x, y]^T and A = [a b; b c]
|A| < 0 =⇒ saddle point; the eigenvalues of A are positive if f(x, y) is positive definite
|A| = 0 =⇒ semi-definite
(a quick numerical check follows this list)
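A quick check (illustrative, with assumed values of a, b, c) that the 2 × 2 test a > 0, ac > b^2 agrees with the eigenvalue test.

```python
import numpy as np

# Positive-definiteness check for f(x, y) = a x^2 + 2 b x y + c y^2 = v^T A v.
a, b, c = 2.0, 1.0, 3.0
A = np.array([[a, b],
              [b, c]])

# Test 1: a > 0 and ac > b^2.
print(a > 0 and a * c > b ** 2)           # True

# Test 2: all eigenvalues of A are positive.
print(np.all(np.linalg.eigvalsh(A) > 0))  # True
```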

7 Week 7
1. Principal Component Analysis
• Start with as many features as you can collect, then find a good subset of them: project the data onto a lower-dimensional subspace such that the reconstruction error is minimized and the variance of the projected data is maximized.
• Actual: x_i = Σ_{j=1}^{d} (x_i^T u_j) u_j, Projected: x̃_i = Σ_{j=1}^{m} z_ij u_j + Σ_{j=m+1}^{d} β_j u_j
• Loss function J = (1/n) Σ_{i=1}^{n} ||x_i − x̃_i||^2.
Differentiating and setting to 0 gives z_ij = x_i^T u_j and β_j = x̄^T u_j
• So for a given m-dimensional subspace spanned by B = {u_1, u_2, ..., u_m}, the projected data is
x̃_i = Σ_{j=1}^{m} (x_i^T u_j) u_j + Σ_{j=m+1}^{d} (x̄^T u_j) u_j
Loss J* = Σ_{j=m+1}^{d} u_j^T C u_j, where C = (1/n) Σ_{i=1}^{n} (x_i − x̄)(x_i − x̄)^T and {u_1, u_2, ..., u_d} are eigenvectors of C.
For maximizing variance, the maximizer is the eigenvector of C corresponding to the largest eigenvalue; the maximum variance equals this eigenvalue.
(see the sketch below)
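A numpy sketch of PCA by eigendecomposition of C on synthetic data (the data and the choice of m are assumptions for illustration); it checks that the reconstruction loss equals the sum of the discarded eigenvalues.

```python
import numpy as np

# PCA sketch via the eigendecomposition of the covariance matrix C.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3)) @ np.diag([3.0, 1.0, 0.1])

x_bar = X.mean(axis=0)
C = (X - x_bar).T @ (X - x_bar) / len(X)        # C = (1/n) sum (x_i - x_bar)(x_i - x_bar)^T

eigvals, U = np.linalg.eigh(C)                  # ascending eigenvalues, orthonormal eigenvectors
order = np.argsort(eigvals)[::-1]
eigvals, U = eigvals[order], U[:, order]

m = 1                                           # keep the top-m directions
B = U[:, :m]
X_tilde = x_bar + (X - x_bar) @ B @ B.T         # reconstruction from the m-dim projection

J = np.mean(np.sum((X - X_tilde) ** 2, axis=1)) # reconstruction loss
print(J, eigvals[m:].sum())                     # the two numbers agree
```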

2. PCA in higher dimension


• Suppose D = {x_1, x_2, ..., x_n} where x_i ∈ R^d and d >> n; it is easier to handle an n × n matrix than a d × d matrix.
• C = (1/n) Σ_i (x_i − x̄)(x_i − x̄)^T = (1/n) A^T A is a d × d matrix, where the rows of A are the centered points (x_i − x̄)^T.
• Since rank(C) ≤ n, at least (d − n) eigenvalues are 0. Hence it is enough to find the eigenvectors of the n × n matrix (1/n) A A^T and map them back to eigenvectors of C (if (1/n) A A^T v = λv, then u = A^T v satisfies Cu = λu), as in the sketch below.
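A sketch of the d >> n trick under an assumed random data set: eigenvectors of the small (1/n) A A^T are mapped back to eigenvectors of C = (1/n) A^T A.

```python
import numpy as np

# d >> n trick: get eigenvectors of the d x d covariance from the small n x n matrix.
rng = np.random.default_rng(1)
n, d = 20, 500
X = rng.normal(size=(n, d))

A = X - X.mean(axis=0)                      # centered data, one point per row
K = A @ A.T / n                             # n x n matrix (1/n) A A^T

eigvals, V = np.linalg.eigh(K)              # eigenvectors v of (1/n) A A^T
# Map back: u = A^T v is an eigenvector of C = (1/n) A^T A with the same eigenvalue.
u = A.T @ V[:, -1]
u = u / np.linalg.norm(u)

C = A.T @ A / n                             # the d x d covariance (built only to verify)
print(np.allclose(C @ u, eigvals[-1] * u))  # True
```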

8 Week 8
1. Introduction to Optimization
• Pillars of ML: Linear Algebra, Probability and Optimization.
• We care about finding the "best" classifier, the "least" loss, "maximizing" the reward
2. Solving an Unconstrained Optimization Problem
• We want to minimize f (x)
• We start with x_0 (an arbitrary choice), then for t = 0, 1, 2, ..., T we update x_{t+1} = x_t + d, where d is the direction.
• d = −α f'(x_t), where α is the STEP SIZE; gradient descent converges to a local minimum (see the sketch below)
• Convex function: a function in which every local minimum ≡ global minimum
• Taylor series: f(x + ηd) = f(x) + ηd f'(x) + (η^2 d^2 / 2) f''(x) + ...
• For higher dimensions the derivative becomes the gradient
• Newton's method update rule: x_{n+1} = x_n − f'(x_n)/f''(x_n). In higher dimensions it requires computing the Hessian matrix; if the Hessian is not invertible the method cannot be applied. The method may not converge and can enter an infinite cycle or converge to a saddle point instead of a minimum. It takes more time per iteration and is more computationally and memory intensive.
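A minimal sketch (on a toy function of my own choosing) of gradient descent and the one-dimensional Newton update.

```python
# Gradient descent and Newton's method on a simple convex function
# f(x) = (x - 3)^2 + 1, with f'(x) = 2(x - 3) and f''(x) = 2.
f_prime = lambda x: 2 * (x - 3)
f_double_prime = lambda x: 2.0

# Gradient descent: x_{t+1} = x_t - alpha * f'(x_t).
x, alpha = 0.0, 0.1
for _ in range(100):
    x = x - alpha * f_prime(x)
print(x)  # close to the minimizer x* = 3

# Newton's method: x_{n+1} = x_n - f'(x_n) / f''(x_n); exact in one step for a quadratic.
x = 0.0
x = x - f_prime(x) / f_double_prime(x)
print(x)  # 3.0
```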

9 Week 9
1. Constrained Optimization

• Minimize f (x) such that g(x) ≤ 0


• x* is a feasible solution if g(x*) ≤ 0; at an optimal x*, NO "descent direction" should also be a "feasible direction".
• Descent direction: Any direction that reduces our functions value, d is a descent direction if dT ∇f (x∗ ) <
0.
• Feasible direction: Any direction that takes to a point which satisfies all constraints, d is a feasible
direction if dT ∇g(x∗ ) < 0.
• Necessary condition for optimal solution: ∇f (x∗ ) = −λ∇g(x∗ ), λ is positive

• In the equality-constraint case, λ can be negative or positive.
2. Convexity

• A set S ⊆ R^d is a convex set if ∀ x_1, x_2 ∈ S and λ ∈ [0, 1], λx_1 + (1 − λ)x_2 ∈ S


• Intersection of convex sets is also a convex set.
• z ∈ R^d, z = Σ_i λ_i x_i, is a convex combination of points x_i in S if λ_i ≥ 0 and Σ_i λ_i = 1.
The set of all such combinations is called the convex hull of S, Conv(S)
• Euclidean balls in R^d: {x : ||x||_2 ≤ θ}, where ||x||_2 = √(Σ_i x_i^2)
3. Convex functions
• f : R^d → R, with R^d (or any convex set) as the domain; define epi(f) = {[x, z] ∈ R^{d+1} : z ≥ f(x)}.
f is a convex function if epi(f) is a convex set.
• f is a convex function iff ∀ x_1, x_2 ∈ R^d and all λ ∈ [0, 1],
f(λx_1 + (1 − λ)x_2) ≤ λf(x_1) + (1 − λ)f(x_2)
• Assuming f is differentiable, f is convex iff
f(y) ≥ f(x) + (y − x)^T ∇f(x)
• If f is twice differentiable, H ∈ R^{d×d}, H_ij = ∂²f / ∂x_i ∂x_j;
f is convex iff H is a positive semi-definite matrix, i.e. eigenvalues(H) ≥ 0 (see the check below)
• If f is a convex function, then all local minima of f are also global minima
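A tiny check (an assumed example) that convexity of a quadratic f(x) = x^T A x follows from its Hessian H = 2A being positive semi-definite.

```python
import numpy as np

# Convexity check via the Hessian for f(x) = x^T A x with a fixed symmetric A:
# the Hessian is H = 2A, so f is convex iff all eigenvalues of H are >= 0.
A = np.array([[2.0, 1.0],
              [1.0, 2.0]])
H = 2 * A
print(np.all(np.linalg.eigvalsh(H) >= 0))  # True, so f is convex
```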

10 Week 10
1. Properties of convex functions

• If f and g are both convex then f + g is also convex


• If f is convex and non-decreasing and g is convex, then f(g(·)) is also convex
• If f is convex and g is linear, then f ∘ g is convex
• In general, if f and g are convex then f ∘ g may not be convex.

2. Analytical Solution for Linear Regression: w = (X^T X)^{-1} X^T y


3. Constrained Optimization
• minimize f (x) such that h(x) ≤ 0
• Lagrangian function L(x, λ) = f(x) + λh(x), where λ is a scalar.
For h(x) ≤ 0, max_{λ≥0} L(x, λ) = f(x), attained at λ = 0
For h(x) > 0, max_{λ≥0} L(x, λ) = ∞ (as λ → ∞)
• The PRIMAL min_x f(x) subject to h(x) ≤ 0 equals min_x max_{λ≥0} L(x, λ); the DUAL is max_{λ≥0} min_x L(x, λ), where min_x L(x, λ) is an unconstrained problem.
• g(λ) = min_x [f(x) + λh(x)] is a concave function of λ, so maximizing it is a convex problem

4. Relation between PRIMAL and DUAL


• g(λ∗ ) ≤ f (x∗ ), value at DUAL optimum ≤ value at PRIMAL optimum (WEAK DUALITY)
• If f and h are convex, then STRONG DUALITY holds (together with a constraint qualification such as Slater's condition)
• KKT conditions for constrained optimization (a worked example follows):
∇f(x*) + λ*∇h(x*) = 0  (stationarity)
λ* h(x*) = 0  (complementary slackness)
h(x*) ≤ 0  (PRIMAL feasibility)
λ* ≥ 0  (DUAL feasibility)
In general, if (x*, λ*) satisfies the above conditions =⇒ local optimum
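As a small worked example (mine, not from the notes): minimize f(x) = x^2 subject to h(x) = 1 − x ≤ 0, i.e. x ≥ 1. The optimum is x* = 1. Stationarity ∇f(x*) + λ*∇h(x*) = 2x* − λ* = 0 gives λ* = 2 ≥ 0 (DUAL feasibility); h(x*) = 0, so PRIMAL feasibility and complementary slackness λ*h(x*) = 0 also hold, and all four KKT conditions are satisfied.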

5. Support Vector Machine


• min (1/2)||w||^2 such that w^T x_i y_i ≥ 1 for all i  (see the sketch below)
• Data set: {(x1 , y1 ), (x2 , y2 ), ..., (xn , yn )}
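A sketch of fitting such a maximum-margin classifier on made-up separable data. Using scikit-learn here is my assumption (the notes do not mention it), with a large C to approximate the hard-margin problem; note SVC also fits a bias term b, unlike the bias-free constraint w^T x_i y_i ≥ 1 above.

```python
import numpy as np
from sklearn.svm import SVC

# Linearly separable toy data with labels in {-1, +1}.
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -1.0], [-3.0, -2.0]])
y = np.array([1, 1, -1, -1])

# A very large C approximates the hard-margin problem min (1/2)||w||^2 s.t. margins >= 1.
clf = SVC(kernel="linear", C=1e6)
clf.fit(X, y)

w, b = clf.coef_[0], clf.intercept_[0]
print(w, b)
print(y * (X @ w + b))  # margins; all >= 1 up to numerical tolerance
```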
