6 - Super-Cheatsheet-Mathematics

This document summarizes the mathematics refreshers from the CS 229 machine learning course: probability and statistics (sample space, events, axioms of probability, permutations, combinations, conditional probability and Bayes' rule, partitions, independence, random variables, expectations, moments, characteristic functions, main distributions, and parameter estimation) and linear algebra and calculus (vectors, matrices, matrix operations and properties, and matrix calculus). Formulas are given in both the discrete and continuous cases where relevant.


CS 229 – Machine Learning Shervine Amidi & Afshine Amidi

❒ Error analysis – Error analysis is analyzing the root cause of the difference in performance between the current and the perfect models.

❒ Ablative analysis – Ablative analysis is analyzing the root cause of the difference in performance between the current and the baseline models.

5 Refreshers

5.1 Probabilities and Statistics


5.1.1 Introduction to Probability and Combinatorics

❒ Sample space – The set of all possible outcomes of an experiment is known as the sample space of the experiment and is denoted by S.

❒ Event – Any subset E of the sample space is known as an event. That is, an event is a set consisting of possible outcomes of the experiment. If the outcome of the experiment is contained in E, then we say that E has occurred.

❒ Axioms of probability – For each event E, we denote P(E) as the probability of event E occurring. By noting E_1, ..., E_n mutually exclusive events, we have the 3 following axioms:

$$(1)\quad 0 \leq P(E) \leq 1 \qquad (2)\quad P(S) = 1 \qquad (3)\quad P\left(\bigcup_{i=1}^{n} E_i\right) = \sum_{i=1}^{n} P(E_i)$$

❒ Permutation – A permutation is an arrangement of r objects from a pool of n objects, in a given order. The number of such arrangements is given by P(n, r), defined as:

$$P(n, r) = \frac{n!}{(n - r)!}$$

❒ Combination – A combination is an arrangement of r objects from a pool of n objects, where the order does not matter. The number of such arrangements is given by C(n, r), defined as:

$$C(n, r) = \frac{P(n, r)}{r!} = \frac{n!}{r!(n - r)!}$$

Remark: we note that for 0 ≤ r ≤ n, we have P(n, r) ≥ C(n, r).
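
These counting formulas can be checked directly with Python's standard library (a minimal sketch; the choice of n = 10 and r = 3 is arbitrary, and math.perm / math.comb require Python 3.8+):

```python
from math import comb, factorial, perm

n, r = 10, 3

# P(n, r) = n! / (n - r)!  (ordered arrangements)
P = factorial(n) // factorial(n - r)
# C(n, r) = P(n, r) / r! = n! / (r! (n - r)!)  (unordered arrangements)
C = P // factorial(r)

assert P == perm(n, r) == 720
assert C == comb(n, r) == 120
# For 0 <= r <= n, P(n, r) >= C(n, r)
assert all(perm(n, k) >= comb(n, k) for k in range(n + 1))
print(P, C)  # 720 120
```
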

5.1.2 Conditional Probability


❒ Bayes' rule – For events A and B such that P(B) > 0, we have:

$$P(A|B) = \frac{P(B|A)\,P(A)}{P(B)}$$

Remark: we have P(A ∩ B) = P(A)P(B|A) = P(A|B)P(B).

❒ Partition – Let {A_i, i ∈ [[1, n]]} be such that for all i, A_i ≠ ∅. We say that {A_i} is a partition if we have:

$$\forall i \neq j,\ A_i \cap A_j = \emptyset \quad \textrm{ and } \quad \bigcup_{i=1}^{n} A_i = S$$

Remark: for any event B in the sample space, we have:

$$P(B) = \sum_{i=1}^{n} P(B|A_i)\,P(A_i)$$


❒ Extended form of Bayes' rule – Let {A_i, i ∈ [[1, n]]} be a partition of the sample space. We have:

$$P(A_k|B) = \frac{P(B|A_k)\,P(A_k)}{\sum_{i=1}^{n} P(B|A_i)\,P(A_i)}$$

❒ Independence – Two events A and B are independent if and only if we have:

$$P(A \cap B) = P(A)\,P(B)$$
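
A small numerical illustration of the extended form of Bayes' rule above (a sketch with made-up priors P(A_i) and likelihoods P(B|A_i), not values taken from the text):

```python
# Partition A_1, ..., A_n of the sample space: the priors P(A_i) sum to 1.
priors = [0.5, 0.3, 0.2]        # P(A_i)   (hypothetical values)
likelihoods = [0.9, 0.5, 0.1]   # P(B|A_i) (hypothetical values)

# Law of total probability: P(B) = sum_i P(B|A_i) P(A_i)
p_b = sum(l * p for l, p in zip(likelihoods, priors))

# Extended Bayes' rule: P(A_k|B) = P(B|A_k) P(A_k) / P(B)
posteriors = [l * p / p_b for l, p in zip(likelihoods, priors)]

print(round(p_b, 3), [round(q, 3) for q in posteriors])
assert abs(sum(posteriors) - 1) < 1e-12   # posteriors over a partition sum to 1
```
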

5.1.3 Random Variables


❒ Random variable – A random variable, often noted X, is a function that maps every element in a sample space to a real line.

❒ Cumulative distribution function (CDF) – The cumulative distribution function F, which is monotonically non-decreasing and is such that lim_{x→−∞} F(x) = 0 and lim_{x→+∞} F(x) = 1, is defined as:

$$F(x) = P(X \leq x)$$

Remark: we have P(a < X ≤ b) = F(b) − F(a).

❒ Probability density function (PDF) – The probability density function f is the probability that X takes on values between two adjacent realizations of the random variable.

❒ Relationships involving the PDF and CDF – Here are the important properties to know in the discrete (D) and the continuous (C) cases:

- (D) discrete case: CDF F(x) = Σ_{x_i ≤ x} P(X = x_i);  PDF f(x_j) = P(X = x_j);  properties: 0 ≤ f(x_j) ≤ 1 and Σ_j f(x_j) = 1
- (C) continuous case: CDF F(x) = ∫_{−∞}^{x} f(y) dy;  PDF f(x) = dF/dx;  properties: f(x) ≥ 0 and ∫_{−∞}^{+∞} f(x) dx = 1

❒ Expectation and Moments of the Distribution – Here are the expressions of the expected value E[X], generalized expected value E[g(X)], kth moment E[X^k] and characteristic function ψ(ω) for the discrete and continuous cases:

- (D) discrete case: E[X] = Σ_{i=1}^{n} x_i f(x_i);  E[g(X)] = Σ_{i=1}^{n} g(x_i) f(x_i);  E[X^k] = Σ_{i=1}^{n} x_i^k f(x_i);  ψ(ω) = Σ_{i=1}^{n} f(x_i) e^{iωx_i}
- (C) continuous case: E[X] = ∫_{−∞}^{+∞} x f(x) dx;  E[g(X)] = ∫_{−∞}^{+∞} g(x) f(x) dx;  E[X^k] = ∫_{−∞}^{+∞} x^k f(x) dx;  ψ(ω) = ∫_{−∞}^{+∞} f(x) e^{iωx} dx

Remark: we have e^{iωx} = cos(ωx) + i sin(ωx).
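
The discrete-case expressions above can be evaluated directly; below is a NumPy sketch for a hypothetical pmf, which also checks the remark e^{iωx} = cos(ωx) + i sin(ωx) on the characteristic function:

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0])    # support of X (hypothetical)
f = np.array([0.1, 0.4, 0.3, 0.2])    # pmf f(x_i), sums to 1

E_X  = np.sum(x * f)                  # E[X]     = sum_i x_i f(x_i)
E_g  = np.sum(np.exp(x) * f)          # E[g(X)]  with g = exp, as an example
E_X2 = np.sum(x ** 2 * f)             # E[X^2], the 2nd moment

omega = 0.7
psi = np.sum(f * np.exp(1j * omega * x))   # psi(omega) = sum_i f(x_i) e^{i omega x_i}
# e^{i omega x} = cos(omega x) + i sin(omega x)
assert np.isclose(psi, f @ np.cos(omega * x) + 1j * (f @ np.sin(omega * x)))
print(E_X, E_g, E_X2, psi)
```
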

❒ Variance – The variance of a random variable, often noted Var(X) or σ^2, is a measure of the spread of its distribution function. It is determined as follows:

$$\textrm{Var}(X) = E[(X - E[X])^2] = E[X^2] - E[X]^2$$

❒ Standard deviation – The standard deviation of a random variable, often noted σ, is a measure of the spread of its distribution function which is compatible with the units of the actual random variable. It is determined as follows:

$$\sigma = \sqrt{\textrm{Var}(X)}$$

❒ Revisiting the kth moment – The kth moment can also be computed with the characteristic function as follows:

$$E[X^k] = \frac{1}{i^k}\left[\frac{\partial^k \psi}{\partial \omega^k}\right]_{\omega = 0}$$

❒ Transformation of random variables – Let the variables X and Y be linked by some function. By noting f_X and f_Y the distribution function of X and Y respectively, we have:

$$f_Y(y) = f_X(x)\left|\frac{dx}{dy}\right|$$

❒ Leibniz integral rule – Let g be a function of x and potentially c, and a, b boundaries that may depend on c. We have:

$$\frac{\partial}{\partial c}\left(\int_a^b g(x)\,dx\right) = \frac{\partial b}{\partial c}\cdot g(b) - \frac{\partial a}{\partial c}\cdot g(a) + \int_a^b \frac{\partial g}{\partial c}(x)\,dx$$

❒ Chebyshev's inequality – Let X be a random variable with expected value µ and standard deviation σ. For k, σ > 0, we have the following inequality:

$$P(|X - \mu| \geq k\sigma) \leq \frac{1}{k^2}$$
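
A quick Monte Carlo check of the variance identity and of Chebyshev's inequality (a sketch; the exponential distribution is an arbitrary test case):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.exponential(scale=2.0, size=1_000_000)   # arbitrary test distribution

mu, sigma = X.mean(), X.std()
assert np.isclose(X.var(), (X ** 2).mean() - X.mean() ** 2)   # Var(X) = E[X^2] - E[X]^2

for k in [1.5, 2.0, 3.0]:
    empirical = np.mean(np.abs(X - mu) >= k * sigma)   # estimate of P(|X - mu| >= k sigma)
    print(f"k={k}: empirical {empirical:.4f} <= bound {1 / k**2:.4f}")
    assert empirical <= 1 / k ** 2
```
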
5.1.4 Jointly Distributed Random Variables

❒ Conditional density – The conditional density of X with respect to Y, often noted f_{X|Y}, is defined as follows:

$$f_{X|Y}(x) = \frac{f_{XY}(x, y)}{f_Y(y)}$$

❒ Independence – Two random variables X and Y are said to be independent if we have:

$$f_{XY}(x, y) = f_X(x)\,f_Y(y)$$


❒ Marginal density and cumulative distribution – From the joint density probability function f_{XY}, we have:

- (D) discrete case: marginal density f_X(x_i) = Σ_j f_{XY}(x_i, y_j);  cumulative function F_{XY}(x, y) = Σ_{x_i ≤ x} Σ_{y_j ≤ y} f_{XY}(x_i, y_j)
- (C) continuous case: marginal density f_X(x) = ∫_{−∞}^{+∞} f_{XY}(x, y) dy;  cumulative function F_{XY}(x, y) = ∫_{−∞}^{x} ∫_{−∞}^{y} f_{XY}(x', y') dx' dy'
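
A minimal discrete sketch of marginal densities (with a small hypothetical joint pmf), which also illustrates the independence criterion f_XY(x, y) = f_X(x) f_Y(y) from the previous subsection:

```python
import numpy as np

# Hypothetical joint pmf f_XY on a 2x3 grid (rows: values of X, columns: values of Y)
f_xy = np.array([[0.10, 0.20, 0.10],
                 [0.15, 0.30, 0.15]])
assert np.isclose(f_xy.sum(), 1.0)

f_x = f_xy.sum(axis=1)   # marginal of X: f_X(x_i) = sum_j f_XY(x_i, y_j)
f_y = f_xy.sum(axis=0)   # marginal of Y: f_Y(y_j) = sum_i f_XY(x_i, y_j)

# X and Y are independent iff f_XY(x, y) = f_X(x) f_Y(y) everywhere (true for this pmf)
independent = np.allclose(f_xy, np.outer(f_x, f_y))
print(f_x, f_y, independent)
```
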

❒ Distribution of a sum of independent random variables – Let Y = X_1 + ... + X_n with X_1, ..., X_n independent. We have:

$$\psi_Y(\omega) = \prod_{k=1}^{n} \psi_{X_k}(\omega)$$

❒ Covariance – We define the covariance of two random variables X and Y, that we note σ_{XY}^2 or more commonly Cov(X, Y), as follows:

$$\textrm{Cov}(X, Y) \triangleq \sigma_{XY}^2 = E[(X - \mu_X)(Y - \mu_Y)] = E[XY] - \mu_X \mu_Y$$

❒ Correlation – By noting σ_X, σ_Y the standard deviations of X and Y, we define the correlation between the random variables X and Y, noted ρ_{XY}, as follows:

$$\rho_{XY} = \frac{\sigma_{XY}^2}{\sigma_X \sigma_Y}$$

Remarks: For any X, Y, we have ρ_{XY} ∈ [−1, 1]. If X and Y are independent, then ρ_{XY} = 0.
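
The covariance and correlation formulas can be checked on simulated data (a sketch; the coefficients 0.8 and 0.6 are arbitrary and chosen so that the true correlation is 0.8):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=100_000)
y = 0.8 * x + 0.6 * rng.normal(size=100_000)   # correlated with x by construction

cov_xy = np.mean((x - x.mean()) * (y - y.mean()))   # Cov(X,Y) = E[(X - mu_X)(Y - mu_Y)]
assert np.isclose(cov_xy, np.mean(x * y) - x.mean() * y.mean())   # = E[XY] - mu_X mu_Y

rho = cov_xy / (x.std() * y.std())                  # rho_XY = Cov(X,Y) / (sigma_X sigma_Y)
print(rho, np.corrcoef(x, y)[0, 1])                 # both close to 0.8 here
assert -1 <= rho <= 1
```
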
❒ Main distributions – Here are the main distributions to have in mind:

- Binomial (D): X ~ B(n, p), x ∈ [[0, n]];  P(X = x) = C(n, x) p^x q^(n−x);  ψ(ω) = (pe^{iω} + q)^n;  E[X] = np;  Var(X) = npq
- Poisson (D): X ~ Po(µ), x ∈ N;  P(X = x) = (µ^x / x!) e^{−µ};  ψ(ω) = e^{µ(e^{iω} − 1)};  E[X] = µ;  Var(X) = µ
- Uniform (C): X ~ U(a, b), x ∈ [a, b];  f(x) = 1/(b − a);  ψ(ω) = (e^{iωb} − e^{iωa}) / ((b − a)iω);  E[X] = (a + b)/2;  Var(X) = (b − a)^2/12
- Gaussian (C): X ~ N(µ, σ), x ∈ R;  f(x) = (1/(√(2π) σ)) e^{−(1/2)((x − µ)/σ)^2};  ψ(ω) = e^{iωµ − (1/2)ω^2σ^2};  E[X] = µ;  Var(X) = σ^2
- Exponential (C): X ~ Exp(λ), x ∈ R_+;  f(x) = λe^{−λx};  ψ(ω) = 1/(1 − iω/λ);  E[X] = 1/λ;  Var(X) = 1/λ^2
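
A sampling-based sanity check of the E[X] and Var(X) columns above (a NumPy sketch with arbitrary parameter values; note that NumPy's exponential sampler is parameterized by the scale 1/λ):

```python
import numpy as np

rng = np.random.default_rng(2)
N = 1_000_000
n, p, mu_p, a, b, mu_g, sigma, lam = 10, 0.3, 4.0, -1.0, 3.0, 1.0, 2.0, 0.5

samples = {   # name: (samples, theoretical E[X], theoretical Var(X))
    "Binomial":    (rng.binomial(n, p, N),       n * p,       n * p * (1 - p)),
    "Poisson":     (rng.poisson(mu_p, N),        mu_p,        mu_p),
    "Uniform":     (rng.uniform(a, b, N),        (a + b) / 2, (b - a) ** 2 / 12),
    "Gaussian":    (rng.normal(mu_g, sigma, N),  mu_g,        sigma ** 2),
    "Exponential": (rng.exponential(1 / lam, N), 1 / lam,     1 / lam ** 2),
}

for name, (x, mean, var) in samples.items():
    print(f"{name:12s} mean {x.mean():7.3f} (theory {mean:7.3f})"
          f"  var {x.var():7.3f} (theory {var:7.3f})")
```
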
5.1.5 Parameter estimation

❒ Random sample – A random sample is a collection of n random variables X_1, ..., X_n that are independent and identically distributed with X.

❒ Estimator – An estimator θ̂ is a function of the data that is used to infer the value of an unknown parameter θ in a statistical model.

❒ Bias – The bias of an estimator θ̂ is defined as being the difference between the expected value of the distribution of θ̂ and the true value, i.e.:

$$\textrm{Bias}(\hat{\theta}) = E[\hat{\theta}] - \theta$$

Remark: an estimator is said to be unbiased when we have E[θ̂] = θ.

❒ Sample mean and variance – The sample mean and the sample variance of a random sample are used to estimate the true mean µ and the true variance σ^2 of a distribution, are noted X̄ and s^2 respectively, and are such that:

$$\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i \qquad \textrm{ and } \qquad s^2 = \hat{\sigma}^2 = \frac{1}{n-1}\sum_{i=1}^{n} (X_i - \bar{X})^2$$

❒ Central Limit Theorem – Let us have a random sample X_1, ..., X_n following a given distribution with mean µ and variance σ^2, then we have:

$$\bar{X} \underset{n \to +\infty}{\sim} \mathcal{N}\left(\mu, \frac{\sigma}{\sqrt{n}}\right)$$
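
A short simulation of the sample mean, the sample variance and the Central Limit Theorem (a sketch; the exponential distribution is an arbitrary non-Gaussian choice):

```python
import numpy as np

rng = np.random.default_rng(3)
n, trials = 50, 20_000
mu, sigma = 2.0, 2.0   # an Exp(lambda = 1/2) distribution has mean 2 and variance 4

X = rng.exponential(scale=mu, size=(trials, n))   # each row is one random sample X_1..X_n

X_bar = X.mean(axis=1)            # sample mean of each random sample
s2 = X.var(axis=1, ddof=1)        # sample variance: note the 1/(n-1) factor via ddof=1

print(X_bar.mean(), s2.mean())            # close to mu = 2 and sigma^2 = 4 (unbiasedness)
print(X_bar.std(), sigma / np.sqrt(n))    # CLT: X_bar is approximately N(mu, sigma/sqrt(n))
```
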
5.2 Linear Algebra and Calculus

5.2.1 General notations

❒ Vector – We note x ∈ R^n a vector with n entries, where x_i ∈ R is the ith entry:

$$x = \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{pmatrix} \in \mathbb{R}^n$$

❒ Matrix – We note A ∈ R^(m×n) a matrix with m rows and n columns, where A_{i,j} ∈ R is the entry located in the ith row and jth column:

$$A = \begin{pmatrix} A_{1,1} & \cdots & A_{1,n} \\ \vdots & & \vdots \\ A_{m,1} & \cdots & A_{m,n} \end{pmatrix} \in \mathbb{R}^{m \times n}$$

Remark: the vector x defined above can be viewed as a n × 1 matrix and is more particularly called a column-vector.

❒ Identity matrix – The identity matrix I ∈ R^(n×n) is a square matrix with ones in its diagonal and zero everywhere else:

$$I = \begin{pmatrix} 1 & 0 & \cdots & 0 \\ 0 & \ddots & \ddots & \vdots \\ \vdots & \ddots & \ddots & 0 \\ 0 & \cdots & 0 & 1 \end{pmatrix}$$


Remark: for all matrices A ∈ R^(n×n), we have A × I = I × A = A.

❒ Diagonal matrix – A diagonal matrix D ∈ R^(n×n) is a square matrix with nonzero values in its diagonal and zero everywhere else:

$$D = \begin{pmatrix} d_1 & 0 & \cdots & 0 \\ 0 & \ddots & \ddots & \vdots \\ \vdots & \ddots & \ddots & 0 \\ 0 & \cdots & 0 & d_n \end{pmatrix}$$

Remark: we also note D as diag(d_1, ..., d_n).

5.2.2 Matrix operations

❒ Vector-vector multiplication – There are two types of vector-vector products:

• inner product: for x, y ∈ R^n, we have:

$$x^T y = \sum_{i=1}^{n} x_i y_i \ \in \mathbb{R}$$

• outer product: for x ∈ R^m, y ∈ R^n, we have:

$$xy^T = \begin{pmatrix} x_1 y_1 & \cdots & x_1 y_n \\ \vdots & & \vdots \\ x_m y_1 & \cdots & x_m y_n \end{pmatrix} \in \mathbb{R}^{m \times n}$$

❒ Matrix-vector multiplication – The product of matrix A ∈ R^(m×n) and vector x ∈ R^n is a vector of size R^m, such that:

$$Ax = \begin{pmatrix} a_{r,1}^T x \\ \vdots \\ a_{r,m}^T x \end{pmatrix} = \sum_{i=1}^{n} a_{c,i}\, x_i \ \in \mathbb{R}^m$$

where a_{r,i}^T are the vector rows and a_{c,j} are the vector columns of A, and x_i are the entries of x.

❒ Matrix-matrix multiplication – The product of matrices A ∈ R^(m×n) and B ∈ R^(n×p) is a matrix of size R^(m×p), such that:

$$AB = \begin{pmatrix} a_{r,1}^T b_{c,1} & \cdots & a_{r,1}^T b_{c,p} \\ \vdots & & \vdots \\ a_{r,m}^T b_{c,1} & \cdots & a_{r,m}^T b_{c,p} \end{pmatrix} = \sum_{i=1}^{n} a_{c,i}\, b_{r,i}^T \ \in \mathbb{R}^{m \times p}$$

where a_{r,i}^T, b_{r,i}^T are the vector rows and a_{c,j}, b_{c,j} are the vector columns of A and B respectively.
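
These four products map directly to NumPy (a minimal sketch with small arbitrary arrays):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])        # x in R^3
y = np.array([4.0, 5.0, 6.0])        # y in R^3
A = np.arange(6.0).reshape(2, 3)     # A in R^(2x3)
B = np.arange(12.0).reshape(3, 4)    # B in R^(3x4)

inner = x @ y              # x^T y, a scalar
outer = np.outer(x, y)     # x y^T, a 3x3 matrix
Ax = A @ x                 # matrix-vector product, in R^2
AB = A @ B                 # matrix-matrix product, in R^(2x4)

# Ax as a combination of the columns of A weighted by the entries of x
assert np.allclose(Ax, sum(x[i] * A[:, i] for i in range(3)))
# AB as a sum of outer products of the columns of A with the rows of B
assert np.allclose(AB, sum(np.outer(A[:, i], B[i, :]) for i in range(3)))
print(inner, Ax, AB.shape)
```
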
❒ Transpose – The transpose of a matrix A ∈ R^(m×n), noted A^T, is such that its entries are flipped:

$$\forall i, j, \quad A^T_{i,j} = A_{j,i}$$

Remark: for matrices A, B, we have (AB)^T = B^T A^T.

❒ Inverse – The inverse of an invertible square matrix A is noted A^(−1) and is the only matrix such that:

$$AA^{-1} = A^{-1}A = I$$

Remark: not all square matrices are invertible. Also, for matrices A, B, we have (AB)^(−1) = B^(−1) A^(−1).

❒ Trace – The trace of a square matrix A, noted tr(A), is the sum of its diagonal entries:

$$\textrm{tr}(A) = \sum_{i=1}^{n} A_{i,i}$$

Remark: for matrices A, B, we have tr(A^T) = tr(A) and tr(AB) = tr(BA).

❒ Determinant – The determinant of a square matrix A ∈ R^(n×n), noted |A| or det(A), is expressed recursively in terms of A\i,\j, which is the matrix A without its ith row and jth column, as follows:

$$\det(A) = |A| = \sum_{j=1}^{n} (-1)^{i+j} A_{i,j} \,|A_{\backslash i,\backslash j}|$$

Remark: A is invertible if and only if |A| ≠ 0. Also, |AB| = |A||B| and |A^T| = |A|.
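
The remarks on transposes, inverses, traces and determinants can be verified numerically (a sketch with random, almost surely invertible matrices):

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.standard_normal((4, 4))
B = rng.standard_normal((4, 4))

assert np.allclose((A @ B).T, B.T @ A.T)                              # (AB)^T = B^T A^T
assert np.allclose(np.linalg.inv(A @ B),
                   np.linalg.inv(B) @ np.linalg.inv(A))               # (AB)^-1 = B^-1 A^-1
assert np.allclose(A @ np.linalg.inv(A), np.eye(4))                   # A A^-1 = I
assert np.isclose(np.trace(A @ B), np.trace(B @ A))                   # tr(AB) = tr(BA)
assert np.isclose(np.linalg.det(A @ B),
                  np.linalg.det(A) * np.linalg.det(B))                # |AB| = |A||B|
assert np.isclose(np.linalg.det(A.T), np.linalg.det(A))               # |A^T| = |A|
print("all identities verified numerically")
```
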

5.2.3 Matrix properties

❒ Symmetric decomposition – A given matrix A can be expressed in terms of its symmetric and antisymmetric parts as follows:

$$A = \underbrace{\frac{A + A^T}{2}}_{\textrm{Symmetric}} + \underbrace{\frac{A - A^T}{2}}_{\textrm{Antisymmetric}}$$

❒ Norm – A norm is a function N : V → [0, +∞[ where V is a vector space, and such that for all x, y ∈ V, we have:

• N(x + y) ≤ N(x) + N(y)

• N(ax) = |a| N(x) for a scalar

• if N(x) = 0, then x = 0

For x ∈ V, the most commonly used norms are summed up in the table below:


- Manhattan norm, L1: ‖x‖_1 = Σ_{i=1}^{n} |x_i|  (use case: LASSO regularization)
- Euclidean norm, L2: ‖x‖_2 = √(Σ_{i=1}^{n} x_i^2)  (use case: Ridge regularization)
- p-norm, Lp: ‖x‖_p = (Σ_{i=1}^{n} x_i^p)^(1/p)  (use case: Hölder inequality)
- Infinity norm, L∞: ‖x‖_∞ = max_i |x_i|  (use case: uniform convergence)
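
The norms of the table above, computed with NumPy (a sketch on an arbitrary vector; np.linalg.norm covers the L1, L2 and L∞ cases directly):

```python
import numpy as np

x = np.array([3.0, -4.0, 1.0])

l1 = np.linalg.norm(x, 1)                  # Manhattan: sum |x_i|        -> 8.0
l2 = np.linalg.norm(x, 2)                  # Euclidean: sqrt(sum x_i^2)  -> ~5.10
lp = np.sum(np.abs(x) ** 3) ** (1 / 3)     # p-norm with p = 3
linf = np.linalg.norm(x, np.inf)           # Infinity: max |x_i|         -> 4.0
print(l1, l2, lp, linf)

# Triangle inequality N(x + y) <= N(x) + N(y), checked here for the L2 norm
y = np.array([1.0, 2.0, -2.0])
assert np.linalg.norm(x + y) <= np.linalg.norm(x) + np.linalg.norm(y)
```
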
❒ Linear dependence – A set of vectors is said to be linearly dependent if one of the vectors in the set can be defined as a linear combination of the others.

Remark: if no vector can be written this way, then the vectors are said to be linearly independent.

❒ Matrix rank – The rank of a given matrix A is noted rank(A) and is the dimension of the vector space generated by its columns. This is equivalent to the maximum number of linearly independent columns of A.

❒ Positive semi-definite matrix – A matrix A ∈ R^(n×n) is positive semi-definite (PSD) and is noted A ⪰ 0 if we have:

$$A = A^T \quad \textrm{ and } \quad \forall x \in \mathbb{R}^n, \ x^T A x \geq 0$$

Remark: similarly, a matrix A is said to be positive definite, and is noted A ≻ 0, if it is a PSD matrix which satisfies for all non-zero vector x, x^T A x > 0.

❒ Eigenvalue, eigenvector – Given a matrix A ∈ R^(n×n), λ is said to be an eigenvalue of A if there exists a vector z ∈ R^n \ {0}, called eigenvector, such that we have:

$$Az = \lambda z$$

❒ Spectral theorem – Let A ∈ R^(n×n). If A is symmetric, then A is diagonalizable by a real orthogonal matrix U ∈ R^(n×n). By noting Λ = diag(λ_1, ..., λ_n), we have:

$$\exists \Lambda \textrm{ diagonal}, \quad A = U \Lambda U^T$$

❒ Singular-value decomposition – For a given matrix A of dimensions m × n, the singular-value decomposition (SVD) is a factorization technique that guarantees the existence of U m × m unitary, Σ m × n diagonal and V n × n unitary matrices, such that:

$$A = U \Sigma V^T$$
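
A NumPy sketch of the spectral theorem and of the SVD (the matrix M is arbitrary; A = M M^T is used because it is symmetric and positive semi-definite by construction):

```python
import numpy as np

rng = np.random.default_rng(5)
M = rng.standard_normal((4, 3))
A = M @ M.T                  # symmetric and positive semi-definite by construction

# Spectral theorem: a symmetric A is diagonalized by a real orthogonal matrix U
eigvals, U = np.linalg.eigh(A)
assert np.allclose(U @ np.diag(eigvals) @ U.T, A)
assert np.all(eigvals >= -1e-10)            # PSD: all eigenvalues are non-negative

# Each column of U is an eigenvector: A z = lambda z
z, lam = U[:, -1], eigvals[-1]
assert np.allclose(A @ z, lam * z)

# Singular-value decomposition of the rectangular matrix M
Uu, s, Vt = np.linalg.svd(M)                # Uu: 4x4 unitary, s: singular values, Vt: 3x3
Sigma = np.zeros_like(M)
Sigma[:3, :3] = np.diag(s)
assert np.allclose(Uu @ Sigma @ Vt, M)
print(eigvals, s)
```
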

5.2.4 Matrix calculus

❒ Gradient – Let f : R^(m×n) → R be a function and A ∈ R^(m×n) be a matrix. The gradient of f with respect to A is a m × n matrix, noted ∇_A f(A), such that:

$$\left(\nabla_A f(A)\right)_{i,j} = \frac{\partial f(A)}{\partial A_{i,j}}$$

Remark: the gradient of f is only defined when f is a function that returns a scalar.

❒ Hessian – Let f : R^n → R be a function and x ∈ R^n be a vector. The hessian of f with respect to x is a n × n symmetric matrix, noted ∇_x^2 f(x), such that:

$$\left(\nabla_x^2 f(x)\right)_{i,j} = \frac{\partial^2 f(x)}{\partial x_i \partial x_j}$$

Remark: the hessian of f is only defined when f is a function that returns a scalar.

❒ Gradient operations – For matrices A, B, C, the following gradient properties are worth having in mind:

$$\nabla_A \textrm{tr}(AB) = B^T \qquad \nabla_{A^T} f(A) = \left(\nabla_A f(A)\right)^T$$

$$\nabla_A \textrm{tr}(ABA^T C) = CAB + C^T A B^T \qquad \nabla_A |A| = |A| (A^{-1})^T$$
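
Two of these identities can be checked by finite differences (a sketch; num_grad is a hypothetical helper written here, not part of any library):

```python
import numpy as np

rng = np.random.default_rng(6)
A = rng.standard_normal((3, 3))
B = rng.standard_normal((3, 3))
eps = 1e-6

def num_grad(f, A):
    """Finite-difference gradient of a scalar function f with respect to the matrix A."""
    G = np.zeros_like(A)
    for i in range(A.shape[0]):
        for j in range(A.shape[1]):
            E = np.zeros_like(A)
            E[i, j] = eps
            G[i, j] = (f(A + E) - f(A - E)) / (2 * eps)
    return G

# grad_A tr(AB) = B^T
assert np.allclose(num_grad(lambda A: np.trace(A @ B), A), B.T, atol=1e-5)
# grad_A |A| = |A| (A^{-1})^T
assert np.allclose(num_grad(np.linalg.det, A),
                   np.linalg.det(A) * np.linalg.inv(A).T, atol=1e-4)
print("gradient identities verified by finite differences")
```
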
