2. Mathematical Background
Linear Algebra
Scalars, Vectors, Matrices and Tensors
• A scalar is a single number, e.g., a, n, x.
• A vector is a 1-D array of numbers arranged in order. We can identify each individual number by its index in that ordering. We write vectors in lowercase bold typeface, with elements identified by a subscript:

x = [x_1, x_2, ..., x_n]^T.   (2.1)

We can also index a set of elements: for example, with S = {1, 3, 6}, x_{−S} denotes all elements of x except for x_1, x_3 and x_6.
• A matrix is a 2-D array of numbers, so each element is identified by two indices: A_{i,j} is the entry in the i-th row and j-th column of A. When we need to explicitly identify the elements of a matrix, we write them as an array enclosed in square brackets:

A = [ A_{1,1}  A_{1,2}
      A_{2,1}  A_{2,2} ].   (2.2)

• A tensor is an array of numbers arranged on a grid with three or more dimensions.
Matrix Transpose
• One important operation on matrices is the transpose. The transpose of a matrix is the mirror image of the matrix across a diagonal line, called the main diagonal, running down and to the right, starting from its upper left corner. We denote the transpose of a matrix A as A^T, and it is defined such that (A^T)_{i,j} = A_{j,i}. See Fig. 2.1 for a graphical depiction of this operation.

Figure 2.1: The transpose of the matrix can be thought of as a mirror image across the main diagonal.

• Matrix operations have many useful properties that make mathematical analysis more convenient. For example, matrix multiplication is distributive:

A(B + C) = AB + AC.   (2.6)

• The dot product between two vectors is commutative:

x^T y = y^T x.   (2.8)

• The transpose of a matrix product has a simple form:

(AB)^T = B^T A^T.   (2.9)

We can demonstrate Eq. 2.8 by exploiting the fact that the value of such a product is a scalar and therefore equal to its own transpose: x^T y = (x^T y)^T = y^T x.
Matrix (Dot) Product
• The matrix product of matrices A and B is a third matrix C. In order for the product to be defined, A must have the same number of columns as B has rows: if A is of shape m × n and B is of shape n × p, then C is of shape m × p (the inner dimensions must match).
• We can write the matrix product just by placing two or more matrices together:

C = AB.   (2.4)

• The product operation is defined by

C_{i,j} = Σ_k A_{i,k} B_{k,j}.   (2.5)

• Note that the standard product of two matrices is not just a matrix containing the product of the individual elements. Such an operation exists and is called the element-wise or Hadamard product, and is denoted as A ⊙ B.
• The dot product between two vectors x and y of the same dimensionality is the matrix product x^T y. We can think of the matrix product C = AB as computing C_{i,j} as the dot product between row i of A and column j of B.
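To make the shape rule and the three products concrete, here is a minimal NumPy sketch (my addition, not part of the slides; the matrices are arbitrary random examples):

```python
# Check the shape rule (m x n)(n x p) -> (m x p), the Hadamard product,
# and the transpose identities from Eqs. 2.8 and 2.9.
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((2, 3))   # m x n
B = rng.standard_normal((3, 4))   # n x p

C = A @ B                         # matrix product: C[i, j] = sum_k A[i, k] * B[k, j]
print(C.shape)                    # (2, 4), i.e. m x p

H = A * A                         # Hadamard product: element-wise, same-shape operands

x = rng.standard_normal(3)
y = rng.standard_normal(3)
print(np.isclose(x @ y, y @ x))             # dot product is commutative (Eq. 2.8)
print(np.allclose((A @ B).T, B.T @ A.T))    # transpose of a product (Eq. 2.9)
```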
Inverse Matrices
Identity Matrix
• Matrix inversion is a powerful tool that allows us to analytically solve Ax = b for many values of A.
• To describe matrix inversion, we first need to define the concept of an identity matrix. An identity matrix is a matrix that does not change any vector when we multiply that vector by that matrix. We denote the identity matrix that preserves n-dimensional vectors as I_n. Formally, I_n ∈ R^{n×n}, and

∀x ∈ R^n, I_n x = x.   (2.20)

• The structure of the identity matrix is simple: all of the entries along the main diagonal are 1, while all of the other entries are zero. See Fig. 2.2 for an example.

Figure 2.2: Example identity matrix: this is I_3 = [1 0 0; 0 1 0; 0 0 1].

• We list only a few useful properties of the matrix product here, but be aware that many more exist.

Systems of Equations
• We now have enough linear algebra notation to write down a system of linear equations:

Ax = b,   (2.11)

where A ∈ R^{m×n} is a known matrix, b ∈ R^m is a known vector, and x ∈ R^n is a vector of unknown variables we would like to solve for. Each element x_i of x is one of these unknown variables. Each row of A and each element of b provide another constraint.
• Eq. 2.11 expands to

A_{1,1} x_1 + A_{1,2} x_2 + ··· + A_{1,n} x_n = b_1
A_{2,1} x_1 + A_{2,2} x_2 + ··· + A_{2,n} x_n = b_2
...
A_{m,1} x_1 + A_{m,2} x_2 + ··· + A_{m,n} x_n = b_m.

Matrix-vector product notation provides a more compact representation of this form.
• Depending on A and b, the system may have:
  • No solution
  • Many solutions
  • Exactly one solution
• Matrix inverse: the inverse of A, written A^{-1}, is the matrix such that

A^{-1} A = I_n.   (2.21)

• Solving a system using an inverse: we can solve Eq. 2.11 by the following steps:

Ax = b   (2.22)
A^{-1} Ax = A^{-1} b   (2.23)
I_n x = A^{-1} b   (2.24)
x = A^{-1} b.

• Numerically unstable, but useful for abstract analysis.
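A small sketch of these steps (my addition; the system here is a made-up 2×2 example). In code, np.linalg.solve is preferred over forming A^{-1} explicitly, for exactly the numerical-stability reason noted above:

```python
import numpy as np

A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([9.0, 8.0])

x_inv   = np.linalg.inv(A) @ b    # the textbook steps (2.22)-(2.24)
x_solve = np.linalg.solve(A, b)   # factorization-based, numerically safer

print(x_inv, x_solve)             # both give [2. 3.]
print(np.allclose(A @ x_solve, b))
```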
Invertibility
• For A^{-1} to exist, Eq. 2.11 must have exactly one solution for every value of b; this requires A to be square (m = n) with all columns linearly independent.
Norms
• Norms are functions that measure how "large" a vector is — similar to a distance between zero and the point represented by the vector.
• Norms, including the L^p norm, are functions mapping vectors to non-negative values. On an intuitive level, the norm of a vector x measures the distance from the origin to the point x. The L^p norm is given by

||x||_p = ( Σ_i |x_i|^p )^{1/p}

for p ∈ R, p ≥ 1.
• More rigorously, a norm is any function f that satisfies the following properties:
  • f(x) = 0 ⇒ x = 0
  • f(x + y) ≤ f(x) + f(y) (the triangle inequality)
  • ∀α ∈ R, f(αx) = |α| f(x)
• Most popular: p = 2, the L^2 norm. The squared L^2 norm is convenient to work with, but it increases very slowly near the origin. In several machine learning applications, it is important to discriminate between elements that are exactly zero and elements that are small but nonzero. In these cases, we turn to a function that grows at the same rate in all locations, but retains mathematical simplicity: the L^1 norm.
• L^1 norm, p = 1: the L^1 norm may be simplified to

||x||_1 = Σ_i |x_i|.   (2.31)

The L^1 norm is commonly used in machine learning when the difference between zero and nonzero elements is very important. Every time an element of x moves away from 0 by ε, the L^1 norm increases by ε.
• We sometimes measure the size of the vector by counting its number of nonzero elements. Some authors refer to this function as the "L^0 norm," but this is incorrect terminology: the number of non-zero entries in a vector is not a norm, because scaling the vector by α does not change the number of nonzero entries. The L^1 norm is often used as a substitute for the number of nonzero entries.
• Max norm, infinite p: one other norm that commonly arises in machine learning is the L^∞ norm, also known as the max norm. This norm simplifies to the absolute value of the element with the largest magnitude in the vector,

||x||_∞ = max_i |x_i|.   (2.32)

• Sometimes we may also wish to measure the size of a matrix. In the context of deep learning, the most common way to do this is with the otherwise obscure Frobenius norm:

||A||_F = √( Σ_{i,j} A_{i,j}^2 ).
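A quick sketch (my addition, with an arbitrary example vector) computing each of these norms with NumPy:

```python
import numpy as np

x = np.array([3.0, -4.0, 0.0])

l1  = np.linalg.norm(x, 1)        # L^1: sum of absolute values -> 7.0
l2  = np.linalg.norm(x, 2)        # L^2 (Euclidean) -> 5.0
inf = np.linalg.norm(x, np.inf)   # max norm: largest |x_i| -> 4.0
nnz = np.count_nonzero(x)         # number of nonzeros -- NOT a norm -> 2

A = np.arange(6.0).reshape(2, 3)
fro = np.linalg.norm(A)           # Frobenius norm (the default for 2-D arrays)
print(l1, l2, inf, nnz, fro)
```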
Special Matrices and Vectors
• Diagonal matrices need not be square. It is possible to construct a rectangular diagonal matrix. Non-square diagonal matrices do not have inverses, but it is still possible to multiply by them cheaply: for a non-square diagonal matrix D, the product Dx involves scaling each element of x, and either concatenating some zeros to the result if D is taller than it is wide, or discarding some of the last elements of the vector if D is wider than it is tall.
• Symmetric matrix: any matrix that is equal to its own transpose:

A = A^T.   (2.35)

Symmetric matrices often arise when the entries are generated by some function of two arguments that does not depend on the order of the arguments. For example, if A is a matrix of distance measurements, with A_{i,j} giving the distance from point i to point j, then A_{i,j} = A_{j,i} because distance functions are symmetric.
• Unit vector: a vector with unit norm:

||x||_2 = 1.   (2.36)

• A vector x and a vector y are orthogonal to each other if x^T y = 0. If both vectors have nonzero norm, this means that they are at a 90 degree angle to each other. In R^n, at most n vectors may be mutually orthogonal with nonzero norm. If the vectors are not only orthogonal but also have unit norm, we call them orthonormal.
• Orthogonal matrix: a square matrix whose rows are mutually orthonormal and whose columns are mutually orthonormal:

A^T A = A A^T = I,   (2.37)

which implies

A^{-1} = A^T.   (2.38)
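A small sketch (my addition) verifying the orthogonal-matrix identities (2.37)–(2.38) on a 2-D rotation matrix, which is orthogonal by construction:

```python
import numpy as np

theta = 0.3
Q = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

print(np.allclose(Q.T @ Q, np.eye(2)))        # Q^T Q = I        (2.37)
print(np.allclose(np.linalg.inv(Q), Q.T))     # Q^{-1} = Q^T     (2.38)
print(np.isclose(np.linalg.norm(Q @ np.array([3.0, 4.0])), 5.0))  # preserves norms
```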
Eigendecomposition
• One of the most widely used kinds of matrix decomposition is called eigendecomposition, in which we decompose a matrix into a set of eigenvectors and eigenvalues. Decomposing matrices into their eigenvalues and eigenvectors can help us analyze certain properties of the matrix, much as decomposing an integer into its prime factors can help us understand the behavior of that integer — often more informative than viewing the matrix as an array of elements.
• Eigenvector and eigenvalue: an eigenvector of a square matrix A is a non-zero vector v such that multiplication by A alters only the scale of v:

Av = λv,   (2.39)

where the scalar λ is known as the eigenvalue corresponding to this eigenvector. (One can also find a left eigenvector such that v^T A = λ v^T, but we are usually concerned with right eigenvectors.) If v is an eigenvector of A, then so is any rescaled vector sv for s ∈ R, s ≠ 0, with the same eigenvalue; for this reason, we usually look only for unit eigenvectors.
• Suppose a matrix A has n linearly independent eigenvectors. We can form a matrix V with one eigenvector per column, V = [v^{(1)}, ..., v^{(n)}], and we can concatenate the eigenvalues to form a vector λ = [λ_1, ..., λ_n].
• Eigendecomposition of a diagonalizable matrix: the eigendecomposition of A is then given by

A = V diag(λ) V^{-1}.   (2.40)

• Not every matrix can be decomposed into eigenvalues and eigenvectors; in some cases the decomposition exists but may involve complex rather than real numbers. In this book, we usually need to decompose only a specific class of matrices that have a simple decomposition. Specifically:
• Every real symmetric matrix has a real, orthogonal eigendecomposition:

A = QΛQ^T,   (2.41)

where Q is an orthogonal matrix composed of eigenvectors of A, and Λ is a diagonal matrix. The eigenvalue Λ_{i,i} is associated with the eigenvector in column i of Q.
Effect of Eigenvectors and Eigenvalues
[Figure: an example of the effect of eigenvectors and eigenvalues. Here, we have a matrix A with two orthonormal eigenvectors: v^{(1)} with eigenvalue λ_1 and v^{(2)} with eigenvalue λ_2. (Left panel, "Before multiplication") The set of all unit vectors u ∈ R^2, plotted as a unit circle. (Right panel, "After multiplication") The set of all points Au. By observing the way that A distorts the unit circle, we can see that it scales space in direction v^{(i)} by λ_i.]
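A minimal sketch (my addition, with a made-up symmetric matrix) of the eigendecomposition A = QΛQ^T from Eq. 2.41, using np.linalg.eigh:

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])        # real symmetric

lam, Q = np.linalg.eigh(A)        # eigenvalues (ascending) and orthonormal eigenvectors
print(lam)                        # [1. 3.]
print(np.allclose(Q @ np.diag(lam) @ Q.T, A))   # reconstructs A (Eq. 2.41)
v = Q[:, 0]
print(np.allclose(A @ v, lam[0] * v))           # Av = lambda v (Eq. 2.39)
```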
Singular Value Decomposition
• The SVD is more generally applicable than the eigendecomposition: every real matrix has a singular value decomposition, but the same is not true of the eigenvalue decomposition. For example, if a matrix is not square, the eigendecomposition is not defined, and we must use a singular value decomposition instead.
• Similar to eigendecomposition: recall that the eigendecomposition involves analyzing a matrix A to discover a matrix V of eigenvectors and a vector λ of eigenvalues such that we can rewrite A as

A = V diag(λ) V^{-1}.   (2.42)

• The singular value decomposition instead writes

A = U D V^T,   (2.43)

where U and V are orthogonal matrices and D is a diagonal (not necessarily square) matrix whose entries are the singular values of A.
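A short sketch (my addition, with an arbitrary non-square matrix): the SVD exists even where the eigendecomposition is undefined.

```python
import numpy as np

A = np.arange(6.0).reshape(2, 3)            # 2 x 3, not square

U, s, Vt = np.linalg.svd(A)                 # A = U D V^T; s holds the singular values
D = np.zeros_like(A)
D[:len(s), :len(s)] = np.diag(s)            # embed the singular values in a 2 x 3 diagonal D

print(np.allclose(U @ D @ Vt, A))           # True: reconstruction
print(np.allclose(U.T @ U, np.eye(2)))      # U is orthogonal; likewise V
```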
Pseudoinverse
• The pseudoinverse of A is computed from the SVD:

A^+ = V D^+ U^T,   (2.47)

where U, D and V are the singular value decomposition of A, and the pseudoinverse D^+ of a diagonal matrix D is obtained by taking the reciprocal of its non-zero elements then taking the transpose of the resulting matrix.
• When A has more columns than rows, then solving a linear equation using the pseudoinverse provides one of the many possible solutions. Specifically, it provides the solution x = A^+ y with minimal Euclidean norm ||x||_2 among all possible solutions.
• If the equation has:
  • Exactly one solution: this is the same as the inverse.
  • No solution: it gives the x for which ||Ax − y||_2 is minimal.
  • Many solutions: it gives the minimum-norm solution, as above.
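A sketch (my addition, with a made-up underdetermined system of 2 equations in 3 unknowns) showing that np.linalg.pinv picks the minimum-norm solution:

```python
import numpy as np

A = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])   # more columns than rows
y = np.array([6.0, 15.0])

x = np.linalg.pinv(A) @ y         # x = A^+ y
print(np.allclose(A @ x, y))      # an exact solution here (the system is consistent)
print(np.linalg.norm(x))          # and the smallest-norm one among all solutions
```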
Trace
• The trace operator gives the sum of the diagonal entries of a square matrix:

Tr(A) = Σ_i A_{i,i}.   (2.48)

The trace operator is useful for a variety of reasons: some operations that are difficult to specify without resorting to summation notation can be specified using matrix products and the trace.
• The trace of a square matrix composed of many factors is also invariant to moving the last factor into the first position, if the shapes of the corresponding matrices allow the resulting product to be defined:

Tr(ABC) = Tr(CAB) = Tr(BCA),   (2.51)

or more generally,

Tr( Π_{i=1}^{n} F^{(i)} ) = Tr( F^{(n)} Π_{i=1}^{n−1} F^{(i)} ).   (2.52)
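A tiny sketch (my addition, with arbitrary random matrices) checking the cyclic-invariance property (2.51); note the three products have different shapes but the same trace:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((2, 3))
B = rng.standard_normal((3, 4))
C = rng.standard_normal((4, 2))   # shapes chosen so every cyclic product is defined

t1 = np.trace(A @ B @ C)          # trace of a 2 x 2 product
t2 = np.trace(C @ A @ B)          # trace of a 4 x 4 product
t3 = np.trace(B @ C @ A)          # trace of a 3 x 3 product
print(np.isclose(t1, t2) and np.isclose(t2, t3))   # True
```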
Gradient
• A gradient of a function is a vector of partial derivatives. You can look at finding a partial derivative of a function as the process of finding the derivative while focusing on one of the function's inputs and considering all other inputs as constant values.
• For example, if our function is defined as f([x^{(1)}, x^{(2)}]) = ax^{(1)} + bx^{(2)} + c, then the partial derivative of f with respect to x^{(1)}, denoted ∂f/∂x^{(1)}, is given by

∂f/∂x^{(1)} = a + 0 + 0 = a,

where a is the derivative of the function ax^{(1)}; the two zeroes are respectively the derivatives of bx^{(2)} and c, because x^{(2)} is considered constant when we compute the derivative with respect to x^{(1)}, and the derivative of any constant is zero.
• Similarly, the partial derivative of f with respect to x^{(2)}, ∂f/∂x^{(2)}, is given by

∂f/∂x^{(2)} = 0 + b + 0 = b.

• The gradient of the function f, denoted ∇f, is given by the vector [∂f/∂x^{(1)}, ∂f/∂x^{(2)}].
• The chain rule works with partial derivatives too, as I illustrate in Chapter 4.
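A small sketch (my addition, with made-up coefficients a, b, c) checking these partial derivatives numerically with central differences:

```python
import numpy as np

a, b, c = 2.0, -3.0, 5.0
f = lambda x: a * x[0] + b * x[1] + c   # f([x1, x2]) = a*x1 + b*x2 + c

def numerical_gradient(f, x, eps=1e-6):
    """Approximate each partial derivative by a central difference."""
    grad = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x); e[i] = eps
        grad[i] = (f(x + e) - f(x - e)) / (2 * eps)
    return grad

x = np.array([1.0, 2.0])
print(numerical_gradient(f, x))   # ~[ 2. -3.], i.e. [a, b], matching the derivation above
```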
Random Variables
• A random variable is a variable whose possible values are numerical outcomes of a random phenomenon.
• Example: a roll of a die.
• The probability distribution of a discrete random variable is described by a list of probabilities associated with each of its possible values: the probability mass function (pmf).
• Each probability in a probability mass function is a value greater than or equal to 0, and the sum of the probabilities equals 1.
Continuous Random Variable
• A continuous random variable (CRV) takes an infinite number of possible
values in some interval.
Expectation
• Let a discrete random variable X have k possible values {x_i}_{i=1}^{k}.
• The expectation of X, denoted as E[X], is given by

E[X] ≝ Σ_{i=1}^{k} [x_i · Pr(X = x_i)]   (1)
     = x_1 · Pr(X = x_1) + x_2 · Pr(X = x_2) + ··· + x_k · Pr(X = x_k),

where Pr(X = x_i) is the probability that X has the value x_i according to the pmf. The expectation of a random variable is also called the mean, average or expected value and is frequently denoted with the letter µ. The expectation is one of the most important statistics of a random variable.
• The expectation of a continuous random variable X is given by

E[X] ≝ ∫_R x f_X(x) dx,   (2)

where f_X is the pdf of the variable X and ∫_R is the integral of the function x f_X over its domain R.
Standard Deviation & Variance
• Another important statistic is the standard deviation, defined as

σ ≝ √( E[(X − µ)^2] ).

• Variance, denoted as σ^2 or var(X), is defined as

σ^2 = E[(X − µ)^2].

• For a discrete random variable, the standard deviation is given by

σ = √( Pr(X = x_1)(x_1 − µ)^2 + Pr(X = x_2)(x_2 − µ)^2 + ··· + Pr(X = x_k)(x_k − µ)^2 ),

where µ = E[X].
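A sketch (my addition, with a made-up pmf) computing E[X], variance and standard deviation for a discrete random variable exactly as defined above:

```python
import numpy as np

xs = np.array([1.0, 2.0, 3.0])      # possible values x_i (hypothetical)
ps = np.array([0.2, 0.5, 0.3])      # Pr(X = x_i); non-negative and sums to 1
assert np.isclose(ps.sum(), 1.0)

mu    = np.sum(xs * ps)             # E[X] per eq. (1)
var   = np.sum(ps * (xs - mu)**2)   # sigma^2 = E[(X - mu)^2]
sigma = np.sqrt(var)                # standard deviation
print(mu, var, sigma)               # 2.1, 0.49, 0.7
```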
Unbiased Estimators
2.3 Unbiased Estimators
Because f_X is usually unknown, but we have a sample S_X = {x_i}_{i=1}^{N}, we often content ourselves not with the true values of statistics of the probability distribution, such as expectation, but with their unbiased estimators.
We say that θ̂(S_X) is an unbiased estimator of some statistic θ calculated using a sample S_X drawn from an unknown probability distribution if θ̂(S_X) has the following property:

E[ θ̂(S_X) ] = θ,

where θ̂ is a sample statistic, obtained using a sample S_X and not the real statistic θ that can be obtained only knowing f_X; the expectation is taken over all possible samples drawn from X. Intuitively, this means that if you can have an unlimited number of such samples as S_X, and you compute some unbiased estimator, such as µ̂, using each sample, then the average of all these µ̂ equals the real statistic µ that you would get computed on X.
It can be shown that an unbiased estimator of an unknown E[X] (given by either eq. 1 or eq. 2) is given by (1/N) Σ_{i=1}^{N} x_i (called in statistics the sample mean).
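A simulation sketch of the unbiasedness property (my addition; the Gaussian distribution and its parameters are arbitrary choices): averaging the sample mean µ̂ over many independent samples approaches the true µ.

```python
import numpy as np

rng = np.random.default_rng(42)
true_mu, sigma, N = 5.0, 2.0, 30

mu_hats = [rng.normal(true_mu, sigma, size=N).mean()   # mu-hat computed on one sample S_X
           for _ in range(100_000)]
print(np.mean(mu_hats))   # ~5.0: E[mu-hat] = mu, so the sample mean is unbiased
```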
Bayes’ Rule
2.4 Bayes’ Rule
• The conditional probability Pr(X = x | Y = y) is the probability of the random variable X to have a specific value x given that another random variable Y has a specific value of y.
• The Bayes’ Rule (also known as the Bayes’ Theorem) stipulates that:

Pr(X = x | Y = y) = Pr(Y = y | X = x) Pr(X = x) / Pr(Y = y).

• Bayes’ Rule comes in handy when we have a model of X’s distribution, and this model f_θ is a function that has some parameters in the form of a vector θ. An example of such a function could be the Gaussian function that has two parameters, µ and σ, and is defined as:

f_θ(x) = (1 / (σ√(2π))) exp( −(x − µ)^2 / (2σ^2) ),

where θ ≝ [µ, σ].
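A toy sketch of the rule (my addition, with made-up probabilities) for two binary random variables X and Y:

```python
p_x = 0.01            # Pr(X = 1), the prior (hypothetical value)
p_y_given_x1 = 0.9    # Pr(Y = 1 | X = 1)
p_y_given_x0 = 0.05   # Pr(Y = 1 | X = 0)

# Total probability: Pr(Y = 1) = sum over x of Pr(Y = 1 | X = x) * Pr(X = x)
p_y = p_y_given_x1 * p_x + p_y_given_x0 * (1 - p_x)

# Bayes' Rule: Pr(X = 1 | Y = 1) = Pr(Y = 1 | X = 1) * Pr(X = 1) / Pr(Y = 1)
posterior = p_y_given_x1 * p_x / p_y
print(posterior)      # ~0.154: the prior 0.01 updated by observing Y = 1
```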
Some Definitions
• Parameters vs Hyperparameters
• A hyperparameter is a property of a learning algorithm, usually (but not always) having a
numerical value. That value influences the way the algorithm works. Hyperparameters aren’t
learned by the algorithm itself from data.
• Parameters are variables that define the model learned by the learning algorithm.
Parameters are directly modified by the learning algorithm based on the training data.
• Classification vs Regression