2. Mathematical Background


An Overview on
Linear Algebra
Scalars
• A scalar is a single number

• Integers, real numbers, rational numbers, etc.

• We denote it with italic font: a, n, x
Vectors
• A vector is a 1-D array of numbers. We can identify each individual number by its index in that ordering.

• Typically we give vectors lower-case names written in bold typeface, such as x. The elements of the vector are identified by writing its name in italic typeface with a subscript: the first element of x is x_1, the second element is x_2, and so on.

• Can be real, binary, integer, etc. We also need to say what kind of numbers are stored in the vector: if each element is in R and the vector has n elements, then the vector lies in the set formed by taking the Cartesian product of R n times, denoted R^n. Example notation for type and size: x ∈ R^n.

• When we need to explicitly identify the elements of a vector, we write them as a column enclosed in square brackets:

$$\mathbf{x} = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix} \tag{2.1}$$

• We can think of vectors as identifying points in space, with each element giving the coordinate along a different axis.

• Sometimes we need to index a set of elements of a vector. In this case, we define a set containing the indices and write the set as a subscript. For example, to access x_1, x_3 and x_6, we define the set S = {1, 3, 6} and write x_S. We use the − sign to index the complement of a set: x_{−1} is the vector containing all elements of x except for x_1, and x_{−S} is the vector containing all of the elements of x except for x_1, x_3 and x_6.
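A minimal NumPy sketch of this indexing convention (the values are illustrative; NumPy indices are 0-based, so the 1-based set S = {1, 3, 6} from the text becomes [0, 2, 5]):

```python
import numpy as np

x = np.array([10., 20., 30., 40., 50., 60.])

# 1-based {1, 3, 6} becomes 0-based indices [0, 2, 5]
S = [0, 2, 5]
x_S = x[S]                 # elements x_1, x_3, x_6 -> [10., 30., 60.]

# Complement indexing: all elements of x except those in S
mask = np.ones(len(x), dtype=bool)
mask[S] = False
x_not_S = x[mask]          # [20., 40., 50.]
```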

Matrices
• A matrix is a 2-D array of numbers, so each element is identified by two indices instead of just one.

• We usually give matrices upper-case variable names with bold typeface, such as A. Example notation for type and shape: if a real-valued matrix A has a height of m and a width of n, we say that A ∈ R^{m×n}.

• We usually identify the elements using the matrix name in italic but not bold font: A_{i,j} is the element in row i and column j, and A_{:,i} is the i-th column of A. When we need to explicitly identify the elements of a matrix, we write them as an array enclosed in square brackets:

$$\begin{bmatrix} A_{1,1} & A_{1,2} \\ A_{2,1} & A_{2,2} \end{bmatrix} \tag{2.2}$$

• Sometimes we may need to index matrix-valued expressions that are not just a single letter. In this case, we use subscripts after the expression, but do not convert anything to lower case. For example, f(A)_{i,j} gives element (i, j) of the matrix computed by applying the function f to A.
Tensors
• In some cases we will need an array with more than two axes. In the general case, an array of numbers arranged on a regular grid with a variable number of axes is known as a tensor.

• A tensor is an array of numbers that may have

• zero dimensions, and be a scalar

• one dimension, and be a vector

• two dimensions, and be a matrix

• or more dimensions.

• We denote a tensor named "A" by writing its name in italic but not bold font, and we identify the element of A at coordinates (i, j, k) by writing A_{i,j,k}.
Matrix Transpose
• One important operation on matrices is the transpose. The transpose of a matrix is the mirror image of the matrix across a diagonal line, called the main diagonal, running down and to the right, starting from its upper left corner. We denote the transpose of a matrix A as A^⊤, and it is defined such that

$$(\mathbf{A}^\top)_{i,j} = A_{j,i}. \tag{2.3}$$

$$\mathbf{A} = \begin{bmatrix} A_{1,1} & A_{1,2} \\ A_{2,1} & A_{2,2} \\ A_{3,1} & A_{3,2} \end{bmatrix} \;\Rightarrow\; \mathbf{A}^\top = \begin{bmatrix} A_{1,1} & A_{2,1} & A_{3,1} \\ A_{1,2} & A_{2,2} & A_{3,2} \end{bmatrix}$$

Figure 2.1: The transpose of the matrix can be thought of as a mirror image across the main diagonal.

• Vectors can be thought of as matrices that contain only one column. The transpose of a vector is therefore a matrix with only one row.

• Matrix operations have many useful properties that make mathematical analysis more convenient. For example, matrix multiplication is distributive and associative:

$$\mathbf{A}(\mathbf{B} + \mathbf{C}) = \mathbf{A}\mathbf{B} + \mathbf{A}\mathbf{C} \tag{2.6}$$

$$\mathbf{A}(\mathbf{B}\mathbf{C}) = (\mathbf{A}\mathbf{B})\mathbf{C} \tag{2.7}$$

• Matrix multiplication is not commutative (the condition AB = BA does not always hold), unlike scalar multiplication. However, the dot product between two vectors is commutative:

$$\mathbf{x}^\top \mathbf{y} = \mathbf{y}^\top \mathbf{x}. \tag{2.8}$$

One can demonstrate Eq. 2.8 by exploiting the fact that the value of such a product is a scalar and therefore equal to its own transpose: x^⊤y = (x^⊤y)^⊤ = y^⊤x.

• The transpose of a matrix product has a simple form:

$$(\mathbf{A}\mathbf{B})^\top = \mathbf{B}^\top \mathbf{A}^\top. \tag{2.9}$$
Matrix (Dot) Product
• The matrix product of matrices A and B is a third matrix C. In order for the product to be defined, A must have the same number of columns as B has rows (the inner dimensions must match): if A is of shape m × n and B is of shape n × p, then C is of shape m × p.

• We can write the matrix product just by placing two or more matrices together:

$$\mathbf{C} = \mathbf{A}\mathbf{B}. \tag{2.4}$$

• The product operation is defined by

$$C_{i,j} = \sum_k A_{i,k} B_{k,j}. \tag{2.5}$$

• The standard product of two matrices is not just a matrix containing the product of the individual elements. Such an operation exists and is called the element-wise or Hadamard product, and is denoted as A ⊙ B.

• The dot product between two vectors x and y of the same dimensionality is the matrix product x^⊤y. We can think of the matrix product C = AB as computing C_{i,j} as the dot product between row i of A and column j of B.
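A short sketch contrasting Eq. 2.5 written as explicit sums with NumPy's built-in matrix product (the shapes here are arbitrary examples):

```python
import numpy as np

m, n, p = 2, 3, 4
rng = np.random.default_rng(0)
A = rng.standard_normal((m, n))
B = rng.standard_normal((n, p))

# Eq. 2.5: C[i, j] = sum_k A[i, k] * B[k, j]
C = np.zeros((m, p))
for i in range(m):
    for j in range(p):
        for k in range(n):
            C[i, j] += A[i, k] * B[k, j]

assert np.allclose(C, A @ B)     # '@' is the standard matrix product
assert (A * A).shape == A.shape  # '*' is the element-wise (Hadamard) product
```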
Inverse Matrices
Identity Matrix
• Matrix inversion is a powerful tool that allows us to analytically solve Ax = b for many values of A.

• To describe matrix inversion, we first need to define the concept of an identity matrix: a matrix that does not change any vector when we multiply the vector by that matrix. We denote the identity matrix that preserves n-dimensional vectors as I_n. Formally, I_n ∈ R^{n×n}, and

$$\forall \mathbf{x} \in \mathbb{R}^n,\; \mathbf{I}_n \mathbf{x} = \mathbf{x}. \tag{2.20}$$

• The structure of the identity matrix is simple: all of the entries along the main diagonal are 1, while all of the other entries are zero. For example,

$$\mathbf{I}_3 = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}$$

(Figure 2.2: Example identity matrix: this is I_3.)

• We list only a few useful properties of the matrix product here, but be aware that many more exist.
Systems of Equations
• We now have enough linear algebra notation to write down a system of linear equations:

$$\mathbf{A}\mathbf{x} = \mathbf{b} \tag{2.11}$$

where A ∈ R^{m×n} is a known matrix, b ∈ R^m is a known vector, and x ∈ R^n is a vector of unknown variables we would like to solve for. Each element x_i of x is one of these unknown variables. Each row of A and each element of b provide another constraint.

• We can rewrite Eq. 2.11 as:

$$\mathbf{A}_{1,:}\mathbf{x} = b_1 \tag{2.12}$$

$$\mathbf{A}_{2,:}\mathbf{x} = b_2 \tag{2.13}$$

$$\dots \tag{2.14}$$

$$\mathbf{A}_{m,:}\mathbf{x} = b_m \tag{2.15}$$

or, more explicitly, as:

$$A_{1,1}x_1 + A_{1,2}x_2 + \dots + A_{1,n}x_n = b_1 \tag{2.16}$$

$$A_{2,1}x_1 + A_{2,2}x_2 + \dots + A_{2,n}x_n = b_2$$

$$\dots$$

$$A_{m,1}x_1 + A_{m,2}x_2 + \dots + A_{m,n}x_n = b_m.$$

• Matrix-vector product notation provides a more compact representation of equations of this form.
Solving Systems of Equations
• A linear system of equations can have:

• No solution

• Many solutions

• Exactly one solution: this means multiplication by the matrix is an invertible function
Matrix Inversion
• Matrix inverse: the inverse of A is denoted as A^{−1}, and it is defined as the matrix such that

$$\mathbf{A}^{-1}\mathbf{A} = \mathbf{I}_n. \tag{2.21}$$

• Solving a system using an inverse: we can now solve Eq. 2.11 by the following steps:

$$\mathbf{A}\mathbf{x} = \mathbf{b} \tag{2.22}$$

$$\mathbf{A}^{-1}\mathbf{A}\mathbf{x} = \mathbf{A}^{-1}\mathbf{b} \tag{2.23}$$

$$\mathbf{I}_n\mathbf{x} = \mathbf{A}^{-1}\mathbf{b} \tag{2.24}$$

$$\mathbf{x} = \mathbf{A}^{-1}\mathbf{b}$$

• Numerically unstable, but useful for abstract analysis
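A small sketch of that last point: x = A^{-1}b is algebraically correct, but in practice a solver such as np.linalg.solve is preferred for numerical stability (the matrix below is an arbitrary invertible example):

```python
import numpy as np

A = np.array([[3., 1.],
              [1., 2.]])
b = np.array([9., 8.])

x_inverse = np.linalg.inv(A) @ b  # the textbook derivation: x = A^{-1} b
x_solve = np.linalg.solve(A, b)   # preferred in practice (no explicit inverse)

assert np.allclose(x_inverse, x_solve)
assert np.allclose(A @ x_solve, b)  # x actually solves the system
```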
Invertibility

• Matrix can’t be inverted if…

• More rows than columns

• More columns than rows

• Redundant rows/columns (“linearly dependent”,


“low rank”)
Norms
• Functions that measure how "large" a vector is

• Similar to a distance between zero and the point represented by the vector

• Formally, the L^p norm is given by

$$||\mathbf{x}||_p = \left( \sum_i |x_i|^p \right)^{1/p}$$

for p ∈ R, p ≥ 1.

• Norms, including the L^p norm, are functions mapping vectors to non-negative values. On an intuitive level, the norm of a vector x measures the distance from the origin to the point x. More rigorously, a norm is any function f that satisfies the following properties:

• f(x) = 0 ⇒ x = 0

• f(x + y) ≤ f(x) + f(y) (the triangle inequality)

• ∀α ∈ R, f(αx) = |α| f(x)

• The L^2 norm, with p = 2, is known as the Euclidean norm: the Euclidean distance from the origin to the point identified by x.
• Most popular norm: the L^2 norm, p = 2. However, the squared L^2 norm increases very slowly near the origin. In several machine learning applications, it is important to discriminate between elements that are exactly zero and elements that are small but nonzero. In these cases, we turn to a function that grows at the same rate in all locations, but retains mathematical simplicity: the L^1 norm.

• L^1 norm, p = 1. The L^1 norm may be simplified to

$$||\mathbf{x}||_1 = \sum_i |x_i|. \tag{2.31}$$

• The L^1 norm is commonly used in machine learning when the difference between zero and nonzero elements is very important: every time an element of x moves away from 0 by ε, the L^1 norm increases by ε.

• We sometimes measure the size of the vector by counting its number of nonzero elements. Some authors refer to this function as the "L^0 norm," but this is incorrect terminology: the number of non-zero entries in a vector is not a norm, because scaling the vector by α does not change the number of nonzero entries. The L^1 norm is often used as a substitute for the number of nonzero entries.

• Max norm, infinite p: one other norm that commonly arises in machine learning is the L^∞ norm, also known as the max norm. This norm simplifies to the absolute value of the element with the largest magnitude in the vector:

$$||\mathbf{x}||_\infty = \max_i |x_i|. \tag{2.32}$$

• Sometimes we may also wish to measure the size of a matrix. In the context of deep learning, the most common way to do this is with the otherwise obscure Frobenius norm:

$$||\mathbf{A}||_F = \sqrt{\sum_{i,j} A_{i,j}^2}$$
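A minimal sketch computing these norms on an example vector; np.linalg.norm's ord argument corresponds to p:

```python
import numpy as np

x = np.array([3., -4., 0.])

l2 = np.linalg.norm(x, ord=2)         # Euclidean norm: sqrt(9 + 16) = 5.0
l1 = np.linalg.norm(x, ord=1)         # sum of absolute values = 7.0
linf = np.linalg.norm(x, ord=np.inf)  # max absolute value = 4.0
nonzeros = np.count_nonzero(x)        # the (non-norm) "L0" count = 2

A = np.array([[1., 2.], [3., 4.]])
frob = np.linalg.norm(A, ord='fro')   # Frobenius norm: sqrt(1+4+9+16)
```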
Special Matrices and Vectors
• Diagonal matrices: non-square diagonal matrices do not have inverses, but it is still possible to compute with them cheaply. For a non-square diagonal matrix D, the product Dx involves scaling each element of x, and either concatenating some zeros to the result if D is taller than it is wide, or discarding some of the last elements of the vector if D is wider than it is tall.

• Symmetric Matrix: any matrix that is equal to its own transpose:

$$\mathbf{A} = \mathbf{A}^\top. \tag{2.35}$$

Symmetric matrices often arise when the entries are generated by some function of two arguments that does not depend on the order of the arguments. For example, in distance measurements, with A_{i,j} giving the distance from point i to point j, we have A_{i,j} = A_{j,i} because distance functions are symmetric.

• Unit vector: a vector with unit norm:

$$||\mathbf{x}||_2 = 1. \tag{2.36}$$

• A vector x and a vector y are orthogonal to each other if x^⊤y = 0. If both vectors have nonzero norm, this means that they are at a 90 degree angle to each other. In R^n, at most n vectors may be mutually orthogonal with nonzero norm. If the vectors are not only orthogonal but also have unit norm, we call them orthonormal.

• Orthogonal matrix: a square matrix whose rows are mutually orthonormal and whose columns are mutually orthonormal:

$$\mathbf{A}^\top \mathbf{A} = \mathbf{A}\mathbf{A}^\top = \mathbf{I}, \tag{2.37}$$

which implies

$$\mathbf{A}^{-1} = \mathbf{A}^\top. \tag{2.38}$$
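A small numeric sketch of these definitions; the rotation matrix is one arbitrary example of an orthogonal matrix:

```python
import numpy as np

# Symmetric: A equals its own transpose (Eq. 2.35)
A = np.array([[1., 2.], [2., 5.]])
assert np.allclose(A, A.T)

# Unit vector: L2 norm equal to 1 (Eq. 2.36)
x = np.array([0.6, 0.8])
assert np.isclose(np.linalg.norm(x), 1.0)

# Orthogonal matrix: a 2-D rotation; Q^T Q = I, so the inverse is just Q^T
t = 0.3
Q = np.array([[np.cos(t), -np.sin(t)],
              [np.sin(t),  np.cos(t)]])
assert np.allclose(Q.T @ Q, np.eye(2))          # Eq. 2.37
assert np.allclose(np.linalg.inv(Q), Q.T)       # Eq. 2.38
```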
Eigendecomposition
• One of the most widely used kinds of matrix decomposition is called eigendecomposition, in which we decompose a matrix into a set of eigenvectors and eigenvalues. Decomposing matrices into their eigenvalues and eigenvectors can help us analyze certain properties of the matrix, much as decomposing an integer into its prime factors can help us understand the behavior of that integer.

• Eigenvector and eigenvalue: an eigenvector of a square matrix A is a non-zero vector v such that multiplication by A alters only the scale of v:

$$\mathbf{A}\mathbf{v} = \lambda \mathbf{v}, \tag{2.39}$$

where the scalar λ is known as the eigenvalue corresponding to this eigenvector. (One can also define a left eigenvector such that v^⊤A = λv^⊤, but we are usually concerned with right eigenvectors.) If v is an eigenvector of A, then so is any rescaled vector sv for s ∈ R, s ≠ 0, and sv has the same eigenvalue. For this reason, we usually look only for unit eigenvectors.

• Eigendecomposition of a diagonalizable matrix: we can form a matrix V with one eigenvector per column, V = [v^(1), ..., v^(n)], and concatenate the eigenvalues to form a vector λ = [λ_1, ..., λ_n]. The eigendecomposition of A is then given by

$$\mathbf{A} = \mathbf{V} \operatorname{diag}(\boldsymbol{\lambda}) \mathbf{V}^{-1}. \tag{2.40}$$

• Not every matrix can be decomposed into eigenvalues and eigenvectors with real numbers; in some cases the decomposition exists, but may involve complex rather than real numbers. We usually need to decompose only a specific class of matrices that have a simple decomposition. Specifically, every real symmetric matrix can be decomposed into an expression using only real-valued eigenvectors and eigenvalues:

$$\mathbf{A} = \mathbf{Q}\boldsymbol{\Lambda}\mathbf{Q}^\top, \tag{2.41}$$

where Q is an orthogonal matrix composed of eigenvectors of A, and Λ is a diagonal matrix; the eigenvalue Λ_{i,i} is associated with the eigenvector in column i of Q.
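A minimal sketch of Eq. 2.41 for a real symmetric matrix, using np.linalg.eigh (which returns eigenvalues and orthonormal eigenvectors):

```python
import numpy as np

A = np.array([[2., 1.],
              [1., 2.]])        # a real symmetric example

eigvals, Q = np.linalg.eigh(A)  # columns of Q are orthonormal eigenvectors
Lam = np.diag(eigvals)

assert np.allclose(Q @ Lam @ Q.T, A)   # A = Q Lambda Q^T (Eq. 2.41)
v, lam = Q[:, 0], eigvals[0]
assert np.allclose(A @ v, lam * v)     # Av = lambda v (Eq. 2.39)
```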
Effect of Eigenvalues
• (Figure) An example of the effect of eigenvectors and eigenvalues: here we have two eigenvectors, v^(1) with eigenvalue λ_1 and v^(2) with eigenvalue λ_2. Before multiplication, we plot the set of all unit vectors u ∈ R^2 as a unit circle; after multiplication, we plot the set of all points Au. By observing the way that A distorts the unit circle, we can see that it scales space in direction v^(i) by λ_i.
Singular Value Decomposition
• Similar to eigendecomposition, which analyzes a matrix A to discover a matrix V of eigenvectors and a vector λ of eigenvalues such that we can rewrite A as

$$\mathbf{A} = \mathbf{V} \operatorname{diag}(\boldsymbol{\lambda}) \mathbf{V}^{-1}. \tag{2.42}$$

• More general; the matrix need not be square. The singular value decomposition is more broadly applicable: every real matrix has a singular value decomposition, but the same is not true of the eigenvalue decomposition. For example, if a matrix is not square, the eigendecomposition is not defined, and we must use a singular value decomposition instead.

• The SVD writes A as a product of three matrices:

$$\mathbf{A} = \mathbf{U}\mathbf{D}\mathbf{V}^\top. \tag{2.43}$$

• Suppose that A is an m × n matrix. Then U is defined to be an m × m matrix, D an m × n matrix, and V an n × n matrix.
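A short sketch of Eq. 2.43 on a non-square example; np.linalg.svd returns V^⊤ directly, and full_matrices=True gives the m×m and n×n shapes described above:

```python
import numpy as np

A = np.arange(6.).reshape(2, 3)                  # a 2x3 (non-square) example

U, s, Vt = np.linalg.svd(A, full_matrices=True)  # U: 2x2, Vt: 3x3
D = np.zeros(A.shape)                            # D is m x n with the singular
D[:len(s), :len(s)] = np.diag(s)                 # values on its diagonal

assert np.allclose(U @ D @ Vt, A)                # A = U D V^T (Eq. 2.43)
```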
Pseudoinverse
• The Moore-Penrose pseudoinverse allows us to make some headway in cases where A is not invertible. The pseudoinverse of A is defined as the matrix

$$\mathbf{A}^+ = \lim_{\alpha \searrow 0} \left(\mathbf{A}^\top\mathbf{A} + \alpha\mathbf{I}\right)^{-1} \mathbf{A}^\top. \tag{2.46}$$

• When solving a linear equation using the pseudoinverse, x = A^+ y, if the equation has:

• Exactly one solution: this is the same as the inverse.

• No solution (possible when A has more rows than columns): this gives us the x for which Ax is as close as possible to y in terms of Euclidean norm ||Ax − y||_2.

• Many solutions (possible when A has more columns than rows): this gives us one of the many possible solutions, specifically the solution x = A^+ y with minimal Euclidean norm ||x||_2 among all possible solutions.
Computing the Pseudoinverse
• Practical algorithms for computing the pseudoinverse are not based on this definition, but rather on the formula

$$\mathbf{A}^+ = \mathbf{V}\mathbf{D}^+\mathbf{U}^\top, \tag{2.47}$$

where U, D and V are the singular value decomposition of A.

• The SVD allows the computation of the pseudoinverse: the pseudoinverse D^+ of a diagonal matrix D is obtained by taking the reciprocal of its non-zero entries and then taking the transpose of the resulting matrix.
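A minimal sketch of Eq. 2.47, building D^+ from the SVD and comparing against np.linalg.pinv (the matrix is an arbitrary wide example, so x = A^+ y is the minimum-norm solution):

```python
import numpy as np

A = np.array([[1., 2., 3.],
              [4., 5., 6.]])                    # more columns than rows

U, s, Vt = np.linalg.svd(A, full_matrices=True)
D_plus = np.zeros(A.shape).T                    # D^+ is n x m
D_plus[:len(s), :len(s)] = np.diag(1.0 / s)     # reciprocal of the non-zero
                                                # singular values

A_plus = Vt.T @ D_plus @ U.T                    # A^+ = V D^+ U^T (Eq. 2.47)
assert np.allclose(A_plus, np.linalg.pinv(A))

# Underdetermined system: x = A^+ y satisfies Ax = y with minimal ||x||_2
y = np.array([1., 2.])
x = A_plus @ y
assert np.allclose(A @ x, y)
```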
The Trace Operator
• The trace operator gives the sum of all of the diagonal entries of a matrix:

$$\operatorname{Tr}(\mathbf{A}) = \sum_i A_{i,i}. \tag{2.48}$$

• The trace operator is useful for a variety of reasons. Some operations that are difficult to specify without resorting to summation notation can be specified using matrix products and the trace operator, and such an expression can be rewritten using many useful identities. For example, the trace operator is invariant to the transpose operator:

$$\operatorname{Tr}(\mathbf{A}) = \operatorname{Tr}(\mathbf{A}^\top). \tag{2.50}$$

• The trace of a square matrix composed of many factors is also invariant to moving the last factor into the first position, if the shapes of the corresponding matrices allow the resulting product to be defined:

$$\operatorname{Tr}(\mathbf{A}\mathbf{B}\mathbf{C}) = \operatorname{Tr}(\mathbf{C}\mathbf{A}\mathbf{B}) = \operatorname{Tr}(\mathbf{B}\mathbf{C}\mathbf{A}) \tag{2.51}$$

$$\operatorname{Tr}\left(\prod_{i=1}^{n} \mathbf{F}^{(i)}\right) = \operatorname{Tr}\left(\mathbf{F}^{(n)} \prod_{i=1}^{n-1} \mathbf{F}^{(i)}\right). \tag{2.52}$$

• This invariance to cyclic permutation holds even if the resulting product has a different shape. For example, for A ∈ R^{m×n} and B ∈ R^{n×m}, we have Tr(AB) = Tr(BA).
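A short sketch verifying these trace identities, including Tr(AB) = Tr(BA) for non-square factors (random matrices are used purely as examples):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((2, 3))
B = rng.standard_normal((3, 2))
C = rng.standard_normal((2, 2))

assert np.isclose(np.trace(C), np.trace(C.T))            # Eq. 2.50
assert np.isclose(np.trace(A @ B), np.trace(B @ A))      # AB is 2x2, BA is 3x3
assert np.isclose(np.trace(C @ A @ B),
                  np.trace(A @ B @ C))                   # cyclic permutation
```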
Learning linear algebra

• Do a lot of practice problems

• Start out with lots of summation signs and indexing


into individual entries

• Eventually you will be able to mostly use matrix and


vector product notation quickly and easily
An Overview on
Probability and Statistics
Derivative and Gradient
• A derivative f′ of a function f is a function or a value that

describes how fast f grows (or decreases)

• The process of finding a derivative is called differentiation

• If the function we want to differentiate is not basic, we can find


its derivative using the chain rule.

• For example

• if F(x) = (5x + 1)^2 then g(x) = 5x + 1 and f(g(x)) = (g(x))^2

• F ′(x) = 2(5x + 1)g′(x) = 2(5x + 1)5 = 50x + 10


• Derivatives of basic functions are known. For example, if f(x) = x², then f′(x) = 2x; if f(x) = 2x, then f′(x) = 2; if f(x) = 2, then f′(x) = 0 (the derivative of any function f(x) = c, where c is a constant value, is zero).

• Chain rule: if F(x) = f(g(x)), where f and g are some functions, then F′(x) = f′(g(x))g′(x). For example, if F(x) = (5x + 1)², then g(x) = 5x + 1 and f(g(x)) = (g(x))². By applying the chain rule, we find F′(x) = 2(5x + 1)g′(x) = 2(5x + 1) · 5 = 50x + 10.

• Gradient is the generalization of derivative for functions that take several inputs (or one input in the form of a vector or some other complex structure).

• A gradient of a function is a vector of partial derivatives. You can look at finding a partial derivative of a function as the process of finding the derivative while focusing on one of the function's inputs and considering all other inputs as constant values.

• For example, if our function is defined as f([x^(1), x^(2)]) = ax^(1) + bx^(2) + c, then the partial derivative of f with respect to x^(1), denoted ∂f/∂x^(1), is given by

$$\frac{\partial f}{\partial x^{(1)}} = a + 0 + 0 = a,$$

where a is the derivative of the term ax^(1); the two zeroes are respectively the derivatives of bx^(2) and c, because x^(2) is considered constant when we compute the derivative with respect to x^(1), and the derivative of any constant is zero.

• Similarly, the partial derivative of f with respect to x^(2) is given by

$$\frac{\partial f}{\partial x^{(2)}} = 0 + b + 0 = b.$$

• The gradient of function f, denoted as ∇f, is given by the vector [∂f/∂x^(1), ∂f/∂x^(2)].

• The chain rule works with partial derivatives too.
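A minimal sketch checking the analytic gradient [a, b] against finite differences (a, b, c are arbitrary example values):

```python
import numpy as np

a, b, c = 2.0, -3.0, 5.0

def f(x):
    # f([x1, x2]) = a*x1 + b*x2 + c
    return a * x[0] + b * x[1] + c

def numeric_gradient(f, x, eps=1e-6):
    grad = np.zeros_like(x)
    for i in range(len(x)):
        step = np.zeros_like(x)
        step[i] = eps
        grad[i] = (f(x + step) - f(x - step)) / (2 * eps)  # central difference
    return grad

x = np.array([1.0, 2.0])
assert np.allclose(numeric_gradient(f, x), [a, b])  # gradient is [a, b] everywhere
```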
Random Variables
• A random variable is a variable whose possible values are
numerical outcomes of a random phenomenon.

• Examples:

• toss of a coin (0 for heads and 1 for tails),

• a roll of a die,

• or the height of the first stranger you meet outside

• There are two types of random variables: discrete and


continuous.
Discrete Random Variable
• A discrete random variable takes on only a countable number of distinct values such as red, yellow, blue or 1, 2, 3, …

• The probability distribution of a discrete random variable is described by a list of probabilities associated with each
of its possible values.

• This list of probabilities is called a probability mass function (pmf).

• For example:

• Pr(X = red) = 0.3, Pr(X = yellow) = 0.45, Pr(X = blue) = 0.25.

• Each probability in a probability mass function is a value greater than or equal to 0. The sum of probabilities
equals 1.


Continuous Random
Variable
• A continuous random variable (CRV) takes an infinite number of possible
values in some interval.

• Examples include height, weight, and time.

• Because the number of values of a continuous random variable X is infinite, the


probability Pr(X = c) for any c is 0.

• Instead of the list of probabilities, the probability distribution of a CRV (a


continuous probability distribution) is described by a probability density
function (pdf).
Expectation
• Let a discrete random variable X have k possible values {x_i}_{i=1}^k.

• The expectation of X, denoted as E[X], is given by

$$\mathbb{E}[X] \stackrel{\text{def}}{=} \sum_{i=1}^{k} x_i \cdot \Pr(X = x_i) = x_1 \cdot \Pr(X = x_1) + x_2 \cdot \Pr(X = x_2) + \dots + x_k \cdot \Pr(X = x_k), \tag{1}$$

where Pr(X = x_i) is the probability that X has the value x_i according to the pmf. The expectation of a random variable is also called the mean, average or expected value and is frequently denoted with the letter µ. The expectation is one of the most important statistics of a random variable.

• The expectation of a continuous random variable X is given by

$$\mathbb{E}[X] \stackrel{\text{def}}{=} \int_{\mathbb{R}} x f_X(x)\, dx, \tag{2}$$

where f_X is the pdf of the variable X and ∫_R is the integral of the function x f_X.

Standard Deviation & Variance
• Another important statistic is the standard deviation, defined as

$$\sigma \stackrel{\text{def}}{=} \sqrt{\mathbb{E}\left[(X - \mu)^2\right]}.$$

• Variance, denoted as σ² or var(X), is defined as

$$\sigma^2 = \mathbb{E}\left[(X - \mu)^2\right].$$

• For a discrete random variable, the standard deviation is given by

$$\sigma = \sqrt{\Pr(X = x_1)(x_1 - \mu)^2 + \Pr(X = x_2)(x_2 - \mu)^2 + \dots + \Pr(X = x_k)(x_k - \mu)^2},$$

where µ = E[X].
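A small sketch computing µ and σ directly from a pmf, following the definitions above (the pmf values are illustrative):

```python
import numpy as np

values = np.array([1.0, 2.0, 3.0])
probs = np.array([0.3, 0.45, 0.25])       # an example pmf; must sum to 1
assert np.isclose(probs.sum(), 1.0)

mu = np.sum(values * probs)               # E[X], Eq. (1)
var = np.sum(probs * (values - mu) ** 2)  # sigma^2 = E[(X - mu)^2]
sigma = np.sqrt(var)                      # standard deviation
```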
Unbiased Estimators
• Because f_X is usually unknown, but we have a sample S_X = {x_i}_{i=1}^N, we often content ourselves not with the true values of statistics of the probability distribution, such as the expectation, but with their unbiased estimators.

• We say that θ̂(S_X) is an unbiased estimator of some statistic θ calculated using a sample S_X drawn from an unknown probability distribution if θ̂(S_X) has the following property:

$$\mathbb{E}\left[\hat{\theta}(S_X)\right] = \theta,$$

where θ̂ is a sample statistic, obtained using a sample S_X, and not the real statistic θ that can be obtained only by knowing f_X; the expectation is taken over all possible samples drawn from X.

• Intuitively, this means that if you can have an unlimited number of such samples as S_X, and you compute some unbiased estimator, such as µ̂, using each sample, then the average of all these µ̂ equals the real statistic µ that you would get computed on X.

• It can be shown that an unbiased estimator of an unknown E[X] (given by either eq. 1 or eq. 2) is given by (1/N) ∑_{i=1}^N x_i (called in statistics the sample mean).
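A short simulation sketch of this property: averaging the sample mean µ̂ over many independent samples approaches the true mean (the distribution and sample sizes are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(42)
true_mu = 1.5  # E[X] for an exponential distribution with scale 1.5

# Draw many samples S_X and compute the sample mean mu_hat on each
sample_means = [rng.exponential(scale=true_mu, size=100).mean()
                for _ in range(10_000)]

# The average of the unbiased estimates is close to the real statistic
print(np.mean(sample_means))  # ~1.5
```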
Bayes’ Rule
• The conditional probability Pr(X = x|Y = y) is the probability of the random variable X to have a specific value x given that another random variable Y has a specific value of y.

• The Bayes' Rule (also known as the Bayes' Theorem) stipulates that:

$$\Pr(X = x \mid Y = y) = \frac{\Pr(Y = y \mid X = x)\,\Pr(X = x)}{\Pr(Y = y)}.$$
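A minimal numeric sketch of the rule with made-up probabilities (a hypothetical test with known accuracy and a known base rate):

```python
# Hypothetical numbers: Pr(disease), Pr(positive | disease), Pr(positive | healthy)
p_x = 0.01          # Pr(X = disease)
p_y_given_x = 0.95  # Pr(Y = positive | X = disease)
p_y_given_not_x = 0.05

# Total probability: Pr(Y = positive)
p_y = p_y_given_x * p_x + p_y_given_not_x * (1 - p_x)

# Bayes' Rule: Pr(X = disease | Y = positive)
p_x_given_y = p_y_given_x * p_x / p_y
print(round(p_x_given_y, 3))  # ~0.161
```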

Parameter Estimation

Bayes' Rule comes in handy when we have a model of X's distribution, and this model f_θ is a function that has some parameters in the form of a vector θ. An example of such a function could be the Gaussian function that has two parameters, µ and σ, and is defined as:

$$f_\theta(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right),$$

where θ = [µ, σ].
Some Definitions
• Parameters vs Hyperparameters
• A hyperparameter is a property of a learning algorithm, usually (but not always) having a
numerical value. That value influences the way the algorithm works. Hyperparameters aren’t
learned by the algorithm itself from data.

• Parameters are variables that define the model learned by the learning algorithm.
Parameters are directly modified by the learning algorithm based on the training data.

• Classification vs Regression

• Classification is a problem of automatically assigning a label to an unlabeled


example.

• Regression is a problem of predicting a real-valued label (often called a


target) given an unlabeled example.
Some Definitions (2)

• Model-Based vs Instance-Based Learning

• Shallow Learning vs Deep Learning
