mean-variance
The mean of a random variable (also called its expected value or first moment) is
defined as
E(X) = ∑ₓ x P(X = x)
For example, if X is the number of spots that show up on the roll of a fair die, then
E(X) = (1/6)·1 + (1/6)·2 + (1/6)·3 + (1/6)·4 + (1/6)·5 + (1/6)·6 = 3.5
The mean tells us what the value of X will be on average: sometimes the value will be
larger, sometimes it will be smaller, but the total positive and negative fluctuations will
tend to cancel out over time.
The mean of X has the same units as X itself: e.g., if X is the height of a person in cm,
then E(X) is measured in cm as well.
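As a concrete illustration, here is a minimal NumPy sketch (the sample size and seed are arbitrary) that computes the die-roll mean both from the definition and by averaging simulated rolls:

```python
import numpy as np

# Exact mean from the definition: sum of value times probability.
values = np.arange(1, 7)
probs = np.full(6, 1 / 6)
exact_mean = np.sum(values * probs)          # 3.5

# Monte Carlo estimate: average many simulated rolls.
rng = np.random.default_rng(0)
rolls = rng.integers(1, 7, size=100_000)     # integers 1..6
estimated_mean = rolls.mean()                # close to 3.5

print(exact_mean, estimated_mean)
```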
Scatter plot
In low dimensions we can visualize our data with a scatter plot: we take a sample from
P (X) and plot each of the samples X1 , X2 , … as a single point.
The mean is the center of mass of the scatter plot. While there might not be any
samples exactly on the mean, the displacements of the individual samples away from
the mean tend to add up to zero. (The average displacement would be exactly zero in an
infinitely large sample.)
Linearity of expectation
One of the most useful properties for working with means is linearity of expectation: for
a real-valued random variable X and constants a, b,
E(aX + b) = a E(X) + b
That is, expectation is a linear function from random variables to real numbers, so that we can
pass linear functions into or out of expectations.
For example, since the expected number of spots on a die roll is 3.5, the expected value
of three times the number of spots is 10.5.
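A quick sketch of this identity for the die roll, with the illustrative constants a = 3 and b = 0:

```python
import numpy as np

# Distribution of a fair die roll.
values = np.arange(1, 7)
probs = np.full(6, 1 / 6)

a, b = 3, 0                                   # illustrative constants
lhs = np.sum((a * values + b) * probs)        # E(aX + b) computed directly
rhs = a * np.sum(values * probs) + b          # a E(X) + b
print(lhs, rhs)                               # both 10.5
```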
We can do this with more complicated expressions too. For example, the solution to the
normal equations XXᵀw = Xyᵀ depends linearly on y when X is fixed. So if y is a
random variable, then the solution w is also a random variable, and we can write
XXᵀ E(w) = X E(yᵀ).
Variance
The mean tells us what value our random variable will take on average, but it ignores
how much the variable fluctuates around this average. There are a number of ways to
quantify this fluctuation — that is, to answer how far a random variable will typically be
from its mean. The most widely used ways are a pair of related measurements called the
variance and the standard deviation.
The variance of X is defined as
V(X) = E((X − E(X))²)
In words, the variance is the expected squared difference between a random variable
and its mean.
For example, suppose X is a fair coin flip: it takes values 0 and 1 with equal probability.
The mean is E(X) = 1/2. The difference between X and E(X) is either +1/2 or −1/2, with
equal probability; squaring and averaging tells us that V(X) = 1/4.
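The same computation, written out as a short sketch:

```python
import numpy as np

# Fair coin flip: values 0 and 1 with probability 1/2 each.
values = np.array([0.0, 1.0])
probs = np.array([0.5, 0.5])

mean = np.sum(values * probs)                 # 0.5
var = np.sum((values - mean) ** 2 * probs)    # expected squared difference: 0.25
print(mean, var)
```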
The units of variance are the square of the units of the original random variable. For
example, the variance of a height might be measured in cm². This is not very intuitive.
So, it's common to report the standard deviation instead: this is the square root of the
variance, and it is often written σ(X). The standard deviation has the same units as X
and E(X).
The standard deviation is a pretty good match for our intuitive idea of how far a random
variable tends to be from its mean. Its main flaw is that it is sensitive to outliers: that is, if
there is a small probability of encountering a measurement that is very far from the
mean, that will have an outsized effect on the standard deviation.
For example, let's make a low-probability change to our fair coin flip from above.
Suppose that X takes values 0 or 1 with probability 49.95% each; but with the remaining
probability of 0.1%, it takes value 100. The low-probability event causes the mean of X
to move somewhat away from 0.5, to around 0.6. But the variance is now around 10
instead of 1/4 — a much larger change than the mean.
This sort of sensitivity means that it can be difficult to estimate the variance accurately.
In the example above, if we take fewer than 1000 samples of X , we have a good chance
of not even seeing the large value, and thinking that X is an ordinary fair coin.
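A simulation sketch of this example (the sample sizes and seed are arbitrary): with a large sample the mean and variance come out near 0.6 and 10, while a small sample often misses the value 100 entirely and looks like a fair coin.

```python
import numpy as np

rng = np.random.default_rng(0)

# Modified coin: 0 or 1 with probability 49.95% each, 100 with probability 0.1%.
values = np.array([0.0, 1.0, 100.0])
probs = np.array([0.4995, 0.4995, 0.001])

big = rng.choice(values, size=1_000_000, p=probs)
print(big.mean(), big.var())      # roughly 0.6 and 10

small = rng.choice(values, size=500, p=probs)
print(small.mean(), small.var())  # often looks like an ordinary fair coin
```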
It's common to centralize a random variable by subtracting off its mean, so that the
result has mean (approximately) zero. After centralizing, it's also common to
(approximately) divide by the standard deviation, so that the standard deviation is
(approximately) 1. This is called standardizing or z-scoring the random variable.
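For a sample, z-scoring just means subtracting the sample mean and dividing by the sample standard deviation; a sketch (the distribution used here is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=10.0, scale=3.0, size=10_000)   # arbitrary mean and spread

centered = x - x.mean()          # centralize: subtract the sample mean
z = centered / x.std()           # standardize / z-score

print(z.mean(), z.std())         # approximately 0 and 1
```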
Both of these transformations are sometimes called normalizing the random variable;
this is a less-specific name that encompasses any processing that's intended to remove
some kind of idiosyncrasy in a random variable.
Moments
We can measure lots of additional properties of a random variable by looking at E(f (X))
for different functions f . Such expectations are called moments of X .
For example, if
Q(X) = 1 if X ∈ [3, 4], and 0 otherwise,
then E(Q(X)) = P(X ∈ [3, 4]) tells us how likely X is to land in the interval [3, 4]. Or, if
Q(X) = e^{itX}
for some value of t, then E(Q(X)) tells us whether X has a periodicity of length 2π/t:
this moment will be far from zero if there is some value x such that X tends to take
values close to x, x ± 2π/t, x ± 4π/t, …, and close to zero if not (i.e., if X is approximately
aperiodic).
The polynomials are a common and useful source of functions to use in defining
moments. The expectations of the monomials X, X², X³, … are common enough to
have special names: they are called the first, second, third, … moments.
We can recognize from the definitions that the mean is the first moment. It's common to
remove the mean (centralize the random variable) before taking second, third, and
higher moments, so that we get E((X − E(X))²), E((X − E(X))³), and so forth. These
are called central moments, and we can recognize from the definitions that the variance
is the second central moment.
The third central moment is called skew, and it measures the symmetry of a distribution:
if X has positive skew it means that large positive values of X are more likely than large
negative ones, while negative skew means the reverse. The fourth central moment is
called kurtosis, and it measures the balance between small and large values of X : if we
keep the variance fixed, a high kurtosis means that we put higher probability on very
large values of X while compensating by making small values smaller.
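A sketch of estimating these central moments from a sample; the exponential distribution is just an arbitrary right-skewed example, and the moments are left unnormalized, matching the definitions above:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(scale=1.0, size=100_000)   # a right-skewed distribution

centered = x - x.mean()
variance = np.mean(centered ** 2)   # second central moment
skew = np.mean(centered ** 3)       # third central moment: positive here
kurtosis = np.mean(centered ** 4)   # fourth central moment

print(variance, skew, kurtosis)
```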
High order polynomial moments are even more sensitive to outliers than the variance is.
That makes it even harder to estimate them from data. But, if we know lots of moments,
that tells us a lot about a random variable.
Exercise: what is the variance of a biased coin flip, in terms of the probability p of
seeing 1? What about skew or kurtosis?
The covariance of two random variables X and Y is defined as
Cov(X, Y) = E((X − E(X)) (Y − E(Y)))
If the covariance is positive, it means that positive values of X tend to co-occur with
positive values of Y , and negative values of X tend to co-occur with negative values of
Y . If the covariance is negative, it means the reverse: positive values of X are seen with
negative values of Y on average, and vice versa.
If we standardize X and Y , then their covariance is also called the correlation of X and
Y:
Corr(X, Y) = E( (X − E(X))/σ(X) · (Y − E(Y))/σ(Y) )
Correlation is always in the range [−1, 1]: a correlation of +1 means that Y is a linear
function of X with positive slope, while a correlation of −1 means that Y is a linear
function of X with negative slope.
Correlation and covariance are not the same; for example, covariance can be bigger
than +1 or smaller than −1. We haven't defined independence yet, but we will see later
that it is separate from both covariance and correlation. If two variables are
independent, they will have zero covariance and zero correlation, but the reverse is not
true.
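A sketch contrasting the two, using NumPy's built-in estimators (the variables here are an arbitrary noisy linear relationship):

```python
import numpy as np

rng = np.random.default_rng(0)

# Y is a noisy, scaled copy of X.
x = rng.normal(size=10_000)
y = 5.0 * x + rng.normal(size=10_000)

cov_xy = np.cov(x, y)[0, 1]         # covariance: can fall well outside [-1, 1]
corr_xy = np.corrcoef(x, y)[0, 1]   # correlation: always in [-1, 1]
print(cov_xy, corr_xy)              # roughly 5 and 0.98
```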
Exercise: give an example of two random variables that are perfectly dependent (i.e.,
one is a function of the other) but have zero correlation.
If X is a random variable that takes values in some vector space V , then we define E(X)
exactly as before:
E(X) = ∑ₓ x P(X = x)
If our vector space is Rn , this is equivalent to taking the mean componentwise. (This
follows from linearity of expectation.)
The variance of a vector-valued random variable X, also called its covariance matrix, is
defined as
V(X) = E((X − E(X)) (X − E(X))ᵀ)
or, expanding elementwise,
V(X)ᵢⱼ = E((Xᵢ − E(Xᵢ)) (Xⱼ − E(Xⱼ)))
By comparing this formula to the scalar case, we can see that the diagonal elements of
V (X) are the variances of the individual components of X , and the off-diagonal
elements are the covariances of pairs of components of X . We'll also sometimes refer to
V (X) as just the variance of X , with the understanding that V (X) is a scalar if X is a
scalar, and a matrix if X is a vector.
Note that the covariance matrix is always symmetric — we can see this from the above
expression, or by noting that the covariance of Xi and Xj doesn't depend on which way
we order them.
Also note that the covariance matrix is always positive semidefinite: from the definition
of variance and linearity of expectation, for any vector a we have
aᵀ V(X) a = E((aᵀ(X − E(X)))²) ≥ 0
If X takes values in some inner product space V instead of Rn , we define the variance to be a
linear operator on V : let Y = X − E(X) and set V (X) z = E(Y ⟨Y , z⟩) for each z ∈ V . This
reduces to the above definition if V = Rn .
The simplest possible covariance matrix is the identity, V (X) = I . This means that each
individual coordinate has variance 1, and any two coordinates are uncorrelated. We can
make a sample with variance I by sampling each coordinate independently and
standardizing. NumPy provides a convenient method for doing so:
numpy.random.randn(m, n) makes an m × n matrix of independent random variables
with mean zero and variance 1. If we take each column of this matrix as one of our
samples, we will have the variance matrix equal to I as desired.
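A sketch of this, checking the sample mean and covariance (the sample size is arbitrary):

```python
import numpy as np

m, n = 2, 100_000
X = np.random.randn(m, n)      # each column is one sample with mean 0, covariance I

print(X.mean(axis=1))          # close to the zero vector
print(np.cov(X))               # rows are coordinates; close to the 2x2 identity
```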
By looking at the plot, we can see some information about how the samples tend to vary.
This information shows up as well in the covariance matrix. In the sample shown above,
the covariance matrix is the identity, which results in a perfectly symmetric blob of
points around the origin.
In this section we're showing scatter plots for samples that have zero mean. If the mean were
nonzero they would look the same except for being shifted.
If V (X) is a diagonal matrix, then this sort of stretching or squashing of the individual
axes means that our data distribution will look basically like an ellipsoid, with the axes of
the ellipsoid pointing along the coordinate axes. The diameter of the ellipsoid in each
direction will be proportional to the standard deviation in that direction (the square root
of the variance). Here is a plot where the horizontal axis has variance 5 and the vertical
axis has variance 1.
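A sketch of generating such a sample: scale each coordinate of a variance-I sample by the desired standard deviation (√5 horizontally, 1 vertically):

```python
import numpy as np

rng = np.random.default_rng(0)

std_devs = np.sqrt(np.array([5.0, 1.0]))                # standard deviations per axis
X = std_devs[:, None] * rng.standard_normal((2, 10_000))

print(np.cov(X))     # approximately diag(5, 1)
```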
If we apply a fixed linear transformation A to X, the covariance matrix transforms as
V(AX) = A V(X) Aᵀ
We can prove this identity directly from the definitions.
Exercise: do so. (Hint: assume without loss of generality that E(X) = 0; then use the
definition of variance V(X) = E(XXᵀ).)
There are a lot of uses for this identity. For example, we can use it to generate samples
with any desired covariance matrix Σ, given a factorization of Σ. Say we have the
Cholesky factorization Σ = LLᵀ: we start by generating samples X with covariance I. If
we then transform them to Y = LX, we have
V(Y) = L V(X) Lᵀ = L I Lᵀ = Σ
as desired.
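A sketch of this recipe with NumPy's Cholesky routine; the target Σ below is an arbitrary symmetric positive definite matrix:

```python
import numpy as np

rng = np.random.default_rng(0)

Sigma = np.array([[4.0, 1.5],
                  [1.5, 1.0]])                # arbitrary target covariance
L = np.linalg.cholesky(Sigma)                 # Sigma = L @ L.T

X = rng.standard_normal((2, 100_000))         # samples with covariance I
Y = L @ X                                     # transformed samples

print(np.cov(Y))                              # close to Sigma
```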
SVD
Putting all of the above together, we can now describe what a scatter plot looks like for a
general covariance matrix. To do so, we'll use the singular value decomposition of V (X):
V(X) = U S Uᵀ
Here U is square and orthonormal, and S is diagonal and nonnegative. Note that we've
used the form of the SVD for symmetric positive semidefinite matrices, where we have
the same orthonormal matrix on both the left and the right.
Applying the identity V(AX) = A V(X) Aᵀ with A = Uᵀ gives
V(UᵀX) = Uᵀ V(X) U = (UᵀU) S (UᵀU) = S
That is, the covariance of Y = UᵀX is diagonal. That means we understand the shape of
the distribution of Y: its scatter plot will look like an axis-parallel ellipsoid, with the
diameter along axis i proportional to √Sᵢᵢ (the standard deviation along that axis).
So, the scatter plot of X will look like a rotated (and maybe reflected) version of the
scatter plot of Y: that is, it will be an ellipsoid, but not necessarily axis-parallel any more.
Each column Uᵢ will point along one axis of the ellipsoid, and the spread of the ellipsoid
along the direction Uᵢ will be proportional to √Sᵢᵢ.
The vectors Uᵢ are called the singular vectors of the covariance matrix. The variances Sᵢᵢ are
called the singular values. For a symmetric PSD matrix, the singular values and singular vectors
are also called the eigenvalues and eigenvectors — though eigenvectors and singular vectors are
different for general matrices. So you will often see a distribution described in terms of the
eigenvectors of its covariance matrix.
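A sketch of recovering the ellipsoid axes from data: estimate the covariance of a sample and take its SVD; the columns of U are the axis directions, and the square roots of the singular values give the spread along each axis (the sample below reuses the arbitrary Σ from the Cholesky sketch):

```python
import numpy as np

rng = np.random.default_rng(0)

# Samples with an arbitrary, non-diagonal covariance.
Sigma = np.array([[4.0, 1.5],
                  [1.5, 1.0]])
L = np.linalg.cholesky(Sigma)
X = L @ rng.standard_normal((2, 100_000))

U, S, _ = np.linalg.svd(np.cov(X))   # V(X) ≈ U @ diag(S) @ U.T

print(U)            # columns: axes of the ellipsoid
print(np.sqrt(S))   # spread along each axis
```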