Lab Plan 5: Statistics and Probability: Describing A Single Set of Data
num_friends = [100, 49, 41, 40, 25] # ... and lots more
For a small enough data set this might even be the best description. But for
a larger data set, this is unwieldy and probably opaque. (Imagine staring at
a list of 1 million numbers.) For that reason we use statistics to distill and
communicate relevant features of our data.
As a first approach, you put the friend counts into a histogram using Counter and plt.bar():

from collections import Counter
import matplotlib.pyplot as plt

friend_counts = Counter(num_friends)
xs = range(101)                       # largest value is 100
ys = [friend_counts[x] for x in xs]   # height is the # of people with that friend count
plt.bar(xs, ys)
plt.axis([0, 101, 0, 25])
plt.title("Histogram of Friend Counts")
plt.xlabel("# of friends")
plt.ylabel("# of people")
plt.show()
The largest and smallest values are just special cases of wanting to know the values in specific positions:

sorted_values = sorted(num_friends)
smallest_value = sorted_values[0]           # 1
second_smallest_value = sorted_values[1]    # 1
second_largest_value = sorted_values[-2]    # 49
Usually, we’ll want some notion of where our data is centered. Most
commonly we’ll use the mean (or average), which is just the sum of the data
divided by its count:
def mean(x):
    return sum(x) / len(x)

mean(num_friends)   # 7.333333
If you have two data points, the mean is simply the point halfway between
them. As you add more points, the mean shifts around, but it always
depends on the value of every point.
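For instance, with some made-up values:

mean([1, 2])       # 1.5 -- halfway between the two points
mean([1, 2, 9])    # 4.0 -- the third point pulls the mean toward it
mean([1, 2, 90])   # 31.0 -- changing any single value changes the mean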
We’ll also sometimes be interested in the median, which is the middle-most
value (if the number of data points is odd) or the average of the two
middle-most values (if the number of data points is even).
For instance, if we have five data points in a sorted vector x, the median is
x[5 // 2] or x[2]. If we have six data points, we want the average of x[2]
(the third point) and x[3] (the fourth point).
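For example, with two small made-up lists:

x = sorted([5, 3, 1, 2, 4])       # [1, 2, 3, 4, 5]; the median is x[5 // 2] == x[2] == 3
y = sorted([5, 3, 1, 6, 2, 4])    # [1, 2, 3, 4, 5, 6]; the median is (y[2] + y[3]) / 2 == 3.5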
Notice that — unlike the mean — the median doesn’t depend on every
value in your data. For example, if you make the largest point larger (or
the smallest point smaller), the middle points remain unchanged, which
means so does the median.
The median function is slightly more complicated than you might expect,
mostly because of the “even” case:
def median(v):
    """finds the 'middle-most' value of v"""
    n = len(v)
    sorted_v = sorted(v)
    midpoint = n // 2
    if n % 2 == 1:
        # if odd, return the middle value
        return sorted_v[midpoint]
    else:
        # if even, return the average of the two middle values
        lo = midpoint - 1
        hi = midpoint
        return (sorted_v[lo] + sorted_v[hi]) / 2

median(num_friends)   # 6.0
Clearly, the mean is simpler to compute, and it varies smoothly as our data
changes. If we have n data points and one of them increases by some small
amount e, then necessarily the mean will increase by e / n. (This makes the
mean amenable to all sorts of calculus tricks.) Whereas in order to find the
median, we have to sort our data. And changing one of our data points by a
small amount e might increase the median by e, by some number less than
e, or not at all (depending on the rest of the data).
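A small, made-up illustration of both behaviors:

data   = [1, 2, 3, 4, 5]    # n = 5, mean = 3.0, median = 3
bumped = [1, 2, 3, 4, 15]   # largest point increased by e = 10
mean(bumped)                # 5.0 -- the mean rose by exactly e / n = 2
median(bumped)              # 3 -- the median didn't move at all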
At the same time, the mean is very sensitive to outliers in our data. If our
friendliest user had 200 friends (instead of 100), then the mean would rise
to 7.82, while the median would stay the same. If outliers are likely to be
bad data (or otherwise unrepresentative of whatever phenomenon we’re
trying to understand), then the mean can sometimes give us a misleading
picture. For example, the story is often told that in the mid-1980s, the
major at the University of North Carolina with the highest average starting
salary was geography, mostly on account of NBA star (and outlier) Michael
Jordan.
A generalization of the median is the quantile, which represents the value
less than which a certain percentile of the data lies. (The median represents
the value less than which 50% of the data lies.)
def quantile(x, p):
    """returns the pth-percentile value in x"""
    p_index = int(p * len(x))
    return sorted(x)[p_index]
quantile(num_friends, 0.10) # 1
quantile(num_friends, 0.25) # 3
quantile(num_friends, 0.75) # 9
quantile(num_friends, 0.90) # 13
def mode(x):
    """returns a list, since there might be more than one mode"""
    counts = Counter(x)
    max_count = max(counts.values())
    return [x_i for x_i, count in counts.items()
            if count == max_count]

mode(num_friends)   # 1 and 6
Dispersion refers to measures of how spread out our data is. Typically
they’re statistics for which values near zero signify not spread out at all and
for which large values (whatever that means) signify very spread out. For
instance, a very simple measure is the range, which is just the difference
between the largest and smallest elements:
def data_range(x):
    return max(x) - min(x)

data_range(num_friends)   # 99
The range is zero precisely when the max and min are equal, which can only
happen if the elements of x are all the same, which means the data is as
undispersed as possible.
Conversely, if the range is large, then the max is much larger than the min and
the data is more spread out.
Like the median, the range doesn’t really depend on the whole data set. A data
set whose points are all either 0 or 100 has the same range as a data set
whose values are 0, 100, and lots of 50s. But it seems like the first data set
“should” be more spread out.
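To make that concrete with two made-up data sets:

a = [0, 100] * 50               # only 0s and 100s
b = [0] + [50] * 98 + [100]     # two extremes and lots of 50s
data_range(a)                   # 100
data_range(b)                   # 100 -- the same range, even though b is far less spread out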
A more complex measure of dispersion is the variance, which is computed as:
def de_mean(x):
    """translate x by subtracting its mean (so the result has mean 0)"""
    x_bar = mean(x)
    return [x_i - x_bar for x_i in x]

def sum_of_squares(v):
    return sum(v_i ** 2 for v_i in v)

def variance(x):
    """assumes x has at least two elements"""
    n = len(x)
    deviations = de_mean(x)
    return sum_of_squares(deviations) / (n - 1)

variance(num_friends)   # 81.54
Now, whatever units our data is in (e.g., “friends”), all of our measures of
central tendency are in that same unit. The range will similarly be in that
same unit. The variance, on the other hand, has units that are the square of
the original units (e.g., “friends squared”). As it can be hard to make sense
of these, we often look instead at the standard deviation:
import math

def standard_deviation(x):
    return math.sqrt(variance(x))

standard_deviation(num_friends)   # 9.03
Both the range and the standard deviation have the same outlier problem
that we saw
earlier for the mean. Using the same example, if our friendliest user had
instead 200 friends, the standard deviation would be 14.89, more than 60%
higher!
A more robust alternative computes the difference between the 75th
percentile value and the 25th percentile value:
def interquartile_range(x):
    return quantile(x, 0.75) - quantile(x, 0.25)

interquartile_range(num_friends)   # 6
Covariance measures how two data sets vary in tandem from their means; here the second data set is daily_minutes, the number of minutes each user spends on the site per day:

def covariance(x, y):
    n = len(x)
    return sum(x_i * y_i
               for x_i, y_i in zip(de_mean(x), de_mean(y))) / (n - 1)
If each user had twice as many friends (but the same number of
minutes), the covariance would be twice as large. But in a sense the
variables would be just as interrelated. Said differently, it’s hard to say
what counts as a “large” covariance.
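A quick made-up example of that scaling (both lists invented for illustration):

x = [1, 2, 3, 4, 5]
y = [10, 11, 13, 14, 17]
covariance(x, y)                         # 4.25
covariance([2 * x_i for x_i in x], y)    # 8.5 -- exactly twice as large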
For this reason, it’s more common to look at the correlation, which
divides out the standard deviations of both variables:
def correlation(x, y):
    stdev_x = standard_deviation(x)
    stdev_y = standard_deviation(y)
    if stdev_x > 0 and stdev_y > 0:
        return covariance(x, y) / stdev_x / stdev_y
    else:
        return 0    # if there's no variation, correlation is zero
The person with 100 friends (who spends only one minute per day on the
site) is a huge outlier, and correlation can be very sensitive to outliers. What
happens if we ignore him?
outlier = num_friends.index(100)    # index of the outlier

num_friends_good = [x
                    for i, x in enumerate(num_friends)
                    if i != outlier]

daily_minutes_good = [x
                      for i, x in enumerate(daily_minutes)
                      if i != outlier]

correlation(num_friends_good, daily_minutes_good)   # 0.57
You investigate further and discover that the outlier was actually an
internal test account that no one ever bothered to remove. So you feel
pretty justified in excluding it.
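The plots and the binomial comparison below rely on normal_pdf and normal_cdf, which this excerpt never defines. A minimal sketch, assuming the usual Gaussian density and its erf-based cdf, is:

import math

SQRT_TWO_PI = math.sqrt(2 * math.pi)

def normal_pdf(x, mu=0, sigma=1):
    """density of a Normal(mu, sigma) at x"""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (SQRT_TWO_PI * sigma)

def normal_cdf(x, mu=0, sigma=1):
    """probability that a Normal(mu, sigma) is at most x"""
    return (1 + math.erf((x - mu) / math.sqrt(2) / sigma)) / 2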
In Figure 6-2, we plot some of these pdfs to see what they look like:
xs = [x / 10.0 for x in range(-50, 50)]    # x values from -5 to 5 in steps of 0.1
plt.plot(xs, [normal_pdf(x, sigma=0.5) for x in xs], ':', label='mu=0, sigma=0.5')
plt.plot(xs, [normal_pdf(x, mu=-1) for x in xs], '-.', label='mu=-1, sigma=1')
plt.legend()
plt.title("Various Normal pdfs")
plt.show()
Figure 6-2. Various normal pdfs
plt.title("Various Normal
cdfs") plt.show()
One reason the normal distribution is so useful is the central limit theorem,
which says (in essence) that a random variable defined as the average of a large
number of independent and identically distributed random variables is itself
approximately normally distributed.
In particular, if x_1, ..., x_n are random variables with mean μ and standard deviation σ, and if n is large, then (x_1 + ... + x_n) / n is approximately normally distributed with mean μ and standard deviation σ / √n. An easy way to see this is with binomial random variables: a Binomial(n, p) variable is the sum of n independent Bernoulli(p) trials, each equal to 1 with probability p and 0 with probability 1 - p, so it is approximately Normal with mean μ = np and standard deviation σ = √(np(1 - p)). The code below draws many Binomial(n, p) samples and plots their histogram alongside this normal approximation:
import random

def bernoulli_trial(p):
    return 1 if random.random() < p else 0

def binomial(n, p):
    return sum(bernoulli_trial(p) for _ in range(n))

def make_hist(p, n, num_points):
    data = [binomial(n, p) for _ in range(num_points)]
    # bar chart of the actual binomial samples
    histogram = Counter(data)
    plt.bar([x - 0.4 for x in histogram.keys()],
            [v / num_points for v in histogram.values()],
            0.8, color='0.75')
    # line chart of the normal approximation to each integer's probability
    mu = p * n
    sigma = math.sqrt(n * p * (1 - p))
    xs = range(min(data), max(data) + 1)
    ys = [normal_cdf(i + 0.5, mu, sigma) - normal_cdf(i - 0.5, mu, sigma)
          for i in xs]
    plt.plot(xs, ys)
    plt.show()
The moral of this approximation is that if you want to know the probability
that (say) a fair coin turns up more than 60 heads in 100 flips, you can
estimate it as the probability that a Normal(50,5) is greater than 60, which is
easier than computing the Binomial(100,0.5) cdf. (Although in most
applications you’d probably be using statistical software that would gladly
compute whatever probabilities you want.)
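As a quick check of that estimate, using the normal_cdf sketch from above:

1 - normal_cdf(60, mu=50, sigma=5)    # about 0.023 -- the chance a Normal(50, 5) exceeds 60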
Exercises
1. Write a program to compute the mean, standard deviation, and variance of every column in data.csv.
2. Write a NumPy program to compute the covariance matrix of any two arrays from data.csv.
3. Write a NumPy program to compute the 80th percentile of the Average Pulse column from data.csv.
4. Write a NumPy program to compute the median of the given array after flattening it.
   Original array: [[ 0 1 2 3 4 5], [ 6 7 8 9 10 11]]
5. Plot the normal distribution of 1,000 random values using seaborn.