
Lab Plan 5: Statistics and Probability

Describing a Single Set of Data

Through a combination of word-of-mouth and luck, DataSciencester has


grown to dozens of members, and the VP of Fundraising asks you for some
sort of description of how many friends your members have that he can
include in his elevator pitches.
Using techniques from Chapter 1, you are easily able to produce this data.
But now you are faced with the problem of how to describe it.
One obvious description of any data set is simply the data itself:

num_friends = [100, 49, 41, 40, 25] # ... and lots more

For a small enough data set this might even be the best description. But for
a larger data set, this is unwieldy and probably opaque. (Imagine staring at
a list of 1 million numbers.) For that reason we use statistics to distill and
communicate relevant features of our data.
As a first approach you put the friend counts into a histogram using Counter
and plt.bar():

from collections import Counter
import matplotlib.pyplot as plt

friend_counts = Counter(num_friends)
xs = range(101)                         # largest value is 100
ys = [friend_counts[x] for x in xs]     # height is just # of people with that many friends
plt.bar(xs, ys)
plt.axis([0, 101, 0, 25])
plt.title("Histogram of Friend Counts")
plt.xlabel("# of friends")
plt.ylabel("# of people")
plt.show()

Figure 5-1. A histogram of friend counts

Unfortunately, this chart is still too difficult to slip into conversations. So


you start generating some statistics. Probably the simplest statistic is simply
the number of data points:
num_points = len(num_friends) # 204

You’re probably also interested in the largest and smallest values:

largest_value = max(num_friends) # 100


smallest_value = min(num_friends) # 1

which are just special cases of wanting to know the values in specific
positions:

sorted_values = sorted(num_friends)
smallest_value = sorted_values[0]           # 1
second_smallest_value = sorted_values[1]    # 1
second_largest_value = sorted_values[-2]    # 49

But we’re only getting started.


Central Tendencies

Usually, we’ll want some notion of where our data is centered. Most
commonly we’ll use the mean (or average), which is just the sum of the data
divided by its count:

# this isn't right if you don't from __future__ import division
def mean(x):
    return sum(x) / len(x)

mean(num_friends)   # 7.333333

If you have two data points, the mean is simply the point halfway between
them. As you add more points, the mean shifts around, but it always
depends on the value of every point.
We’ll also sometimes be interested in the median, which is the middle-most
value (if the number of data points is odd) or the average of the two
middle-most values (if the number of data points is even).
For instance, if we have five data points in a sorted vector x, the median is
x[5 // 2] or x[2]. If we have six data points, we want the average of x[2]
(the third point) and x[3] (the fourth point).
Notice that — unlike the mean — the median doesn’t depend on every
value in your data. For example, if you make the largest point larger (or
the smallest point smaller), the middle points remain unchanged, which
means so does the median.
The median function is slightly more complicated than you might expect,
mostly because of the “even” case:

def median(v):
    """finds the 'middle-most' value of v"""
    n = len(v)
    sorted_v = sorted(v)
    midpoint = n // 2

    if n % 2 == 1:
        # if odd, return the middle value
        return sorted_v[midpoint]
    else:
        # if even, return the average of the two middle values
        lo = midpoint - 1
        hi = midpoint
        return (sorted_v[lo] + sorted_v[hi]) / 2

median(num_friends)   # 6.0

Clearly, the mean is simpler to compute, and it varies smoothly as our data
changes. If we have n data points and one of them increases by some small
amount e, then necessarily the mean will increase by e / n. (This makes the
mean amenable to all sorts of calculus tricks.) Whereas in order to find the
median, we have to sort our data. And changing one of our data points by a
small amount e might increase the median by e, by some number less than
e, or not at all (depending on the rest of the data).

At the same time, the mean is very sensitive to outliers in our data. If our
friendliest user had 200 friends (instead of 100), then the mean would rise
to 7.82, while the median would stay the same. If outliers are likely to be
bad data (or otherwise unrepresentative of whatever phenomenon we’re
trying to understand), then the mean can sometimes give us a misleading
picture. For example, the story is often told that in the mid-1980s, the
major at the University of North Carolina with the highest average starting
salary was geography, mostly on account of NBA star (and outlier) Michael
Jordan.
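To see this concretely, here's a quick sketch on a small made-up sample (not the actual num_friends data), using only the mean and median functions defined above:

sample = [1, 2, 2, 5, 100]    # hypothetical data with one large outlier
mean(sample)                  # 22.0 -- pulled far to the right by the outlier
median(sample)                # 2    -- ignores how extreme the outlier is

sample = [1, 2, 2, 5, 200]    # make the outlier even larger
mean(sample)                  # 42.0 -- the mean moves a lot
median(sample)                # 2    -- the median doesn't move at all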
A generalization of the median is the quantile, which represents the value
less than which a certain percentile of the data lies. (The median represents
the value less than which 50% of the data lies.)
def quantile(x, p):
    """returns the pth-percentile value in x"""
    p_index = int(p * len(x))
    return sorted(x)[p_index]

quantile(num_friends, 0.10)   # 1
quantile(num_friends, 0.25)   # 3
quantile(num_friends, 0.75)   # 9
quantile(num_friends, 0.90)   # 13

Less commonly you might want to look at the mode, or most-common value[s]:

def mode(x):
    """returns a list, might be more than one mode"""
    counts = Counter(x)
    max_count = max(counts.values())
    return [x_i for x_i, count in counts.items()
            if count == max_count]

mode(num_friends)   # 1 and 6

But most frequently we’ll just use the mean.


Dispersion

Dispersion refers to measures of how spread out our data is. Typically
they’re statistics for which values near zero signify not spread out at all and
for which large values (whatever that means) signify very spread out. For
instance, a very simple measure is the range, which is just the difference
between the largest and smallest elements:

# "range" already means something in Python, so we'll use a different name

def data_range(x):

return max(x) - min(x)

data_range(num_friends) # 99

The range is zero precisely when the max and min are equal, which can only
happen if the elements of x are all the same, which means the data is as
undispersed as possible.
Conversely, if the range is large, then the max is much larger than the min and
the data is more spread out.
Like the median, the range doesn’t really depend on the whole data set. A data
set whose points are all either 0 or 100 has the same range as a data set
whose values are 0, 100, and lots of 50s. But it seems like the first data set
“should” be more spread out.
A more complex measure of dispersion is the variance, which is computed as:

def sum_of_squares(v):
    """helper used below: the sum of squared elements of v"""
    return sum(v_i ** 2 for v_i in v)

def de_mean(x):
    """translate x by subtracting its mean (so the result has mean 0)"""
    x_bar = mean(x)
    return [x_i - x_bar for x_i in x]

def variance(x):
    """assumes x has at least two elements"""
    n = len(x)
    deviations = de_mean(x)
    return sum_of_squares(deviations) / (n - 1)

variance(num_friends)   # 81.54
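To see why the variance picks up spread that the range misses, here's a quick sketch with made-up data, echoing the 0-and-100 example above:

extremes = [0, 100, 0, 100, 0, 100]         # hypothetical: every value at an extreme
mostly_middle = [0, 50, 50, 50, 50, 100]    # hypothetical: same range, bunched in the middle

data_range(extremes)        # 100
data_range(mostly_middle)   # 100    -- the range can't tell them apart
variance(extremes)          # 3000.0
variance(mostly_middle)     # 1000.0 -- the variance can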

Now, whatever units our data is in (e.g., “friends”), all of our measures of
central tendency are in that same unit. The range will similarly be in that
same unit. The variance, on the other hand, has units that are the square of
the original units (e.g., “friends squared”). As it can be hard to make sense
of these, we often look instead at the standard deviation:

import math

def standard_deviation(x):
    return math.sqrt(variance(x))

standard_deviation(num_friends)   # 9.03

Both the range and the standard deviation have the same outlier problem
that we saw
earlier for the mean. Using the same example, if our friendliest user had
instead 200 friends, the standard deviation would be 14.89, more than 60%
higher!
A more robust alternative computes the difference between the 75th
percentile value and the 25th percentile value:

def interquartile_range(x):
    return quantile(x, 0.75) - quantile(x, 0.25)

interquartile_range(num_friends)   # 6

which is quite plainly unaffected by a small number of outliers.
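For instance, with a made-up sample containing one huge outlier (not the DataSciencester data), the range is dominated by the outlier while the interquartile range barely notices it:

sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]   # hypothetical: one huge outlier

data_range(sample)            # 99 -- driven entirely by the outlier
interquartile_range(sample)   # 5  -- quantile(sample, 0.75) - quantile(sample, 0.25) = 8 - 3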


Correlation
DataSciencester’s VP of Growth has a theory that the amount of time people
spend on the site is related to the number of friends they have on the site
(she’s not a VP for nothing), and she’s asked you to verify this.
After digging through traffic logs, you’ve come up with a list daily_minutes
that shows how many minutes per day each user spends on
DataSciencester, and you’ve ordered it so that its elements correspond to
the elements of our previous num_friends list. We’d like to investigate the
relationship between these two metrics.
We’ll first look at covariance, the paired analogue of variance. Whereas
variance measures how a single variable deviates from its mean,
covariance measures how two variables vary in tandem from their means:

def covariance(x, y):
    n = len(x)
    return dot(de_mean(x), de_mean(y)) / (n - 1)

covariance(num_friends, daily_minutes)   # 22.43

Recall that dot sums up the products of corresponding pairs of elements.
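For reference, a minimal dot consistent with that description might look like this:

def dot(v, w):
    """sums up the products of corresponding pairs of elements"""
    return sum(v_i * w_i for v_i, w_i in zip(v, w))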


When corresponding elements of x and y are either both above their means
or both below their means, a positive number enters the sum. When one is
above its mean and the other below, a negative number enters the sum.
Accordingly, a “large” positive covariance means that x tends to be large
when y is large and small when y is small. A “large” negative covariance
means the opposite — that x tends to be small when y is large and vice
versa. A covariance close to zero means that no such relationship exists.
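A tiny example with made-up lists illustrates the sign behavior:

x = [1, 2, 3, 4, 5]                  # hypothetical toy data
moves_with_x = [10, 20, 30, 40, 50]
moves_against_x = [50, 40, 30, 20, 10]

covariance(x, moves_with_x)      # 25.0  -- positive: both above or both below their means together
covariance(x, moves_against_x)   # -25.0 -- negative: one above its mean when the other is below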
Nonetheless, this number can be hard to interpret, for a couple of reasons:

- Its units are the product of the inputs' units (e.g., friend-minutes-per-day),
  which can be hard to make sense of. (What's a "friend-minute-per-day"?)

- If each user had twice as many friends (but the same number of minutes),
  the covariance would be twice as large. But in a sense the variables would
  be just as interrelated. Said differently, it's hard to say what counts as
  a "large" covariance.

For this reason, it’s more common to look at the correlation, which
divides out the standard deviations of both variables:

def correlation(x, y):
    stdev_x = standard_deviation(x)
    stdev_y = standard_deviation(y)
    if stdev_x > 0 and stdev_y > 0:
        return covariance(x, y) / stdev_x / stdev_y
    else:
        return 0    # if no variation, correlation is zero

correlation(num_friends, daily_minutes)   # 0.25

The correlation is unitless and always lies between -1 (perfect anti-


correlation) and 1 (perfect correlation). A number like 0.25 represents a
relatively weak positive correlation.
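To see what dividing out the standard deviations buys us, here's a quick sketch with made-up lists (not the DataSciencester data): doubling one variable doubles the covariance but leaves the correlation unchanged.

x = [1, 2, 3, 4, 5]                  # hypothetical
y = [2, 1, 4, 3, 5]                  # hypothetical
x_doubled = [2 * x_i for x_i in x]

covariance(x, y)            # 2.0
covariance(x_doubled, y)    # 4.0 -- twice as large, same relationship
correlation(x, y)           # 0.8
correlation(x_doubled, y)   # 0.8 -- unchanged by rescaling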
However, one thing we neglected to do was examine our data. Check out
Figure 5-2.
Figure 5-2. Correlation with an outlier

The person with 100 friends (who spends only one minute per day on the
site) is a huge outlier, and correlation can be very sensitive to outliers. What
happens if we ignore him?

outlier = num_friends.index(100)    # index of outlier

num_friends_good = [x
                    for i, x in enumerate(num_friends)
                    if i != outlier]

daily_minutes_good = [x
                      for i, x in enumerate(daily_minutes)
                      if i != outlier]

correlation(num_friends_good, daily_minutes_good)   # 0.57

Without the outlier, there is a much stronger correlation (Figure 5-3).

Figure 5-3. Correlation after removing the outlier

You investigate further and discover that the outlier was actually an
internal test account that no one ever bothered to remove. So you feel
pretty justified in excluding it.

The Normal Distribution

The normal distribution is the king of distributions. It is the classic


bell curve–shaped distribution and is completely determined by two
parameters: its mean (mu) and its standard deviation (sigma). The
mean indicates where the bell is centered, and the standard deviation
how “wide” it is.
It has the probability density function:

\[ f(x \mid \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left(-\frac{(x - \mu)^2}{2\sigma^2}\right) \]

which we can implement as:

def normal_pdf(x, mu=0, sigma=1):
    sqrt_two_pi = math.sqrt(2 * math.pi)
    return (math.exp(-(x - mu) ** 2 / 2 / sigma ** 2) / (sqrt_two_pi * sigma))

In Figure 6-2, we plot some of these pdfs to see what they look like:

xs = [x / 10.0 for x in range(-50, 50)]
plt.plot(xs, [normal_pdf(x, sigma=1)   for x in xs], '-',  label='mu=0,sigma=1')
plt.plot(xs, [normal_pdf(x, sigma=2)   for x in xs], '--', label='mu=0,sigma=2')
plt.plot(xs, [normal_pdf(x, sigma=0.5) for x in xs], ':',  label='mu=0,sigma=0.5')
plt.plot(xs, [normal_pdf(x, mu=-1)     for x in xs], '-.', label='mu=-1,sigma=1')
plt.legend()
plt.title("Various Normal pdfs")
plt.show()
Figure 6-2. Various normal pdfs

When mu = 0 and sigma = 1, it's called the standard normal distribution. If Z
is a standard normal random variable, then it turns out that:

\[ X = \sigma Z + \mu \]

is also normal, but with mean mu and standard deviation sigma. Conversely, if
X is a normal random variable with mean mu and standard deviation sigma, then:

\[ Z = \frac{X - \mu}{\sigma} \]

is a standard normal variable.
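As a quick sketch of this rescaling with made-up numbers:

mu, sigma = 10, 2
z = 1.5                       # a hypothetical standard normal draw
x = sigma * z + mu            # 13.0 -- normal with mean 10 and stdev 2
z_again = (x - mu) / sigma    # 1.5  -- back to a standard normal value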

The cumulative distribution function for the normal distribution cannot be


written in an “elementary” manner, but we can write it using Python’s
math.erf:

def normal_cdf(x, mu=0, sigma=1):
    return (1 + math.erf((x - mu) / math.sqrt(2) / sigma)) / 2
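A few familiar values make a handy sanity check (approximate, of course):

normal_cdf(0)                  # 0.5    -- half the probability lies below the mean
normal_cdf(1.96)               # ~0.975 -- about 97.5% of a standard normal lies below 1.96
normal_cdf(1, mu=0, sigma=2)   # ~0.69  -- same as normal_cdf(0.5)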

Again, in Figure 6-3, we plot a few:

xs = [x / 10.0 for x in range(-50, 50)]
plt.plot(xs, [normal_cdf(x, sigma=1)   for x in xs], '-',  label='mu=0,sigma=1')
plt.plot(xs, [normal_cdf(x, sigma=2)   for x in xs], '--', label='mu=0,sigma=2')
plt.plot(xs, [normal_cdf(x, sigma=0.5) for x in xs], ':',  label='mu=0,sigma=0.5')
plt.plot(xs, [normal_cdf(x, mu=-1)     for x in xs], '-.', label='mu=-1,sigma=1')
plt.legend(loc=4)    # bottom right
plt.title("Various Normal cdfs")
plt.show()

Figure 6-3. Various normal cdfs

Sometimes we’ll need to invert normal_cdf to find the value corresponding to


a specified probability. There’s no simple way to compute its inverse, but
normal_cdf is continuous and strictly increasing, so we can use a binary
search:
def inverse_normal_cdf(p, mu=0, sigma=1, tolerance=0.00001):
    """find approximate inverse using binary search"""

    # if not standard, compute standard and rescale
    if mu != 0 or sigma != 1:
        return mu + sigma * inverse_normal_cdf(p, tolerance=tolerance)

    low_z, low_p = -10.0, 0    # normal_cdf(-10) is (very close to) 0
    hi_z, hi_p = 10.0, 1       # normal_cdf(10)  is (very close to) 1
    while hi_z - low_z > tolerance:
        mid_z = (low_z + hi_z) / 2    # consider the midpoint
        mid_p = normal_cdf(mid_z)     # and the cdf's value there
        if mid_p < p:
            # midpoint is still too low, search above it
            low_z, low_p = mid_z, mid_p
        elif mid_p > p:
            # midpoint is still too high, search below it
            hi_z, hi_p = mid_z, mid_p
        else:
            break

    return mid_z

The function repeatedly bisects intervals until it narrows in on a Z that’s close


enough to the desired probability.
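A quick sanity check (approximate values):

inverse_normal_cdf(0.5)                     # ~0.0   -- the median of a standard normal
inverse_normal_cdf(0.975)                   # ~1.96
inverse_normal_cdf(0.975, mu=10, sigma=2)   # ~13.92 -- rescaled: 10 + 2 * 1.96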

The Central Limit Theorem

One reason the normal distribution is so useful is the central limit theorem,
which says (in essence) that a random variable defined as the average of a large
number of independent and identically distributed random variables is itself
approximately normally distributed.
In particular, if x_1, ..., x_n are random variables with mean mu and standard
deviation sigma, and if n is large, then

\[ \frac{1}{n}\,(x_1 + \cdots + x_n) \]

is approximately normally distributed with mean mu and standard deviation
sigma / sqrt(n). Equivalently (but often more usefully),

\[ \frac{(x_1 + \cdots + x_n) - \mu n}{\sigma \sqrt{n}} \]

is approximately normally distributed with mean 0 and standard deviation 1.


An easy way to illustrate this is by looking at binomial random variables,
which have two parameters n and p. A Binomial(n,p) random variable is simply
the sum of n independent Bernoulli(p) random variables, each of which equals
1 with probability p and 0 with probability 1 - p:

import random

def bernoulli_trial(p):
    return 1 if random.random() < p else 0

def binomial(n, p):
    return sum(bernoulli_trial(p) for _ in range(n))

The mean of a Bernoulli(p) variable is p, and its standard deviation is
sqrt(p * (1 - p)). The central limit theorem says that as n gets large, a
Binomial(n,p) variable is approximately a normal random variable with mean
mu = n * p and standard deviation sigma = sqrt(n * p * (1 - p)). If we plot
both, you can easily see the resemblance:

def make_hist(p, n, num_points):

    data = [binomial(n, p) for _ in range(num_points)]

    # use a bar chart to show the actual binomial samples
    histogram = Counter(data)
    plt.bar([x - 0.4 for x in histogram.keys()],
            [v / num_points for v in histogram.values()],
            0.8,
            color='0.75')

    mu = p * n
    sigma = math.sqrt(n * p * (1 - p))

    # use a line chart to show the normal approximation
    xs = range(min(data), max(data) + 1)
    ys = [normal_cdf(i + 0.5, mu, sigma) - normal_cdf(i - 0.5, mu, sigma)
          for i in xs]
    plt.plot(xs, ys)

    plt.title("Binomial Distribution vs. Normal Approximation")
    plt.show()
For example, when you call make_hist(0.75, 100, 10000), you get
the graph in Figure 6-4.
Figure 6-4. The output from make_hist

The moral of this approximation is that if you want to know the probability
that (say) a fair coin turns up more than 60 heads in 100 flips, you can
estimate it as the probability that a Normal(50,5) is greater than 60, which is
easier than computing the Binomial(100,0.5) cdf. (Although in most
applications you’d probably be using statistical software that would gladly
compute whatever probabilities you want.)
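As a rough sketch of that estimate, using the normal_cdf we defined earlier:

mu = 100 * 0.5                       # 50:  mean of Binomial(100, 0.5)
sigma = math.sqrt(100 * 0.5 * 0.5)   # 5.0: its standard deviation
1 - normal_cdf(60, mu, sigma)        # ~0.023 -- estimated P(more than 60 heads)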
Exercises

1. Write a program to compute the mean, standard deviation, and variance of all the
   columns from data.csv.
2. Write a NumPy program to compute the covariance matrix of any two arrays from
   data.csv.
3. Write a NumPy program to compute the 80th percentile for all elements of
   Average Pulse from data.csv.
4. Write a NumPy program to compute the median of the flattened given array.
   Original array: [[ 0 1 2 3 4 5], [ 6 7 8 9 10 11]]
5. Plot the normal distribution of 1000 random values using seaborn.
