
Nanodegree

Uploaded by

Eric Djagam
Copyright
© © All Rights Reserved

Lesson Overview

In this lesson, we will continue to cover more topics related to analyzing
quantitative variables, and you will learn to use **measures of spread**.
Measures of spread give us an idea of how spread out our data are from
one another.

In this lesson you will:

- Evaluate measures of spread
  - Range
  - Interquartile Range (IQR)
  - Standard Deviation
  - Variance
- Analyze outliers
- Evaluate descriptive and inferential statistics

Throughout this lesson, you will learn how to calculate these, as well as
why we would use one measure of spread over another.

Histograms

Histograms are super useful for understanding the different aspects of data and they are the
most common visual used for quantitative data. In the upcoming concepts, you will see
histograms used all the time to help you understand the four aspects we outlined earlier
regarding a quantitative variable:

- center
- spread
- shape
- outliers

How are Histograms Constructed?


First, we need to bin our data. Each bin represents a range of values in a dataset. The number
of values that fall in the range of each bin determines the height of each histogram bar. As
shown in the video above, changing the range of our bins can result in slightly different
visuals. However, there is no right or wrong answer in choosing how to bin, and in most
cases, the software you use will choose the appropriate bins for you.
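To make the binning step concrete, here is a minimal sketch that counts values into equal-width bins and prints a text histogram at two different bin widths. The data values, the `bin_counts` helper, and the choice of half-open bins starting at 0 are all assumptions made for illustration; they are not taken from the video.

```python
from collections import Counter

def bin_counts(values, bin_width, start=0):
    """Count how many values fall into each half-open bin [left, left + bin_width)."""
    counts = Counter()
    for v in values:
        index = int((v - start) // bin_width)   # which bin this value lands in
        counts[start + index * bin_width] += 1  # key each bin by its left edge
    return dict(sorted(counts.items()))

# Made-up data for illustration only.
dogs = [5, 8, 10, 11, 12, 13, 13, 13, 14, 15, 16, 18, 21]

for width in (2, 5):
    print(f"bin width = {width}")
    for left, count in bin_counts(dogs, width).items():
        # Each '#' stands for one value in the bin; empty bins are not printed.
        print(f"  [{left:>2}, {left + width:>2}): {'#' * count}")
```

Running this with both widths shows the same data producing differently shaped bars, which is the effect described above.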

Weekdays vs. Weekends: What is the Difference

Weekdays vs. Weekends


The two histograms below illustrate the number of dogs Josh saw on weekdays versus
weekends. The measures of center for both histograms (mean, median, mode) are basically
the same and centered about the highest bin for both histograms, 13.

Visually, the difference between the histograms is the range or spread of dogs Josh sees
during each time period. In the upcoming lessons, we will discuss the most common ways
to measure the spread of our data.


Introduction to Five Number Summary

Five Number Summary


Calculating the 5 Number Summary
The five-number summary consists of 5 values:

1. Minimum: The smallest number in the dataset.
2. Q1: The value such that 25% of the data fall below.
3. Q2: The value such that 50% of the data fall below.
4. Q3: The value such that 75% of the data fall below.
5. Maximum: The largest value in the dataset.

In the above video, we saw that calculating each of these values was essentially just finding
the median of a bunch of different datasets. Because we are essentially calculating a bunch of
medians, the calculation depends on whether we have an odd or even number of values.
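The "bunch of medians" idea can be sketched in a few lines. This is one possible implementation, assuming the convention that the overall median is left out of both halves when the number of values is odd; the function name `five_number_summary` is made up for this example, and it expects at least two values.

```python
from statistics import median

def five_number_summary(values):
    """Return (min, Q1, Q2, Q3, max) using the median-of-halves method."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]  # values below the median position
    # For an odd count, skip the middle value so it is in neither half.
    upper = data[half:] if n % 2 == 0 else data[half + 1:]
    return (data[0], median(lower), median(data), median(upper), data[-1])

# Made-up datasets, one with an even count and one with an odd count.
print(five_number_summary([2, 4, 6, 8, 10, 12]))
print(five_number_summary([2, 4, 6, 8, 10]))
```

Other software may use a different quartile convention and report slightly different Q1 and Q3 values for the same data.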

Range
The range is then calculated as the difference between the maximum and the minimum.

IQR
The interquartile range is calculated as the difference between Q3 and Q1.
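
As a small worked example (the dataset is made up for illustration and does not come from the lesson), take the values 2, 4, 6, 8, 10, 12. Splitting the sorted data into halves and taking medians gives min = 2, Q1 = 4, Q2 = 7, Q3 = 10, max = 12, so:

$$ \text{Range} = \text{max} - \text{min} = 12 - 2 = 10 $$

$$ \text{IQR} = Q_3 - Q_1 = 10 - 4 = 6 $$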

In the upcoming sections, you will practice this with Katie and on your own.


Quiz: 5 Number Summary Practice

Do you know your 5 Number Summary?


Quiz Question
Identify the following for this dataset:

1, 5, 10, 3, 8, 12, 4, 1, 2, 8

Answer choices: 10, 9, 11, 8, 2, 5, 4.5

Match a number to each item:

- Range
- First Quartile
- Third Quartile
- Median

Quiz Question
Identify the following for this dataset:

5, 10, 3, 8, 12, 4, 1, 2, 8

Answer choices: 5, 4.5, 9, 9.5, 2.5, 11, 10

Match a number to each item:

- Range
- First Quartile
- Third Quartile
- Median


What if We Only Want One Number?



Looking back at the histograms Josh created for the number of dogs he recorded seeing on
weekdays and weekends, we can use the histograms to mark the values of the 5 number
summary and create a box plot.

- Box plots are useful for quickly comparing the spread of two data sets across some
  key metrics, like quartiles, maximum, and minimum.

How do we create the box plot?

1. The beginning of the line to the left of the box and the end of the line to the right of
the box represent the minimum and maximum values in a dataset.
2. The visual distance between these markings is an indication of the range of the values.
3. The box itself represents the IQR. The box begins at the Q1 value, ends at the Q3
value, and Q2, or the median, is represented by a line within the box.
From both the histograms and box plots, we can see that the number of dogs seen on
weekends varies much more than on weekdays.

However, instead of depending on a visual of the 5 number summary to compare our data, in
the next lesson, we will learn about using a single value to compare the spread of the two
distributions: the standard deviation.


Introduction to Standard Deviation and Variance

Standard Deviation and Variance


The standard deviation is one of the most common measures for talking about the spread of
data. It can be thought of as the typical distance of each observation from the mean; more
precisely, it is the square root of the average squared distance from the mean.

In the above video, we saw this as how far individuals were from the average distance from
work. (The distances shown are examples taken from the full dataset. The mean of just those
4 numbers is 38.5, while the mean of 18 shown later in the video is the mean of the full
dataset, which is not shown in the video.) In the next video, you will see exactly how this is
calculated.



Standard Deviation Calculation



Note: at 2:00, the 4 in (14-10)^2 = 4 = 16 should be squared. So it should be (14-10)^2 = 4^2 = 16

How to Calculate Standard Deviation


Dataset = 10, 14, 10, 6

1. Calculate the mean

$$ \left(\sum\limits_{i = 1}^{4} x_i\right)/n $$ = 40/4 = 10

2. Calculate the distance of each observation from the mean and square the value

| $$ x_i - \overline{x} $$ | $$ (x_i - \overline{x})^2 $$ |
| --- | --- |
| 10 - 10 | 0 |
| 14 - 10 | 16 |
| 10 - 10 | 0 |
| 6 - 10 | 16 |

3. Calculate the **variance**, the average squared difference of each observation from the mean

$$ \frac{1}{n} \sum\limits_{i = 1}^n (x_i - \overline{x})^2 $$ = (0+16+0+16)/4 = 8

4. Calculate the **standard deviation**, the square root of the variance

$$\sqrt{\frac{1}{n} \sum\limits_{i = 1}^n (x_i - \overline{x})^2} $$ = $$ \sqrt{8} $$ ≈ 2.83

The standard deviation is, on average, how far each point in our dataset is from the mean.
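
The same steps can be checked with a short script. This is a minimal sketch that follows the calculation above on the dataset 10, 14, 10, 6, dividing by n as in the lesson's formula; the variable names are chosen for this example only.

```python
from math import sqrt

data = [10, 14, 10, 6]
n = len(data)

# Step 1: the mean.
mean = sum(data) / n

# Step 2: squared distance of each observation from the mean.
squared_diffs = [(x - mean) ** 2 for x in data]

# Step 3: the variance is the average of those squared distances.
variance = sum(squared_diffs) / n

# Step 4: the standard deviation is the square root of the variance.
std_dev = sqrt(variance)

print(mean, squared_diffs, variance, round(std_dev, 2))
```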


Introduction to the Standard Deviation and Variance

Other Measures of Spread


5 Number Summary
In the previous sections, we have seen how to calculate the values
associated with the five-number summary (min, Q1, Q2, Q3, max), as well
as the measures of spread associated with these values (range and IQR).

For datasets that are not symmetric, the five-number summary and a
corresponding box plot are a great way to get started with
understanding the spread of your data. Although I still prefer a
histogram in most cases, box plots can make it easier to compare two or
more groups. You will see this in the quizzes towards the end of this
lesson.

Variance and Standard Deviation


Two additional measures of spread that are used all the time are
the variance and standard deviation. At first glance, the variance and
standard deviation can seem overwhelming. If you do not understand
the expressions below, don't panic! In this section, I just want to give
you an overview of what the next sections will cover. We will walk
through each of these parts thoroughly in the next few sections, but the
big picture goal is to generally understand the following:

1. How the mean, variance, and standard deviation are calculated.
2. Why the measures of variance and standard deviation make sense
   to capture the spread of our data.
3. Fields where you might see these values used.
4. Why we might use the standard deviation or variance as opposed
   to the values associated with the 5 number summary for a
   particular dataset.

Calculation
We calculate the variance in the following way:

$$ \frac{1}{n} \sum\limits_{i = 1}^n (x_i - \overline{x})^2 $$

The variance is the average squared difference of each observation
from the mean.

To calculate the variance of a set of 10 values in a spreadsheet
application, with our 10 data points in column A, we would create a new
column B by typing in something like =A1-AVERAGE(A$1:A$10) and
copying this down for all 10 rows. This gives us the difference
between each data point and the mean of all the data. Then we
create a new column C holding the square of these differences, using the
formula =B1^2 in cell C1 and copying that down for all rows. Then in
the cell below this new column, cell C11, we type in =SUM(C1:C10). This
adds up all the values in column C. Finally, in cell C12, we divide this
sum by the number of data points we have, in this case ten: =C11/10.
Cell C12 now contains the variance for our 10 data points.

More detailed guidance on using spreadsheets like this may be included
in a future lesson in your program.

The standard deviation is the square root of the variance. Therefore, the
formula for the standard deviation is the following:

$$ \sqrt{\frac{1}{n} \sum\limits_{i = 1}^n (x_i - \overline{x})^2} $$

In the same spreadsheet as above, to find the standard deviation of our
same set of 10 data values, we would use another cell like C13 to take
the square root of our variance measure, by typing in =sqrt(C12).
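
For readers who prefer code to a spreadsheet, the column-by-column procedure can be mirrored as follows. This is only a sketch: the ten values in `column_a` are hypothetical, and the variable names simply echo the cell references used above. The last two lines compare the result against Python's built-in `statistics.pvariance` and `statistics.pstdev`, which also divide by n.

```python
from math import sqrt
from statistics import pstdev, pvariance

# Hypothetical data standing in for cells A1:A10.
column_a = [3, 7, 7, 2, 9, 4, 6, 8, 5, 9]

mean = sum(column_a) / len(column_a)       # AVERAGE(A$1:A$10)
column_b = [a - mean for a in column_a]    # =A1-AVERAGE(A$1:A$10), copied down
column_c = [b ** 2 for b in column_b]      # =B1^2, copied down
c11 = sum(column_c)                        # =SUM(C1:C10)
c12 = c11 / len(column_a)                  # =C11/10, the variance
c13 = sqrt(c12)                            # =sqrt(C12), the standard deviation

print(c12, c13)
print(pvariance(column_a), pstdev(column_a))  # library equivalents
```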

The standard deviation is a measurement that has the same units as
our original data, while the units of the variance are the square of the
units in our original data. For example, if the units in our original data
were dollars, then units of the standard deviation would also be dollars,
while the units of the variance would be dollars squared.

Again, this section is designed as background knowledge for the
following sections. If it doesn't make sense on this first pass, do not
worry. You will be guided in future sections in performing these
calculations, and building your intuition, as you work through an
example using the salary data. Then we will provide context about why
these calculations are important, and where you might see them!

Why the Standard Deviation?



Standard deviation is a common metric used to compare the spread
of two datasets. The benefits of using a single metric instead of the 5
number summary are:

- It simplifies the amount of information needed to give a measure
  of spread
- It is useful for inferential statistics

Important Final Points

1. The variance is used to compare the spread of two different
groups. A set of data with higher variance is more spread out than
a dataset with lower variance. Be careful though, there might just
be an outlier (or outliers) that is increasing the variance when
most of the data are actually very close.
2. When comparing the spread between two datasets, the units of
each must be the same.
3. When data are related to money or the economy, higher variance
(or standard deviation) is associated with higher risk.
4. The standard deviation is used more often in practice than the
variance because it shares the units of the original dataset.
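
The caution in point 1 about outliers can be illustrated with a short sketch. The two datasets are made up, and the `iqr` helper assumes the median-of-halves convention for quartiles; neither comes from the lesson. Adding a single extreme value inflates the variance dramatically while the IQR barely moves.

```python
from statistics import median, pvariance

def iqr(values):
    """Interquartile range using the median-of-halves method."""
    data = sorted(values)
    half = len(data) // 2
    lower = data[:half]
    # For an odd count, the middle value belongs to neither half.
    upper = data[half:] if len(data) % 2 == 0 else data[half + 1:]
    return median(upper) - median(lower)

# Made-up data: a tightly clustered group, then the same group plus one outlier.
tight = [9, 10, 10, 10, 11, 11, 12, 12]
with_outlier = tight + [60]

for name, data in (("tight", tight), ("with outlier", with_outlier)):
    print(f"{name}: variance = {pvariance(data):.2f}, IQR = {iqr(data)}")
```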
Use in the World
The standard deviation is associated with risk in finance, assists in
determining the significance of drugs in medical studies, and measures
the error of our results for predicting anything from the amount of
rainfall we can expect tomorrow to your predicted commute time
tomorrow.

These applications are beyond the scope of this lesson as they pertain
to specific fields, but know that understanding the spread of a particular
set of data is extremely important to many areas. In this lesson, you
mastered the calculation of the most common measures of spread.
Measures of Center and Spread Summary

Recap

Variable Types
We have covered a lot up to this point! We started with identifying data
types as either categorical or quantitative. We then learned we could identify
quantitative variables as either continuous or discrete. We also found we could
identify categorical variables as either ordinal or nominal.

Categorical Variables
When analyzing categorical variables, we commonly just look at the
count or percent of a group that falls into each level of a category. For
example, if we had two levels of a dog category, lab and not lab, we might
say 32% of the dogs were labs (percent), or we might say 32 of the 100
dogs I saw were labs (count).
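
A quick sketch of the count and percent summaries, using the 32-of-100 figures from the example above. The list of labels is constructed for illustration; a real dataset would hold one recorded label per dog.

```python
from collections import Counter

# Build 100 labels matching the example: 32 labs and 68 non-labs.
breeds = ["lab"] * 32 + ["not lab"] * 68

counts = Counter(breeds)  # count of dogs in each level
percents = {level: 100 * count / len(breeds) for level, count in counts.items()}

print(counts)
print(percents)
```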

However, the 4 aspects associated with describing quantitative variables
are not used to describe categorical variables.

Quantitative Variables
Then we learned there are four main aspects used to
describe quantitative variables:

1. Measures of Center

2. Measures of Spread

3. Shape of the Distribution

4. Outliers

We looked at calculating measures of Center

1. Means

2. Medians
3. Modes

We also looked at calculating measures of Spread

1. Range

2. Interquartile Range

3. Standard Deviation

4. Variance

Calculating Variance
We saw that we could calculate the variance as:

$$ \frac{1}{n} \sum\limits_{i = 1}^n (x_i - \overline{x})^2 $$

You will also see:

$$ \frac{1}{n-1} \sum\limits_{i = 1}^n (x_i - \overline{x})^2 $$

The reason for this is beyond the scope of what we have covered thus
far, but you can find an explanation here.

You can commonly find answers to your questions with a quick Google
search. Now is a great time to get started with this practice! This
answer should make more sense at the completion of this lesson.
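
To see the two formulas side by side in code, Python's standard `statistics` module provides both: `pvariance` divides by n and `variance` divides by n - 1. The sketch below only shows that the two give different numbers on the dataset 10, 14, 10, 6; it does not explain why one would be preferred over the other.

```python
from statistics import pvariance, variance

data = [10, 14, 10, 6]

divide_by_n = pvariance(data)          # sum of squared differences / n
divide_by_n_minus_1 = variance(data)   # sum of squared differences / (n - 1)

print(divide_by_n, divide_by_n_minus_1)
```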

Standard Deviation vs. Variance


The standard deviation is the square root of the variance. In practice,
you usually use the standard deviation rather than the variance. This is
because the standard deviation shares the same units as our original
data, while the variance has squared units.

What Next?
In the next sections, we will be looking at the last two aspects of
quantitative variables: shape and outliers. What we know about
measures of center and measures of spread will assist in your
understanding of these final two aspects.
Supporting Materials

- Calculating Variance
