Nanodegree
Throughout this lesson, you will learn how to calculate these measures, as well as
why we would use one measure of spread over another.
Histograms
Histograms are super useful for understanding the different aspects of data and they are the
most common visual used for quantitative data. In the upcoming concepts, you will see
histograms used all the time to help you understand the four aspects we outlined earlier
regarding a quantitative variable:
center
spread
shape
outliers
Visually, the difference between the histograms is the range or spread of dogs Josh sees
during each time period. In the upcoming lessons, we will discuss the most common ways
to measure the spread of our data.
In the above video, we saw that calculating each of these values essentially comes down to
finding the median of several different datasets. Because each value is a median, the
calculation depends on whether we have an odd or even number of values.
Range
The range is then calculated as the difference between the maximum and the minimum.
IQR
The interquartile range is calculated as the difference between $$ Q_3 $$ and $$ Q_1 $$.
In the upcoming sections, you will practice this with Katie and on your own.
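As a rough sketch of the calculation described above, the Python below finds the quartiles by taking medians of the lower and upper halves of the sorted data, then computes the range and IQR from them. The function name `five_number_summary` and both sample datasets are hypothetical, chosen only for illustration. The sketch assumes the convention that, with an odd number of values, the middle value is left out of both halves; other tools (for example, NumPy's default percentile interpolation) can return slightly different quartiles.

```python
from statistics import median


def five_number_summary(values):
    """Return min, Q1, median (Q2), Q3, and max using medians of the two halves."""
    data = sorted(values)
    half = len(data) // 2
    lower = data[:half]
    # With an odd count, the middle value is the median and is left out of both halves.
    upper = data[half:] if len(data) % 2 == 0 else data[half + 1:]
    return {
        "min": data[0],
        "q1": median(lower),
        "median": median(data),
        "q3": median(upper),
        "max": data[-1],
    }


# Hypothetical datasets: one with an even count, one with an odd count.
for sample in ([7, 1, 3, 9, 5, 11], [7, 1, 3, 9, 5]):
    summary = five_number_summary(sample)
    data_range = summary["max"] - summary["min"]  # range = maximum - minimum
    iqr = summary["q3"] - summary["q1"]           # IQR = Q3 - Q1
    print(summary, "range:", data_range, "IQR:", iqr)
```

The function needs at least two values, since each half must contain something to take a median of.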
Quiz Question
Identify the following for this dataset:
1, 5, 10, 3, 8, 12, 4, 1, 2, 8
Answer choices: 10, 9, 11, 8, 2, 5, 4.5
Items to match: Range, First Quartile, Third Quartile, Median
Quiz Question
Identify the following for this dataset:
5, 10, 3, 8, 12, 4, 1, 2, 8
Answer choices: 5, 4.5, 9, 9.5, 2.5, 11, 10
Items to match: Range, First Quartile, Third Quartile, Median
Box plots are useful for quickly comparing the spread of two data sets across some
key metrics, like quartiles, maximum, and minimum.
1. The beginning of the line to the left of the box and the end of the line to the right of
the box represent the minimum and maximum values in a dataset.
2. The visual distance between these markings is an indication of the range of the values.
3. The box itself represents the IQR. The box begins at the Q1 value, ends at the Q3
value, and Q2, or the median, is represented by a line within the box.
From both the histograms and box plots, we can see that the number of dogs seen on
weekends varies much more than on weekdays.
However, instead of depending on a visual of the five-number summary to compare our data, in
the next lesson we will learn about using a single value to compare the spreads of the two
distributions: the standard deviation.
In the above video, we saw this as how far individuals were from the average distance from
work. (The example distances shown are a few values from the full data set; the mean of just
those 4 numbers is 38.5. The mean of 18 shown later in the video is the mean of the full data
set, which is not shown in the video.) In the next video, you will see exactly how this is
calculated.
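Squared difference of each observation from the mean, $$ (x_i - \overline{x})^2 $$:
$$ (10 - 10)^2 = 0 $$
$$ (14 - 10)^2 = 16 $$
$$ (10 - 10)^2 = 0 $$
$$ (6 - 10)^2 = 16 $$
1. Calculate the **variance**, the average squared difference of each observation from the mean:
$$ (0 + 16 + 0 + 16) / 4 = 8 $$
2. Calculate the **standard deviation**, the square root of the variance:
$$ \sqrt{8} \approx 2.83 $$
The standard deviation is, on average, how far each point in our dataset is from the mean.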
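If you would like to check these steps outside of a spreadsheet, here is a minimal Python sketch of the same arithmetic. It uses the four observations from the worked example (10, 14, 10, and 6) and divides by n, as in the steps above; the variable names are just illustrative.

```python
from math import sqrt

values = [10, 14, 10, 6]  # the four observations from the worked example

mean = sum(values) / len(values)

# Squared difference of each observation from the mean.
squared_diffs = [(x - mean) ** 2 for x in values]

# Variance: the average of the squared differences (dividing by n).
variance = sum(squared_diffs) / len(values)

# Standard deviation: the square root of the variance.
std_dev = sqrt(variance)

print(squared_diffs)
print(variance)
print(round(std_dev, 2))
```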
For datasets that are not symmetric, the five-number summary and a
corresponding box plot are a great way to get started with
understanding the spread of your data. Although I still prefer a
histogram in most cases, box plots can make it easier to compare two or
more groups. You will see this in the quizzes towards the end of this
lesson.
Calculation
We calculate the variance in the following way:
$$ \frac{1}{n}\sum_{i=1}^{n}(x_i - \overline{x})^2 $$
The variance is the average squared difference of each observation
from the mean.
The standard deviation is the square root of the variance. Therefore, the
formula for the standard deviation is the following:
$$ \sqrt{\frac{1}{n}\sum_{i=1}^{n}(x_i - \overline{x})^2} $$
In the same spreadsheet as above, to find the standard deviation of the
same set of 10 data values, we would use another cell, such as C13, to take
the square root of our variance measure by typing =sqrt(C12).
These applications are beyond the scope of this lesson as they pertain
to specific fields, but know that understanding the spread of a particular
set of data is extremely important to many areas. In this lesson, you
mastered the calculation of the most common measures of spread.
Measures of Center and Spread Summary
Recap
Variable Types
We have covered a lot up to this point! We started with identifying data
types as either categorical or quantitative. We then learned we could identify
quantitative variables as either continuous or discrete. We also found we could
identify categorical variables as either ordinal or nominal.
Categorical Variables
When analyzing categorical variables, we commonly just look at the
count or percent of a group that falls into each level of a category. For
example, if we had two levels of a dog category, lab and not lab, we might
say that 32% of the dogs were labs (percent), or that 32 of the 100
dogs I saw were labs (count).
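A small sketch of the count and percent for the dog example is below. The list of observations is made up to match the numbers in the text (32 labs out of 100 dogs), and the names `dogs`, `counts`, and `percents` are only illustrative.

```python
from collections import Counter

# Hypothetical observations: 32 labs and 68 other dogs, 100 in total.
dogs = ["lab"] * 32 + ["not lab"] * 68

# Count of observations in each level of the category.
counts = Counter(dogs)

# Percent of observations in each level of the category.
percents = {level: 100 * count / len(dogs) for level, count in counts.items()}

print(counts["lab"], percents["lab"])
```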
Quantitative Variables
Then we learned there are four main aspects used to
describe quantitative variables:
1. Measures of Center
2. Measures of Spread
3. Shape
4. Outliers
Measures of Center
1. Means
2. Medians
3. Modes
Measures of Spread
1. Range
2. Interquartile Range
3. Standard Deviation
4. Variance
Calculating Variance
We saw that we could calculate the variance as:
$$ \frac{1}{n}\sum_{i=1}^{n}(x_i - \overline{x})^2 $$
You will also see:
$$ \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \overline{x})^2 $$
The reason for this is beyond the scope of what we have covered thus
far, but you can find an explanation here.
You can commonly find answers to your questions with a quick Google
search. Now is a great time to get started with this practice! This
answer should make more sense at the completion of this lesson.
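To see how the two versions of the formula differ in practice, here is a short Python sketch that computes both by hand and compares them with the standard library. It reuses the values 10, 14, 10, and 6 from the earlier worked example. In Python's `statistics` module, `pvariance` divides by n and `variance` divides by n - 1.

```python
from statistics import pvariance, variance

values = [10, 14, 10, 6]  # values from the earlier worked example
n = len(values)
mean = sum(values) / n
sum_squared_diffs = sum((x - mean) ** 2 for x in values)

# Dividing by n, as in the first formula.
variance_n = sum_squared_diffs / n

# Dividing by n - 1, as in the second formula.
variance_n_minus_1 = sum_squared_diffs / (n - 1)

print(variance_n, pvariance(values))
print(variance_n_minus_1, variance(values))
```

The n - 1 version is always a little larger, and the gap shrinks as the number of values grows.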
What Next?
In the next sections, we will be looking at the last two aspects of
quantitative variables: shape and outliers. What we know about
measures of center and measures of spread will assist in your
understanding of these final two aspects.