Descriptive and Inferential Statistics
I. Descriptive statistics. This is a set of methods to describe data that we have collected.
Example: Of 350 randomly selected people in the town of Luserna, Italy, 280 people had the last
name Nicolussi. An example of descriptive statistics is the following statement :
Example: On the last 3 Sundays, Henry D. Car salesman sold 2, 1, and 0 new cars respectively.
An example of descriptive statistics is the following statement:
These are both descriptive statements because they can actually be verified from the information
provided.
II. Inferential statistics. This is a set of methods used to make a generalization, estimate,
prediction or decision.
Example: Of 350 randomly selected people in the town of Luserna, Italy, 280 people had the last
name Nicolussi. An example of inferential statistics is the following statement:
"80% of all people living in Italy have the last name Nicolussi."
We have no information about all people living in Italy, just about the 350 living in Luserna. We
have taken that information and generalized it to talk about all people living in Italy. The easiest
way to tell that this statement is not descriptive is by trying to verify it based upon the
information provided.
Example: On the last 3 Sundays, Henry D. Car salesman sold 2, 1, and 0 new cars respectively.
Examples of inferential statistics are the following statements:
Although this statement is true for the last 3 Sundays, we do not know that this is true for all
Sundays.
"Henry is selling fewer cars lately because people have caught on to his dirty tricks."
There is nothing in the information given that tells us that this statement is true.
"Henry sold 0 cars last Sunday because he fell asleep in one of the cars on the lot."
Adapted from: http://infinity.cos.edu/faculty/woodbury/Stats/Tutorial/TOC1.htm
Accessed: 4.11.2008
The major use of inferential statistics is to use information from a sample to infer something
about a population.
Questions
1) The last four semesters an instructor taught Intermediate Algebra, the following
numbers of people passed the class.
17 19 4 20
Which of the following conclusions can be obtained from purely descriptive measures and
which can be obtained by inferential methods?
a) The last four semesters the instructor taught Intermediate Algebra, an average of 15 people
passed the class.
b) The next time the instructor teaches Intermediate Algebra, we can expect approximately 15
people to pass the class.
c) This instructor will never pass more than 20 people in an Intermediate Algebra class.
d) The last four semesters the instructor taught Intermediate Algebra, no more than 20 people
passed the class.
e) Only 4 people passed one semester because the instructor was in a bad mood the entire
semester.
f) The instructor passed 20 people the last time he taught the class to keep the administration off
of his back for poor results.
g) The instructor passes so few people in his Intermediate Algebra classes because he doesn't like
teaching that class.
2) During the last week, Tony Gwynn of the San Diego Padres recorded the following
number of hits.
Which of the following conclusions can be obtained from purely descriptive methods and which
can be obtained by inferential methods?
b) Tony had 0 hits on Thursday because he used a bat that belonged to another player.
c) During the last week, Tony averaged 2 hits per game.
e) Tony had the same total number of hits in the first 3 games as he did in the last 4 games.
65 91 85 76 85 87 79 93
82 75 100 70 88 78 83 59
87 69 89 54 74 89 83 80
94 67 77 92 82 70 94 84
96 98 46 70 90 96 88 72
It's hard to get a feel for this data in this format because it is unorganized. To construct a
frequency distribution, you should first identify the lowest and highest values in the list. We do
this because we want to be sure that each value in the list fits into one of our categories. The low
value here is 46, and the high is 100. A set of categories that would work here is 41-50, 51-60,
61-70, 71-80, 81-90, and 91-100. Here's a finished product:
Class Frequency
41-50 1
51-60 2
61-70 6
71-80 8
81-90 14
91-100 9
We can now see that the largest number of test scores fell between 81 and 90, and that most of
the scores (31 of 40) were between 71 and 100.
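The counting step described above can be sketched in a few lines of Python. This is only an illustration: the function name `frequency_distribution` is made up here, and the class limits are taken from the text and treated as inclusive at both ends.

```python
# Test scores from the example above.
scores = [
    65, 91, 85, 76, 85, 87, 79, 93,
    82, 75, 100, 70, 88, 78, 83, 59,
    87, 69, 89, 54, 74, 89, 83, 80,
    94, 67, 77, 92, 82, 70, 94, 84,
    96, 98, 46, 70, 90, 96, 88, 72,
]

# (lower class limit, upper class limit), both inclusive, as chosen in the text.
classes = [(41, 50), (51, 60), (61, 70), (71, 80), (81, 90), (91, 100)]


def frequency_distribution(values, class_limits):
    """Count how many values fall inside each (lower, upper) class."""
    return [
        sum(lower <= v <= upper for v in values)
        for lower, upper in class_limits
    ]


frequencies = frequency_distribution(scores, classes)

print("Class     Frequency")
for (lower, upper), count in zip(classes, frequencies):
    label = f"{lower}-{upper}"
    print(f"{label:<10}{count}")
```

Because the classes do not overlap and together cover 41 through 100, the counts should add up to the number of scores, which is a quick way to check the table.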
The low number in each category (or class) is called the lower class limit, and the high number is
called the upper class limit.
• Each value should fit into a category. The classes should be collectively exhaustive.
• No value should fit into more than 1 category. The classes should be mutually exclusive;
there should be no overlapping of classes.
• Make the classes of equal size if possible. This makes it easier to compare the frequency
in one class to another.
• Avoid open-ended classes, such as "75 and over", if possible.
After the first two rules above, the rest are merely suggestions. Some sets of data may require
you to violate some of these suggestions. The best advice is to try to follow them whenever
possible.
One further extension to the frequency distribution is to look at the percentage of values that
show up in each category. This is called a relative frequency distribution or percent
frequency distribution. Here's how the above data would be presented in this way.
Class     Frequency   Relative Frequency   Percent
41-50     1           1/40                 2.5%
51-60     2           2/40                 5%
61-70     6           6/40                 15%
71-80     8           8/40                 20%
81-90     14          14/40                35%
91-100    9           9/40                 22.5%
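As a hedged sketch, the percent column can be computed from the frequencies like this. The dictionary layout and variable names are assumptions made for the example.

```python
# Frequencies from the frequency distribution above.
frequencies = {
    "41-50": 1,
    "51-60": 2,
    "61-70": 6,
    "71-80": 8,
    "81-90": 14,
    "91-100": 9,
}

total = sum(frequencies.values())  # 40 test scores in all

# Relative frequency is count / total; multiply by 100 to get a percent.
percents = {label: 100 * count / total for label, count in frequencies.items()}

print("Class     Frequency   Relative Frequency   Percent")
for label, count in frequencies.items():
    fraction = f"{count}/{total}"
    print(f"{label:<10}{count:<12}{fraction:<21}{percents[label]:g}%")
```

The percents of a relative frequency distribution should always total 100%, apart from rounding.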
The final frequency distribution that we will discuss is the cumulative frequency distribution.
Think about the word cumulative: it generally refers to some sort of total. A cumulative
frequency distribution is a way to list how many values fit into the first class, the first 2 classes,
the first 3 classes, etc., or the last class, the last 2 classes, etc. Here's a cumulative less than
frequency distribution for the above set of data.
Class     Frequency   Cumulative Frequency (Less Than)
41-50     1           1
51-60     2           3
61-70     6           9
71-80     8           17
81-90     14          31
91-100    9           40
The 1 means that there is 1 value that is 50 or less, the 3 means that there are 3 values that are 60
or less, the 9 means that there are 9 values that are 70 or less, and so on.
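A cumulative less than column is just a running total of the frequencies. One possible sketch uses `itertools.accumulate` from the Python standard library. The variable names are illustrative.

```python
from itertools import accumulate

class_labels = ["41-50", "51-60", "61-70", "71-80", "81-90", "91-100"]
frequencies = [1, 2, 6, 8, 14, 9]

# Running total from the first class upward.
cumulative_less_than = list(accumulate(frequencies))

for label, running in zip(class_labels, cumulative_less_than):
    upper_limit = label.split("-")[1]
    print(f"{running} values are {upper_limit} or less")
```

The last entry of a cumulative less than column always equals the total number of values.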
Now for a cumulative greater than frequency distribution.
Class     Frequency   Cumulative Frequency (Greater Than)
41-50     1           40
51-60     2           39
61-70     6           37
71-80     8           31
81-90     14          23
91-100    9           9
The 40 means that there are 40 values that are 41 or more, the 39 means that there are 39 values
that are 51 or more, the 37 means that there are 37 values that are 61 or more, and so on.
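The greater than version runs the same total from the last class backward. The sketch below is one way to do that. The final check is an extra sanity test added for illustration, not part of the original discussion.

```python
from itertools import accumulate

frequencies = [1, 2, 6, 8, 14, 9]
total = sum(frequencies)

# Accumulate from the last class backward, then restore the original order.
cumulative_greater_than = list(accumulate(reversed(frequencies)))[::-1]

# Sanity check: the values below a class plus the values in that class or
# above must account for every value in the data set.
cumulative_less_than = list(accumulate(frequencies))
consistent = all(
    cumulative_less_than[i - 1] + cumulative_greater_than[i] == total
    for i in range(1, len(frequencies))
)

print(cumulative_greater_than)
print("Consistent with the less than column:", consistent)
```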
65 91 85 76 85 87 79 93
82 75 100 70 88 78 83 59
87 69 89 54 74 89 83 80
94 67 77 92 82 70 94 84
96 98 46 70 90 96 88 72
Class Frequency
41-50 1
51-60 2
61-70 6
71-80 8
81-90 14
91-100 9
Here's the histogram that goes with the frequency distribution. This was done using Minitab.
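A rough text version of the same histogram can be produced in a few lines of Python. This sketch is not Minitab output. It only draws one asterisk per test score so the shape of the distribution is visible.

```python
class_labels = ["41-50", "51-60", "61-70", "71-80", "81-90", "91-100"]
frequencies = [1, 2, 6, 8, 14, 9]

# One asterisk per test score in the class.
bars = {label: "*" * count for label, count in zip(class_labels, frequencies)}

for label, bar in bars.items():
    print(f"{label:>6} | {bar} ({len(bar)})")
```

The longest bar belongs to the 81-90 class, matching the largest frequency in the table.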
Example: 1200 students at the College of the Sequoias were polled and asked about the number
of parking spaces on campus. Here are the results:
Response Frequency
Too Many 300
About Right 360
Not Enough 540
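To draw a pie chart of these results manually:
First convert each frequency to a percentage of the total: 300/1200 = 25%, 360/1200 = 30%, and
540/1200 = 45%.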
Next draw a circle. This circle will represent 100% of the values. Place marks every 5%, keeping
in mind that one-quarter of a circle represents 25%.
Next starting at the top, move in the clockwise direction until you reach 25%. Then move
another 30% (until you reach 55%).
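The arithmetic behind the pie chart can be sketched as follows. The running percentage gives the marks on the circle (25%, then 55%, then 100%). The slice angles in degrees are an extra not mentioned above, using the fact that a full circle is 360 degrees.

```python
responses = {"Too Many": 300, "About Right": 360, "Not Enough": 540}
total = sum(responses.values())  # 1200 students polled

slices = []
running_percent = 0.0
for label, count in responses.items():
    percent = 100 * count / total         # share of the whole circle
    running_percent += percent            # where this slice ends, going clockwise
    degrees = 360 * percent / 100         # angle of the slice
    slices.append((label, percent, running_percent, degrees))

for label, percent, end_mark, degrees in slices:
    print(f"{label:<12}{percent:>5.1f}%  ends at {end_mark:>5.1f}%  ({degrees:.0f} degrees)")
```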
Mean
The mean is the most powerful, and usually the most accurate and reliable, measure of central
tendency. Usually, when we hear the word "average", what we are really thinking about is the
mean. To find the mean for a set of data, we take the sum of all of the values, and divide the sum
by how many values there are. If we are looking for the mean of a sample, we denote that mean
by $\bar{x}$. This is read "x-bar":
$$\bar{x} = \frac{\sum x}{n}$$
where $n$ is the number of values.
If we are looking for the mean of a population, we denote that mean by the Greek letter $\mu$, mu.
The way to calculate this mean is the same. The difference in notation is to tell a sample statistic,
$\bar{x}$, from a population parameter, $\mu$. We will always use our own alphabet when discussing a
sample statistic, and the Greek alphabet to discuss a population parameter.
Joe D. Student got the following scores on his 5 statistics exams : 89, 83, 71, 95, 73. Find Joe's
mean test score.
$$\mu = \frac{89 + 83 + 71 + 95 + 73}{5} = \frac{411}{5} = 82.2$$
So, why $\mu$ and not $\bar{x}$? Since these represent all of Joe's 5 tests, we treat it as a population. But
again the difference here is only in name.
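For readers who prefer to check the arithmetic by computer, here is a minimal sketch of the same calculation. The function name `mean` is just a label chosen for this example.

```python
def mean(values):
    """Sum of the values divided by how many values there are."""
    return sum(values) / len(values)


joe_scores = [89, 83, 71, 95, 73]

# All 5 of Joe's tests are included, so this is a population mean (mu).
mu = mean(joe_scores)

print(f"Sum of scores: {sum(joe_scores)}")
print(f"Mean test score: {mu:.1f}")
```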
It will be to your advantage to be able to use your calculator to compute the mean. Your
calculator has a built-in way to calculate the mean of a set of data. Most non-graphing calculators
use some or all of the following steps.
2. Make sure that your statistical registers are cleared. These are the memory locations where
your calculator stores the values.
3. Enter your numbers into the calculator by pressing the number and then hitting the key that
will "store" the number in the statistical registers. The key will either have $\Sigma$+, M+, or Data on
it.
4. Once all the numbers have been entered, push the key with $\bar{x}$ over it. Usually, you will have to
push the 2nd key or the Shift key or the Inv key.
Unlike the mode, the mean is unique: there is sometimes more than one mode for a set of data.
Seven houses were sold last week in Visalia. Here are the selling prices : $94,900, $97,900,
$99,900, $100,900, $102,900, $107,900, and $1,250,000. Find the average selling price.
So, the average selling price is approximately $264,914.29. Is this a typical value for the set of
data? No. This is over $150,000 more than 6 of the values, and nearly $1,000,000 less
than the seventh value. The problem here is that the $1,250,000 home is an extreme outlier for
this set of data, and has influenced the mean. If you were to calculate the mean without the
outlier, you would come up with a value of approximately $100,733.33. This number accurately
describes the other six values. Another thing to try would be to find the median.
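The effect of the outlier on the mean can be reproduced with a short sketch. Picking the single largest price as "the outlier" is an assumption made for this illustration. It is not a general rule for detecting outliers.

```python
prices = [94_900, 97_900, 99_900, 100_900, 102_900, 107_900, 1_250_000]


def mean(values):
    """Sum of the values divided by how many values there are."""
    return sum(values) / len(values)


mean_all = mean(prices)

# Assumption for illustration: treat the single largest price as the outlier.
outlier = max(prices)
without_outlier = [p for p in prices if p != outlier]
mean_without = mean(without_outlier)

print(f"Mean of all seven prices: ${mean_all:,.2f}")
print(f"Mean without the outlier: ${mean_without:,.2f}")
print(f"Mean minus the 6th-highest price: ${mean_all - max(without_outlier):,.2f}")
print(f"Outlier minus the mean: ${outlier - mean_all:,.2f}")
```

The last two lines show how far the mean sits from the six ordinary prices on one side and from the outlier on the other.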
Exercise: Mean
1) In trying to estimate Joe D. Bowler's mean bowling score, six of his games are selected at
random. The scores are 187, 169, 172, 209, 154, and 195. Find the mean for these six scores.
2) During the first 5 weeks of the 1995 NFL season, the San Francisco Forty Niners gained the
following number of yards rushing : 154, 158, 90, 78, and 109. George Seifert was interested in
his team's rushing performance through the first 5 weeks of the season. Find the mean rushing
yardage.
3) "I wonder how many points, on average, are scored by a typical NFL team in a game?"
wonders Joe D. Sports fan. Joe gets his local paper that shows all of this weekend’s results. Here
are the points scored that week :
24, 21, 14, 17, 10, 3, 20, 23, 24, 22, 21, 6, 17, 14, 20, 23, 14, 52, 7, 17, 34, 10, 7, 27, 14, 31, 7,
22, 35, 0
Example: Find the median of the following numbers : 98, 86, 46, 63, 66, 94, 31, 56, 51, 75, 48.
First put them in order : 31, 46, 48, 51, 56, 63, 66, 75, 86, 94, 98.
There are 11 values, so we can get 2 groups of 5 with one left over. That leftover middle value,
63, is the median.
Example: Find the median of : 93, 90, 62, 44, 75, 89, 74, 100, 78, 61, 78, 81, 57, 67.
First put them in order : 44, 57, 61, 62, 67, 74, 75, 78, 78, 81, 89, 90, 93, 100.
There are 14 values, so we can get 2 groups of 7 with none left over.
44, 57, 61, 62, 67, 74, 75 78, 78, 81, 89, 90, 93, 100
The median is found by taking the mean of 75 and 78. The median is 76.5.
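Both cases, an odd and an even number of values, can be handled by one small function. This is a sketch that follows the procedure used in the two examples: sort the values, then take the middle value or the mean of the two middle values. The name `median` is chosen for illustration. Python's standard `statistics.median` does the same job.

```python
def median(values):
    """Middle value of the sorted data, or the mean of the two middle values."""
    ordered = sorted(values)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        # Odd count: two equal groups with one value left over in the middle.
        return ordered[middle]
    # Even count: nothing left over, so average the two values at the split.
    return (ordered[middle - 1] + ordered[middle]) / 2


odd_example = [98, 86, 46, 63, 66, 94, 31, 56, 51, 75, 48]
even_example = [93, 90, 62, 44, 75, 89, 74, 100, 78, 61, 78, 81, 57, 67]

print(median(odd_example))
print(median(even_example))
```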
The median is not sensitive to outliers in the way that the mean is. Let's take a look at the same
example from the mean section regarding the selling price of the 7 homes sold in Visalia last
week.
Example: Seven houses were sold last week in Visalia. Here are the selling prices : $94,900,
$97,900, $99,900, $100,900, $102,900, $107,900, and $1,250,000. Find the median selling price.
There are seven values, so we can get 2 groups of 3 with 1 left over.
The median is $100,900. Is this value more representative of the set of values than the mean of
approximately $264,914.29? Yes. Since the median uses only the value or values in the center of the
list, the outlier is not used.
The median is used in cases where extreme outliers occur, like real estate prices (there are some
really expensive houses) and household income (some people make a lot of money). Can you
think of some other situations that would require the median?
Example: Find the mode for the following set of data : 4, 6, 6, 7, 11, 11, 11, 12
Ans. The mode is 11, because it occurs more times (3) than any other number.
One weakness of the mode is that sometimes a set of data can have more than one mode.
Example: Find the mode for the following set of data : 4, 6, 6, 6, 7, 11, 11, 11, 12
Ans. The modes are 6 and 11, because each occurs 3 times.
Sometimes a set of data doesn't have a mode. This happens when no value is repeated in the set.
Example: Find the mode for the following set of data: 4, 5, 6, 7, 10, 11, 12, 13
Ans. There is no mode, because no value is repeated.
So, sometimes a set of data has more than one mode, and sometimes a set of data doesn't even
have a mode. Another weakness is that the mode occasionally is not a typical value for the set of
data. Consider the set of values : 5, 5, 73, 75, 77, 78, 79, 80, 82, 83, 84. The mode is 5, but is 5
representative of this set of values? Of course not! This set of values, with the exception of the
two outliers of 5, is made up of values in the 70's and 80's. If you were told that the mode for a
set of data was 5, and you did not see the actual values, would you guess that most of the
numbers were in the 70's and 80's? Probably not.
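The three situations above (one mode, several modes, no mode) can be sketched with `collections.Counter` from the Python standard library. The function name `modes` and the choice to return an empty list when nothing repeats are conventions adopted for this example. Note that Python's own `statistics.multimode` behaves differently in the no-repeat case and returns every value.

```python
from collections import Counter


def modes(values):
    """Return every value that occurs most often, or [] if nothing repeats."""
    if not values:
        return []
    counts = Counter(values)
    highest = max(counts.values())
    if highest == 1:
        # No value is repeated, so the data set has no mode.
        return []
    return sorted(v for v, c in counts.items() if c == highest)


print(modes([4, 6, 6, 7, 11, 11, 11, 12]))
print(modes([4, 6, 6, 6, 7, 11, 11, 11, 12]))
print(modes([4, 5, 6, 7, 10, 11, 12, 13]))
print(modes([5, 5, 73, 75, 77, 78, 79, 80, 82, 83, 84]))
```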
Example: A student took 5 exams in a class and had scores of 92, 75, 95, 90, and 98. Find the
variance for her test scores.
We will treat these 5 test scores as a population, since there is no suggestion that there are more
than 5 tests.
Example: Five students took an experimental exam and had scores of 92, 75, 95, 90, and 98.
Find the variance for their test scores.
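The two examples use the same five numbers, so the difference lies in whether the data are treated as a population or as a sample. The sketch below uses the standard textbook definitions: the population variance divides the sum of squared deviations by the number of values, and the sample variance divides by one less. Treating the second example as a sample is an assumption here, based on the exam being experimental and presumably given to only some students. The function names are illustrative.

```python
def population_variance(values):
    """Sum of squared deviations from the mean, divided by N."""
    mu = sum(values) / len(values)
    return sum((x - mu) ** 2 for x in values) / len(values)


def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    x_bar = sum(values) / len(values)
    return sum((x - x_bar) ** 2 for x in values) / (len(values) - 1)


scores = [92, 75, 95, 90, 98]

# First example: all 5 of one student's exams, treated as a population.
print(f"Population variance: {population_variance(scores):.1f}")

# Second example (assumed to be a sample of students).
print(f"Sample variance:     {sample_variance(scores):.1f}")
```

With the same data, the sample variance is always the larger of the two, because the same sum of squared deviations is divided by a smaller number.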
http://www.stat.tamu.edu/stat30x/notes/node3.html
http://www.sdecnet.com/psychology/stathelp.htm
http://infinity.cos.edu/faculty/woodbury/Stats/Tutorial/Data_Descr_Infer.htm
http://onlinestatbook.com/chapter1/inferential.html
http://faculty.vassar.edu/lowry/webtext.html