Statistics
Using
Technology
Second Edition
By Kathryn Kozak
ISBN: 978-1-329-75725-7
Table of Contents:
Preface
Chapter 1: Statistical Basics
Section 1.1: What is Statistics?
Section 1.2: Sampling Methods
Section 1.3: Experimental Design
Section 1.4: How Not to Do Statistics
Index
Preface:
I hope you find this book useful in teaching statistics. When writing this book, I tried to
follow the GAISE Standards (GAISE recommendations. (2014, January 05). Retrieved
from http://www.amstat.org/education/gaise/GAISECollege_Recommendations.pdf),
which are
1.) Emphasize statistical literacy and develop statistical understanding.
2.) Use real data.
3.) Stress conceptual understanding, rather than mere knowledge of procedure.
4.) Foster active learning in the classroom.
5.) Use technology for developing concepts and analyzing data.
To this end, I ask students to interpret the results of their calculations. I incorporated the
use of technology for most calculations. Because of that you will not find me using any
of the computational formulas for standard deviations or correlation and regression since
I prefer students understand the concept of these quantities. Also, because I utilize
technology you will not find the standard normal table, Student’s t-table, binomial table,
chi-square distribution table, and F-distribution table in the book. The only tables I
provided were for critical values for confidence intervals since they are more difficult to
find using technology. Another difference between this book and other statistics books is
the order of hypothesis testing and confidence intervals. Most books present confidence
intervals first and then hypothesis tests. I find that presenting hypothesis testing first and
then confidence intervals is more understandable for students. Lastly, I have de-
emphasized the use of the z-test. In fact, I only use it to introduce hypothesis testing, and
never utilize it again. You may also notice that when I introduced hypothesis testing and
confidence intervals, proportions were introduced before means. However, when two-sample
tests and confidence intervals are introduced, I switched this order. This is
because many instructors do not discuss proportions for two samples.
However, you might try assigning problems for proportions without discussing them in class.
After doing two samples for means, the proportions are similar. Lastly, to aid student
understanding and interest, most of the homework and examples utilize real data. Again,
I hope you find this book useful for your introductory statistics class.
I want to make a comment about the mathematical knowledge that I assumed the students
possess. The course for which I wrote this book has a higher prerequisite than most
introductory statistics courses. However, I do feel that students can read and understand
this book as long as they have had basic algebra and can substitute numbers into formulas.
I do not show how to create most of the graphs, but most students should have been
exposed to them in high school. So I hope the mathematical level is appropriate for your
course.
The technology that I utilized for creating the graphs was Microsoft Excel, and I utilized
the TI-83/84 graphing calculator for most calculations, including hypothesis testing,
confidence intervals, and probability distributions. This is because these tools are readily
available to my students. Please feel free to use any other technology that is more
appropriate for your students. Do make sure that you use some technology.
Acknowledgments:
I would like to thank the following people for taking their valuable time to review the
book. Their comments and insights improved this book immensely.
I also want to thank Coconino Community College for granting me a sabbatical so that I
would have the time to write the book. Lastly, I want to thank my husband Rich and my
son Dylan for supporting me in this project. Without their love and support, I would not
have been able to complete the book.
On a personal note, I want to thank my brother, John Matic, his wife Jenelle, and their
children Hannah and Eli for their hospitality while I was writing the first edition. In addition to
allowing my family access to their home, John provided numerous examples and data
sets for business applications in this book. I inadvertently left this thank you out of the
first edition of the book, and for that I apologize. His help and his family’s hospitality
were invaluable to me.
Chapter 1: Statistical Basics
Statistics is the study of how to collect, organize, analyze, and interpret data collected
from a group.
There are two branches of statistics. One is called descriptive statistics, which is where
you collect and organize data. The other is called inferential statistics, which is where
you analyze and interpret data. First you need to look at descriptive statistics since you
will use the descriptive statistics when making inferences.
To understand how to create descriptive statistics and then conduct inferences, there are a
few definitions that you need to look at. Note, many of the words that are defined have
common definitions that are used in non-statistical terminology. In statistics, some have
slightly different definitions. It is important that you notice the difference and utilize the
statistical definitions.
The first thing to decide in a statistical study is whom you want to measure and what you
want to measure. You always want to make sure that you can answer the question of
whom you measured and what you measured. The who is known as the individual and
the what is the variable.
Individual – a person or object that you are interested in finding out information about.
Variable – the measurement or observation of the individual.
If you put the individual and the variable into one statement, then you obtain a population.
Population – set of all values of the variable for the entire group of individuals.
Notice, the population answers who you want to measure and what you want to measure.
Make sure that your population always answers both of these questions. If it doesn’t,
then you haven’t given someone who is reading your study the entire picture. As an
example, if you just say that you are going to collect data from the senators in the U.S.
Congress, you haven’t told your reader what you are going to collect. Do you want to
know their income, their highest degree earned, their voting record, their age, their
political party, their gender, their marital status, or how they feel about a particular issue?
Without telling what you want to measure, your reader has no idea what your study is
actually about.
Sometimes the population is very easy to collect. For example, if you are interested in
finding the average age of all of the current senators in the U.S. Congress, there are only
100 senators, so this wouldn’t be hard to find. However, if instead you were interested in
knowing the average age at which a senator in the U.S. Congress first took office, for all
senators who ever served in the U.S. Congress, then this would be a bit more work. It is
still doable, but it would take a bit of time to collect. But what if you are interested in
finding the average diameter at breast height of all of the Ponderosa Pine trees in the
Coconino National Forest? This would be impossible to actually collect. What do you
do in these cases? Instead of collecting the entire population, you take a smaller group of
the population, kind of a snapshot of the population. This smaller group is called a
sample.
Sample – a subset from the population. It looks just like the population, but contains less
data.
How you collect your sample can determine how accurate the results of your study are.
There are many ways to collect samples. No sampling method is perfect, but some create
better samples than others. Sampling techniques
will be discussed later. For now, realize that every time you take a sample you will find
different data values. The sample is a snapshot of the population, and there is more
information than is in the picture. The idea is to try to collect a sample that gives you an
accurate picture, but you will never know for sure if your picture is the correct picture.
Unlike previous mathematics classes where there was always one right answer, in
statistics there can be many answers, and you don’t know which are right.
Once you have your data, either from a population or a sample, you need to know how
you want to summarize the data. As an example, suppose you are interested in finding
the proportion of people who like a candidate, the average height a plant grows to using a
new fertilizer, or the variability of the test scores. Understanding how you want to
summarize the data helps to determine the type of data you want to collect. Since the
population is what you are interested in, you want to calculate a number from the
population. This is known as a parameter. As mentioned already, you usually can’t collect
the entire population. Even though the parameter is the number you are interested in, you
can’t really calculate it. Instead you use the number calculated from the sample, called a
statistic, to estimate the parameter. Since no two samples are exactly the same, the statistic
values are going to be different from sample to sample. They estimate the value of the
parameter, but again, you do not know for sure if your answer is correct.
Parameter – a number calculated from the population. Usually denoted with a Greek
letter. This number is a fixed, unknown number that you want to find.
Statistic – a number calculated from the sample. Usually denoted with letters from the
Latin alphabet, though sometimes there is a Greek letter with a ^ (called a hat) above it.
Since you can collect samples, a statistic is readily known, though it changes depending on
the sample taken. It is used to estimate the parameter value.
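As a rough illustration of the difference between a parameter and a statistic, the
following Python sketch builds a small made-up population, computes the population mean
(the parameter), and then computes the mean of several random samples (the statistics).
The ages are randomly generated for illustration and are not real data, and the function
name sample_means is an assumption of this sketch, not something from the book.

```python
import random
import statistics

def sample_means(population, sample_size, num_samples, seed=0):
    """Return the mean of each of num_samples random samples (the statistics)."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    return [
        statistics.mean(rng.sample(population, sample_size))
        for _ in range(num_samples)
    ]

# Hypothetical population: 100 made-up ages (NOT real data).
data_rng = random.Random(42)
population = [data_rng.randint(35, 85) for _ in range(100)]

# Parameter: calculated from the whole population, so it is one fixed number.
parameter = statistics.mean(population)

# Statistics: calculated from samples, so they can change from sample to sample.
means = sample_means(population, sample_size=10, num_samples=5)

print("Parameter (population mean):", parameter)
for i, m in enumerate(means, start=1):
    print(f"Statistic from sample {i}: {m}")
```

Running the sketch should show one fixed parameter and a list of sample means that
typically differ from one another, each one an estimate of that parameter.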
One last concept to mention is that there are two different types of variables – qualitative
and quantitative. Each type of variable has different parameters and statistics that you
find. It is important to know the difference between them.
There are two types of quantitative variables, called discrete and continuous. The
difference is in how many values the data can have. If you can actually count the number
of data values (even if you are counting to infinity), then the variable is called discrete. If
it is not possible to count the number of data values, then the variable is called continuous.
Discrete data can only take on particular values like integers. Discrete data are usually
things you count.
Continuous data can take on any value. Continuous data are usually things you measure.
There are also four measurement scales for different types of data, with each building
on the ones below it. They are:
Measurement Scales:
Nominal – data is just a name or category. There is no order to any data and since there
are no numbers, you cannot do any arithmetic on this level of data. Examples of this are
gender, car name, ethnicity, and race.
Ordinal – data that is nominal, but you can now put the data in order, since one value is
more or less than another value. You cannot do arithmetic on this data, but you can now
put data values in order. Examples of this are grades (A, B, C, D, F), place value in a
race (1st, 2nd, 3rd), and size of a drink (small, medium, large).
Interval – data that is ordinal, but you can now subtract one value from another and that
subtraction makes sense. You can do arithmetic on this data, but only addition and
subtraction. Examples of this are temperature (in degrees Fahrenheit or Celsius) and time on a clock.
Ratio – data that is interval, but you can now divide one value by another and that ratio
makes sense. You can now do all arithmetic on this data. Examples of this are height,
weight, distance, and time.
Nominal and ordinal data come from qualitative variables. Interval and ratio data come
from quantitative variables.
Most people have a hard time deciding if the data are nominal, ordinal, interval, or ratio.
First, if the variable is qualitative (words instead of numbers) then it is either nominal or
ordinal. Now ask yourself if you can put the data in a particular order. If you can it is
ordinal. Otherwise, it is nominal. If the variable is quantitative (numbers), then it is
either interval or ratio. For ratio data, a value of 0 means there is no measurement. This
is known as the absolute zero. If there is an absolute zero in the data, then it means it is
ratio. If there is no absolute zero, then the data are interval. An example of an absolute
zero is if you have $0 in your bank account, then you are without money. The amount of
money in your bank account is ratio data. A word of caution: sometimes ordinal data are
displayed using numbers, such as 5 being strongly agree and 1 being strongly disagree.
These numbers are not really numbers. Instead they are labels used to assign numerical
values to ordinal data. In reality you should not perform any computations on this data,
though many people do. If there are numbers, make sure the numbers are inherent numbers,
and not numbers that were assigned.
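The decision steps in the paragraph above can be sketched as a small Python function.
This is only an illustration: the function name measurement_scale and its three yes/no
arguments are assumptions made for this sketch, and you still have to answer the
questions about the variable yourself.

```python
def measurement_scale(is_quantitative, can_be_ordered=False, has_absolute_zero=False):
    """Classify data as nominal, ordinal, interval, or ratio."""
    if not is_quantitative:
        # Qualitative data: ordinal if the categories have an order, else nominal.
        return "ordinal" if can_be_ordered else "nominal"
    # Quantitative data: ratio if a value of 0 means "none", else interval.
    return "ratio" if has_absolute_zero else "interval"

examples = {
    "hair color": measurement_scale(False),
    "letter grade": measurement_scale(False, can_be_ordered=True),
    "time on a clock": measurement_scale(True),
    "money in a bank account": measurement_scale(True, has_absolute_zero=True),
    # Assigned numbers (1 = strongly disagree ... 5 = strongly agree) are labels,
    # so the variable is treated as qualitative here.
    "agreement rating 1-5": measurement_scale(False, can_be_ordered=True),
}

for name, scale in examples.items():
    print(f"{name}: {scale}")
```

Note that the rating example is passed in as qualitative even though it is displayed
with numbers, which matches the word of caution above.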
Solution:
This is interval since it is a number, but 0 o’clock means midnight and not
the absence of time.
Solution:
This is nominal since it is not a number, and there is no specific order for
hair color.
Solution:
This is ratio since it is a number, and if you take 0 minutes to take a test, it
means you didn’t take any time to complete it.
Solution:
This is ordinal since it is not a number, but you could put the data in order
from youngest to oldest or the other way around.
2.) You wish to estimate the mean cholesterol levels of patients two days after they
had a heart attack. To estimate the mean you collect data from 28 heart patients.
State the individual, variable, population, sample, parameter, and statistic.
3.) Print-O-Matic would like to estimate the mean salary of all its employees. To
accomplish this they collect the salaries of 19 employees. State the individual,
variable, population, sample, parameter, and statistic.
4.) To estimate the percentage of households in Connecticut which use fuel oil as a
heating source, a researcher collects information from 1000 Connecticut
households about what fuel is their heating source. State the individual, variable,
population, sample, parameter, and statistic.
5.) The U.S. Census Bureau needs to estimate the median income of males in the
U.S. They collect incomes from 2500 males. State the individual, variable,
population, sample, parameter, and statistic.
6.) The U.S. Census Bureau needs to estimate the median income of females in the
U.S. They collect incomes from 3500 females. State the individual, variable,
population, sample, parameter, and statistic.
7.) Eyeglassmatic manufactures eyeglasses and they would like to know the
percentage of each defect type made. They review 25,891 defects and classify
each defect that is made. State the individual, variable, population, sample,
parameter, and statistic.
8.) The World Health Organization wishes to estimate the mean density of people per
square kilometer. They collect data on 56 countries. State the individual, variable,
population, sample, parameter, and statistic.
When you choose a sample you want it to be as similar to the population as possible. If
you want to test a new painkiller for adults you would want the sample to include people
who are fat, skinny, old, young, healthy, not healthy, male, female, etc.
There are many ways to collect a sample. None are perfect, and you are not guaranteed
to collect a representative sample. That is, unfortunately, a limitation of sampling.
However, there are several techniques that can result in samples that give you a
semi-accurate picture of the population. Just remember that the sample may not be
representative. As an example, you can take a random sample from a group of people that
is half male and half female, yet by chance everyone you choose is female. If this
happens, it may be a good idea to collect a new sample if you have the time and money.
There are many sampling techniques, though only four will be presented here. The
simplest, and the type that you should strive for, is a simple random sample. This is
where you pick the sample such that every sample of the same size has the same chance of
being chosen. This type of sample is actually hard to collect, since it is sometimes
difficult to obtain a complete list of all individuals. There are many cases where you
cannot conduct a truly random sample. However, you can get as close as you can. Now
suppose you are interested in what type of music people like. It might not make sense to
try to find an answer for everyone in the U.S. You probably don’t like the same music as
your parents. The answers vary so much you probably couldn’t find an answer for everyone
all at once. It might make sense to look at people in different age groups, or people of
different ethnicities. This is called a stratified sample. The issue with this sample type
is that sometimes people subdivide the population too much. It is best to just have one
stratification. Also, a stratified sample has problems similar to those of a simple
random sample. If your population has some order in it, then you could do a systematic
sample. This is popular in manufacturing. The problem is that it is possible to miss a
manufacturing mistake because of how this sample is taken. If you are collecting polling
data based on location, then a cluster sample that divides the population based on
geography would be the easiest sample to conduct. The problem is that if you are looking
for opinions of people, people who live in the same region may have similar opinions. As
you can see, each of the sampling techniques has pluses and minuses.
A simple random sample (SRS) of size n is a sample that is selected from a population
in a way that ensures that every different possible sample of size n has the same chance
of being selected. Also, every individual associated with the population has the same
chance of being selected.
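One way to draw a simple random sample with technology is sketched below in Python.
It assumes you already have a complete list of the individuals. The class roster and
the function name simple_random_sample are made up for illustration.

```python
import random

def simple_random_sample(individuals, n, seed=None):
    """Pick n individuals so that every group of size n is equally likely."""
    rng = random.Random(seed)  # pass a seed only if you want a repeatable draw
    return rng.sample(individuals, n)  # samples without replacement

# Hypothetical class roster of 30 students (made-up names).
students = [f"Student {i}" for i in range(1, 31)]

chosen = simple_random_sample(students, 5, seed=1)
print(chosen)
```

Because the draw is made without replacement, no student can be picked twice, and
every student on the list has the same chance of ending up in the sample.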
Solution:
Give each student in the class a number. Using a random number generator you
could then pick the number of students you want to pick.
Solution:
Choose 5 students from the front row. The people in the last row have no chance
of being selected.
Choose the 5 shortest students. The tallest students have no chance of being
selected.
Stratified sampling is where you break the population into groups called strata, then take
a simple random sample from each stratum.
For example:
If you want to look at musical preference, you could divide the individuals into
age groups and then conduct simple random samples inside each group.
If you want to calculate the average price of textbooks, you could divide the
individuals into groups by major and then conduct simple random samples inside
each group.
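A stratified sample along the lines of the musical preference example might be sketched
in Python as follows. The people, the age groups, and the function name
stratified_sample are all hypothetical and only serve to show the two steps: split the
individuals into strata, then take a simple random sample inside each stratum.

```python
import random
from collections import defaultdict

def stratified_sample(individuals, stratum_of, n_per_stratum, seed=None):
    """Take a simple random sample of size n_per_stratum from every stratum."""
    rng = random.Random(seed)
    # Step 1: break the population into strata.
    strata = defaultdict(list)
    for person in individuals:
        strata[stratum_of(person)].append(person)
    # Step 2: simple random sample inside each stratum.
    return {
        label: rng.sample(members, min(n_per_stratum, len(members)))
        for label, members in strata.items()
    }

# Hypothetical data: (name, age group) pairs, 20 people in each age group.
age_groups = ["under 30", "30 to 59", "60 and over"]
people = [(f"Person {i}", age_groups[i % 3]) for i in range(60)]

by_age = stratified_sample(people, stratum_of=lambda p: p[1], n_per_stratum=4, seed=1)
for label, members in by_age.items():
    print(label, [name for name, _ in members])
```

Every age group appears in the result, but only some of the members of each group
are chosen.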
Systematic sampling is where you randomly choose a starting place then select every kth
individual to measure.
For example:
You select every 5th item on an assembly line
You select every 10th name on the list
You select every 3rd customer that comes into the store.
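The idea of systematic sampling can be sketched in Python as below. The assembly line
of 100 numbered items and the function name systematic_sample are assumptions made for
this illustration.

```python
import random

def systematic_sample(items, k, seed=None):
    """Randomly choose a starting place, then select every kth item."""
    rng = random.Random(seed)
    start = rng.randrange(k)  # random starting place among the first k items
    return items[start::k]    # every kth item from the starting place onward

# Hypothetical assembly line: items numbered 1 through 100.
assembly_line = list(range(1, 101))

line_sample = systematic_sample(assembly_line, k=5, seed=1)
print(line_sample)
```

With 100 items and k = 5 the sample always contains 20 items, each 5 positions after
the previous one. Only the starting place is random.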
Cluster sampling is where you break the population into groups called clusters.
Randomly pick some clusters then poll all individuals in those clusters.
For example:
A large city wants to poll all businesses in the city. They divide the city into
sections (clusters), maybe a square block for each section, and use a random
number generator to pick some of the clusters. Then they poll all businesses in
each chosen cluster.
You want to measure whether a tree in the forest is infected with bark beetles.
Instead of having to walk all over the forest, you divide the forest up into sectors,
and then randomly pick the sectors that you will travel to. Then record whether a
tree is infected or not for every tree in that sector.
Many people confuse stratified sampling and cluster sampling. In stratified sampling you
use all the groups and some of the members in each group. Cluster sampling is the other
way around. It uses some of the groups and all the members in each group.
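To make the contrast concrete, a cluster sample like the city business poll might be
sketched in Python as follows. The city blocks, the business names, and the function
name cluster_sample are hypothetical. Notice that only some clusters are chosen, but
every member of a chosen cluster is included.

```python
import random

def cluster_sample(clusters, num_clusters, seed=None):
    """Randomly pick some clusters, then keep ALL individuals in those clusters."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), num_clusters)  # pick cluster labels
    return {label: list(clusters[label]) for label in chosen}

# Hypothetical city: 6 blocks (clusters) with 4 businesses each.
city_blocks = {
    f"Block {b}": [f"Business {b}-{i}" for i in range(1, 5)]
    for b in "ABCDEF"
}

polled = cluster_sample(city_blocks, num_clusters=2, seed=1)
for block, businesses in polled.items():
    print(block, businesses)
```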
The four sampling techniques that were presented all have advantages and disadvantages.
There is another sampling technique that is sometimes utilized because either the
researcher doesn’t know better, or it is easier to do. This sampling technique is known as
a convenience sample. This sample will not result in a representative sample, and should
be avoided.
A convenience sample is one where the researcher picks individuals to be included who
are easy for the researcher to reach.
An example of a convenience sample is if you want to know the opinion of people about
the criminal justice system, and you stand on a street corner near the county courthouse
and question the first 10 people who walk by. The people who walk by the county
courthouse are most likely involved in some fashion with the criminal justice system,
and their opinions would not represent the opinions of all individuals.
On rare occasions, you do want to collect the entire population, in which case you
conduct a census.
Solution
This is a stratified sample since the patients were separated into different strata
and then random samples were taken from each stratum. The problem with this is
that some types of surgeries may have more chances for complications than
others. Of course, the stratified sample would show you this.
b.) Obtain a list of patients who had surgery at all Banner Health facilities. Number
these patients, and then use a random number table to obtain the sample.
Solution
This is a simple random sample since each patient has the same chance of being chosen.
The problem with this one is that it will take a while to collect the data.
c.) Randomly select some Banner Health facilities from each of the seven states, and
then include all the patients on the surgery lists of those facilities.
Solution
This is a cluster sample since all patients are questioned in each of the selected
hospitals. The problem with this is that you could have by chance selected
hospitals that have no complications.
d.) At the beginning of the year, instruct each Banner Health facility to record any
complications from every 100th surgery.
Solution
This is a systematic sample since they selected every 100th surgery. The problem
with this is that if complications occur at some other regular interval, say every
90th surgery, most of them would not come up in the data.
e.) Instruct each Banner Health facility to record any complications from 20
surgeries this week and send in the results.
Solution
This is a convenience sample since they left it up to the facility how to do it. The
problem with convenience samples is that the person collecting the data will
probably collect data from surgeries that had no complications.
2.) The quality control officer at a manufacturing plant needs to determine what
percentage of items in a batch are defective. The following are different sampling
techniques that could be used by the officer. Classify each as simple random
sample, stratified sample, systematic sample, cluster sample, or convenience
sample.
a.) The officer lists all of the batches in a given month. The number of defective
items is counted in randomly selected batches.
b.) The officer takes the first 10 batches and counts the number of defective items.
c.) The officer groups the batches made in a month by the shift in which they were
made. The number of defective items is counted in randomly selected batches
in each shift.
d.) The officer chooses every 15th batch off the line and counts the number of
defective items in each chosen batch.
e.) The officer divides the batches made in a month by the day on which they were
made. Then certain days are picked and every batch made on those days is checked
to determine the number of defective items.
3.) You wish to determine the GPA of students at your school. Describe what
process you would go through to collect a sample if you use a simple random
sample.
4.) You wish to determine the GPA of students at your school. Describe what
process you would go through to collect a sample if you use a stratified sample.
5.) You wish to determine the GPA of students at your school. Describe what
process you would go through to collect a sample if you use a systematic sample.
6.) You wish to determine the GPA of students at your school. Describe what
process you would go through to collect a sample if you use a cluster sample.
7.) You wish to determine the GPA of students at your school. Describe what
process you would go through to collect a sample if you use a convenience
sample.
Solution:
This is an observational study. You are only asking a question.
Solution:
This is an experiment. The tutor is the treatment.
Many observational studies involve surveys. A survey uses questions to collect the data
and needs to be written so that there is no bias.
No matter which experiment type you conduct, you should also consider the following:
Replication: repetition of an experiment on more than one subject so you can make sure
that the sample is large enough to distinguish true effects from random effects. It is also
the ability for someone else to duplicate the results of the experiment.
Blind study is where the individual does not know which treatment they are getting or if
they are getting the treatment or a placebo.
Double-blind study is where neither the individual nor the researcher knows who is
getting which treatment or who is getting the treatment and who is getting the placebo.
This is important so that there can be no bias created by either the individual or the
researcher.
One last consideration is the time period that you are collecting the data over. There are
three types of time periods that you can consider.
Cross-sectional study: data observed, measured, or collected at one point in time.
Retrospective (or case-control) study: data collected from the past using records,
interviews, and other similar artifacts.
Prospective (or longitudinal or cohort) study: data collected in the future from groups
sharing common factors.
2.) You want to determine if eating more fruits reduces a person’s chance of
developing cancer. You watch people over the years and ask them to tell you how
many servings of fruit they eat each day. You then record who develops cancer.
Is this an observation or an experiment? Why?
3.) A researcher wants to evaluate whether countries with lower fertility rates have a
higher life expectancy. They collect the fertility rates and the life expectancies of
countries around the world. Is this an observation or an experiment? Why?
4.) To evaluate whether a new fertilizer improves plant growth more than the old
fertilizer, the fertilizer developer gives some plants the new fertilizer and others
the old fertilizer. Is this an observation or an experiment? Why?
5.) A researcher designs an experiment to determine if a new drug lowers the blood
pressure of patients with high blood pressure. The patients are randomly selected
to be in the study and they randomly pick which group to be in. Is this a
randomized experiment? Why or why not?
6.) Doctors trying to see if a new stent works longer for kidney patients ask patients
if they are willing to have one of two different stents put in. During the procedure
the doctor decides which stent to put in based on which one is on hand at the time.
Is this a randomized experiment? Why or why not?
7.) A researcher wants to determine if diet and exercise together help people lose
more weight than exercising alone. The researcher solicits volunteers to be part of the
study, randomly picks which volunteers are in the study, and then lets each
volunteer decide if they want to be in the diet and exercise group or the exercise
only group. Is this a randomized experiment? Why or why not?
8.) To determine if lack of exercise reduces flexibility in the knee joint, physical
therapists ask for volunteers to join their trials. They then randomly assign each
volunteer to either the group that exercises or the group that doesn’t
exercise. Is this a randomized experiment? Why or why not?
9.) You collect the weights of tagged fish in a tank. You then put an extra protein
fish food in water for the fish and then measure their weight a month later. Are
the two samples matched pairs or not? Why or why not?
11.) A business manager wants to see if a new procedure improves the processing time
for a task. The manager measures the processing time of the employees then
trains the employees using the new procedure. Then each employee performs the
task again and the processing time is measured again. Are the two samples
matched pairs or not? Why or why not?
12.) The prices of generic items are compared to the prices of the equivalent
name-brand items. Are the two samples matched pairs or not? Why or why not?
13.) A doctor gives some of the patients a new drug for treating acne and the rest of
the patients receive the old drug. Neither the patient nor the doctor knows who is
getting which drug. Is this a blind experiment, double blind experiment, or
neither? Why?
14.) One group is told to exercise and one group is told to not exercise. Is this a blind
experiment, double blind experiment, or neither? Why?
15.) The researchers at a hospital want to see if a new surgery procedure has a better
recovery time than the old procedure. The patients are not told which procedure
was used on them, but the surgeons obviously know. Is this a blind
experiment, double blind experiment, or neither? Why?
16.) To determine if a new medication reduces headache pain, some patients are given
the new medication and others are given a placebo. Neither the researchers nor
the patients know who is taking the real medication and who is taking the placebo.
Is this a blind experiment, double blind experiment, or neither? Why?
17.) A new study is underway to track the eating and exercise patterns of people at
different time periods in the future, and see who is afflicted with cancer later in
Chapter 1: Statistical Basics
19.) To see if there is a link between smoking and bladder cancer, patients with
bladder cancer are asked if they currently smoke or if they smoked in the past. Is
this a cross-sectional study, a retrospective study, or a prospective study? Why?
20.) The Nurses Health Survey was a survey where nurses were asked to record their
eating habits over a period of time, and their general health was recorded. Is this
a cross-sectional study, a retrospective study, or a prospective study? Why?
21.) Consider a question that you would like to answer. Describe how you would
design your own experiment. Make sure you state the question you would like to
answer, then determine if an experiment or an observation is to be done; decide if
the question needs one or two samples; if two samples, whether the samples are
matched; whether this is a randomized experiment; whether there is any blinding;
and whether this is a cross-sectional, retrospective, or prospective study.
One of the first questions you should ask is who funded the study. If the entity that
sponsored the study stands to gain either profits or notoriety from the results, then you
should question the results. It doesn’t mean that the results are wrong, but you should
scrutinize them on your own to make sure they are sound. As an example, if a study says
that genetically modified foods are safe, and the study was funded by a company that
sells genetically modified food, then one may question the validity of the study. Since
the company funded the study and its profits rely on people buying its food, there may
be bias.
An experiment could have lurking or confounding variables when you cannot rule out
the possibility that the observed effect is due to some other variable rather than the factor
being studied. An example of this is when you give fertilizer to some plants and no
fertilizer to others, but the unfertilized plants are also placed in a location that doesn’t
receive direct sunlight. You won’t know if the plants that received the fertilizer grew
taller because of the fertilizer or the sunlight. Make sure you design experiments to
eliminate the effects of confounding variables by controlling all the factors that you can.
Overgeneralization is where you do a study on one group and then try to say that it will
happen on all groups. An example is doing cancer treatments on rats. Just because the
treatment works on rats does not mean it will work on humans. Another example is that
until recently most FDA medication testing had been done on white males of a particular
age. There is no way to know how the medication affects other genders, ethnic groups,
age groups, and races. The new FDA guidelines stress using individuals from different
groups.
Cause and effect is where people decide that one variable causes the other just because
the variables are related or correlated. Unless the study was done as an experiment where
a variable was controlled, you cannot say that one variable caused the other. Most likely
there is another variable that caused both. As an example, there is a relationship between
number of drownings at the beach and ice cream sales. This does not mean that ice
cream sales increasing causes people to drown. Most likely the cause for both increasing
is the heat.
Sampling error: This is the difference between the sample results and the true
population results. This is unavoidable, and results in the fact that samples are different
from each other. As an example, if you take a sample of 5 people’s height in your class,
you will get 5 numbers. If you take another sample of 5 people’s heights in your class,
you will likely get 5 different numbers.
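Sampling error can be seen directly with a short R sketch. The class heights below are simulated, hypothetical values (they are not data from this book), and the class size of 30 is an assumption made only for illustration.

```r
# Hypothetical class of 30 students; heights (in inches) are simulated, not real data.
set.seed(1)
class_heights <- round(rnorm(30, mean = 67, sd = 3), 1)

# Draw two separate samples of 5 students from the same class.
sample1 <- sample(class_heights, size = 5)
sample2 <- sample(class_heights, size = 5)

# The gap between a sample mean and the class (population) mean is the sampling error.
population_mean <- mean(class_heights)
error1 <- mean(sample1) - population_mean
error2 <- mean(sample2) - population_mean
print(sample1)
print(sample2)
print(c(error1, error2))
```

The two samples will usually contain different values, so their means will usually differ from each other and from the class mean.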
Nonsampling error: This is where the sample is collected poorly either through a biased
sample or through error in measurements. Care should be taken to avoid this error.
Lastly, care should be taken to distinguish between statistical significance and
practical significance. This is a major issue in statistics.
Something could be statistically significant, which means that a statistical test shows
there is evidence for what you are trying to prove. However, in practice it doesn’t
mean much or there are other issues to consider. As an example, suppose you find that a
new drug for high blood pressure does reduce the blood pressure of patients. When you
look at the improvement it actually doesn’t amount to a large difference. Even though
statistically there is a change, it may not be worth marketing the product because it really
isn’t that big of a change. Another consideration is that you find the blood pressure
medication does improve a person’s blood pressure, but it has serious side effects or it
costs a great deal for a prescription. In this case, it wouldn't be practical to use it. In both
cases, the study is shown to be statistically significant, but practically you don’t want to
use the medication. The main thing to remember in a statistical study is that the statistics
is only part of the process. You also want to make sure that there is practical significance
too.
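The gap between statistical and practical significance can be sketched in R. Everything in this block is an assumption made for illustration: the blood pressure reductions are simulated, the sample size of 10,000 patients is hypothetical, and the one-sample t-test is just one test that could be used.

```r
# Simulated (hypothetical) reductions in blood pressure, in mmHg, for 10,000 patients.
# The true average reduction is set to only 1 mmHg, with a standard deviation of 10.
set.seed(42)
reduction <- rnorm(10000, mean = 1, sd = 10)

# Test whether the average reduction is greater than zero.
result <- t.test(reduction, mu = 0, alternative = "greater")
p_value <- result$p.value
avg_reduction <- mean(reduction)

print(p_value)        # a very small p-value: statistically significant
print(avg_reduction)  # about 1 mmHg: likely too small to matter in practice
```

With a sample this large, even a tiny average change gives a very small p-value, so the size of the change still has to be judged on practical grounds.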
Surveys have their own areas of bias that can occur. A few of the issues with surveys are
in the wording of the questions, the ordering of the questions, the manner in which the
survey is conducted, and the response rate of the survey.
The wording of the questions can cause hidden bias, which is where the questions are
asked in a way that makes a person respond a certain way. An example is that a poll was
done where people were asked if they believe that there should be an amendment to the
constitution protecting a woman’s right to choose. About 60% of all people questioned
said yes. Another poll was done where people were asked if they believe that there
should be an amendment to the constitution protecting the life of an unborn child. About
60% of all people questioned said yes. These two questions deal with the same issue,
yet the polls gave opposite results; how the question was asked affected the outcome.
The ordering of the questions can also cause hidden bias. An example of this is if you
were asked if there should be a fine for texting while driving, but preceding that
question is a question asking if you text while driving. By asking a person if they
actually partake in the activity, that person now personalizes the question and that might
affect how they answer the next question about creating the fine.
Non-response is where you send out a survey but not everyone returns the survey. You
can calculate the response rate by dividing the number of returns by the number of
surveys sent. Most response rates are around 30-50%. A response rate less than 30% is
very poor and the results of the survey are not valid. To reduce non-response, it is better
to conduct the surveys in person, though these are very expensive. Phones are the next
best way to conduct surveys, emails can be effective, and physical mailings are the least
desirable way to conduct surveys.
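The response rate calculation described above takes one line of R. The counts here are hypothetical, chosen only to illustrate the arithmetic.

```r
# Hypothetical counts, for illustration only.
surveys_sent <- 400
surveys_returned <- 120

# Response rate = number of returns divided by number of surveys sent.
response_rate <- surveys_returned / surveys_sent
print(response_rate)        # 0.3
print(100 * response_rate)  # 30 (percent)
```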
Voluntary response is where people are asked to respond via phone, email or online.
The problem with these is that only people who really care about the topic are likely to
call or email. These surveys are not scientific and the results from these surveys are not
valid. Note: all studies involve volunteers. The difference between a voluntary response
survey and a scientific study is that in a scientific study the researchers ask the
individuals to be involved, while in a voluntary response survey the individuals become
involved of their own choosing.
Solution:
Since there were different teachers, you do not know if the better test scores are
because of the teacher or the computer-based homework. A better design would
be to have the same teacher teach both classes. The control group would utilize
traditional paper and pencil homework and the treatment group would utilize the
computer-based homework. Both classes would have the same teacher, and the
students would be split between the two classes randomly. The only difference
between the two groups should be the homework method. Of course, there is still
variability between the students, but utilizing the same teacher will reduce any
other confounding variables.
Solution:
Since this was a study where the use of cinnamon was controlled, and all other
factors were kept constant from person to person, then any changes in glucose
levels can be attributed to the use of cinnamon.
b.) There is a link between spray on tanning products and lung cancer. Does that
mean that spray on tanning products cause lung cancer?
Solution:
Since there is only a link, and not a study controlling the use of the tanning
spray, then you cannot say that increased use causes lung cancer. You can say
that there is a link, and that there could be a cause, but you cannot say for sure
that the spray causes the cancer.
Solution:
No. Just because a drug is safe to use on one species doesn’t mean it is safe to
use for all species. In fact, ibuprofen is toxic to cats.
b.) Aspirin has been used for years to bring down fevers in humans. Originally it
was tested on white males between the ages of 25 and 40 and found to be safe.
Is it safe to give to everyone?
Solution:
No. Just because one age group can use it doesn’t mean it is safe to use for all
age groups. In fact, there has been a link between giving a child under the age
of 19 aspirin when they have a fever and Reye’s syndrome.
2.) Suppose a car dealership offers customers either a low interest rate with a longer
payoff period or a high interest rate with a shorter payoff period, and
most customers choose the low interest rate and longer payoff period. Does that
mean that most customers want a lower interest rate? Explain.
3.) Over the years it has been said that coffee is bad for you. When looking at the
studies that have shown that coffee is linked to poor health, you will see that
people who tend to drink coffee don’t sleep much, tend to smoke, don’t eat
healthy, and tend to not exercise. Can you say that the coffee is the reason for the
poor health or is there a lurking variable that is the actual cause? Explain.
4.) When researchers were trying to figure out what caused polio, they saw a
connection between ice cream sales and polio. As ice cream sales increased so
did the incidence of polio. Does that mean that eating ice cream causes polio?
Explain your answer.
5.) There is a positive correlation between discussions of gun control, which
usually occur after a mass shooting, and the sale of guns. Does that mean that the
discussion of gun control increases the likelihood that people will buy more guns?
Explain.
6.) There is a study that shows that people who are obese have a vitamin D
deficiency. Does that mean that obesity causes a deficiency in vitamin D?
Explain.
7.) A study was conducted that shows that perfluorooctanoic acid (PFOA) (a chemical
used in making Teflon) is associated with an increased risk of tumors in lab mice. Does
that mean that PFOA carries an increased risk of tumors in humans? Explain.
8.) Suppose a telephone poll is conducted by contacting U.S. citizens via landlines
about their view of gay marriage. Suppose over 50% of those called do not
support gay marriage. Does that mean that you can say over 50% of all people in
the U.S. do not support gay marriage? Explain.
10.) You are testing a new drug for weight loss. You find that the drug does in fact
show a statistically significant weight loss. Do you market the new drug? Why or why not?
11.) There was an online poll conducted about whether the mayor of Auckland, New
Zealand, should resign due to an affair. The majority of people participating said
he should. Should the mayor resign due to the results of this poll? Explain.
12.) An online poll showed that the majority of Americans believe that the government
covered up events of 9/11. Does that really mean that most Americans believe
this? Explain.
13.) A survey was conducted at a college asking all employees if they were satisfied
with the level of security provided by the security department. Discuss how the
results of this question could be biased.
14.) An employee survey says, “Employees at this institution are very satisfied with
working here. Please rate your satisfaction with the institution.” Discuss how
this question could create bias.
15.) A survey has a question that says, “Most people are afraid that they will lose their
house due to economic collapse. Choose what you think is the biggest issue
facing the nation today. a) Economic collapse, b) Foreign policy issues, c)
Environmental concerns.” Discuss how this question could create bias.
16.) A survey says, “Please rate the career of Roberto Clemente, one of the best right
field baseball players in the world.” Discuss how this question could create bias.
Chapter 2: Graphical Descriptions of Data
This chapter will focus mostly on using the graphs to understand aspects of the data, and
not as much on how to create the graphs. There is technology that will create most of the
graphs, though it is important for you to understand the basics of how to create them.
Pie charts and bar graphs are the most common ways of displaying qualitative data. A
spreadsheet program like Excel can make both of them. The first step for either graph is
to make a frequency or relative frequency table. A frequency table is a summary of
the data with counts of how often a data value (or category) occurs.
Ford, Chevy, Honda, Toyota, Toyota, Nissan, Kia, Nissan, Chevy, Toyota,
Honda, Chevy, Toyota, Nissan, Ford, Toyota, Nissan, Mercedes, Chevy,
Ford, Nissan, Toyota, Nissan, Ford, Chevy, Toyota, Nissan, Honda,
Porsche, Hyundai, Chevy, Chevy, Honda, Toyota, Chevy, Ford, Nissan,
Toyota, Chevy, Honda, Chevy, Saturn, Toyota, Chevy, Chevy, Nissan,
Honda, Toyota, Toyota, Nissan
A listing of data is too hard to look at and analyze, so you need to summarize it.
First you need to decide the categories. In this case it is relatively easy; just use
the car type. However, several car types appear only once in the list. In
that case it is easier to make a category called other for the ones with low values.
Now just count how many of each type of car there are. For example, there are 5
Fords, 12 Chevys, and 6 Hondas. This can be put in a frequency distribution:
Category    Frequency
Ford        5
Chevy       12
Honda       6
Toyota      12
Nissan      10
Other       5
Total       50
The total of the frequency column should be the number of observations in the
data.
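Since raw counts are harder to interpret and compare, it is better to create a third
column that gives the relative frequency of each category. This is just the
frequency divided by the total. As an example, for the Ford category:
relative frequency = 5/50 = 0.10
This can be written as a decimal, fraction, or percent. You now have a relative
frequency distribution:
Category    Frequency    Relative Frequency
Ford        5            0.10
Chevy       12           0.24
Honda       6            0.12
Toyota      12           0.24
Nissan      10           0.20
Other       5            0.10
Total       50           1.00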
The relative frequency column should add up to 1.00. It might be off a little due
to rounding errors.
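The counting and dividing can also be done in R. The following is a minimal sketch that uses the 50 car types listed in this example; the variable names (car_types, rare, grouped, freq, rel_freq) are chosen here for illustration, and grouping every type that appears only once into "Other" is one reasonable reading of "low values".

```r
# The 50 car types from the example, typed in as a character vector.
car_types <- c(
  "Ford", "Chevy", "Honda", "Toyota", "Toyota", "Nissan", "Kia", "Nissan", "Chevy", "Toyota",
  "Honda", "Chevy", "Toyota", "Nissan", "Ford", "Toyota", "Nissan", "Mercedes", "Chevy",
  "Ford", "Nissan", "Toyota", "Nissan", "Ford", "Chevy", "Toyota", "Nissan", "Honda",
  "Porsche", "Hyundai", "Chevy", "Chevy", "Honda", "Toyota", "Chevy", "Ford", "Nissan",
  "Toyota", "Chevy", "Honda", "Chevy", "Saturn", "Toyota", "Chevy", "Chevy", "Nissan",
  "Honda", "Toyota", "Toyota", "Nissan"
)

# Put any car type that appears only once into a category called "Other".
counts <- table(car_types)
rare <- names(counts[counts == 1])
grouped <- ifelse(car_types %in% rare, "Other", car_types)

# Frequency table, then relative frequency = frequency / total.
freq <- table(grouped)
rel_freq <- prop.table(freq)
print(freq)
print(round(rel_freq, 2))
```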
Now that you have the frequency and relative frequency table, it would be good to
display this data using a graph. There are several different types of graphs that can be
used: bar charts, pie charts, and Pareto charts.
Bar graphs or charts consist of the frequencies on one axis and the categories on the
other axis. Then you draw rectangles for each category with a height (if frequency is on
the vertical axis) or length (if frequency is on the horizontal axis) that is equal to the
frequency. All of the rectangles should be the same width, and there should be
equal-width gaps between the bars.
Put the frequency on the vertical axis and the category on the horizontal axis.
Then just draw a box above each category whose height is the frequency.
All graphs are drawn using R. The command in R to create a bar graph is:
variable<-c(type in percentages or frequencies for each class with commas
in between values)
barplot(variable, names.arg=c("type in name of 1st category", "type in
name of 2nd category",…,"type in name of last category"),
xlab="type in label for x-axis", ylab="type in label for y-axis",
ylim=c(0,number above maximum y value), main="type in title",
col="type in a color") – creates a bar graph of the data in a color
if you want.
For this example the command would be:
car<-c(5, 12, 6, 12, 10, 5)
barplot(car, names.arg=c("Ford", "Chevy", "Honda", "Toyota", "Nissan",
"Other"), xlab="Type of Car", ylab="Frequency", ylim=c(0,12),
main="Type of Car Driven by College Students", col="blue")
Notice from the graph, you can see that Toyota and Chevy are the most popular
cars, with Nissan not far behind. Of the car types you can identify from the graph, Ford
is the least popular, though each car type in the other category is less popular than
Ford.
You can also draw a bar graph using relative frequency on the vertical axis. This is
useful when you want to compare two samples with different sample sizes. The relative
frequency graph and the frequency graph should look the same, except for the scaling on
the frequency axis.
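A relative frequency bar graph for the car data can be sketched in R by dividing the frequencies by their total before calling barplot. The name rel_freq and the axis limit of 0.25 are choices made here for illustration.

```r
# Frequencies for Ford, Chevy, Honda, Toyota, Nissan, and Other from the example.
car <- c(5, 12, 6, 12, 10, 5)

# Relative frequency = frequency / total number of observations.
rel_freq <- car / sum(car)

# Same bar graph as before, but with relative frequency on the vertical axis.
barplot(rel_freq,
        names.arg = c("Ford", "Chevy", "Honda", "Toyota", "Nissan", "Other"),
        xlab = "Type of Car", ylab = "Relative Frequency", ylim = c(0, 0.25),
        main = "Type of Car Driven by College Students", col = "blue")
```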
Graph #2.1.2: Relative Frequency Bar Graph for Type of Car Data
Another type of graph for qualitative data is a pie chart. A pie chart is where you have a
circle and you divide the circle into pie-shaped pieces that are proportional to the size
of the relative frequency. There are 360 degrees in a full circle. Relative frequency is
just the percentage as a decimal. To find the angle for a category, multiply the
relative frequency by 360 degrees. Remember that 180 degrees is half a circle and 90
degrees is a quarter of a circle.
Then you multiply each relative frequency by 360° to obtain the angle
measure for each category.
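For the car data, the angle calculation could be sketched in R as follows. The variable names are chosen here for illustration.

```r
# Frequencies from the car example, labeled by category.
car <- c(Ford = 5, Chevy = 12, Honda = 6, Toyota = 12, Nissan = 10, Other = 5)

# Relative frequency for each category, then the angle in degrees.
rel_freq <- car / sum(car)
angles <- rel_freq * 360
print(angles)

# The angles should add up to a full circle.
print(sum(angles))
```

Ford, for example, has a relative frequency of 0.10, so its slice spans 0.10 times 360, or 36 degrees.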
Now draw the pie chart using a compass, protractor, and straight edge.
Technology is preferred. If you use technology, there is no need for the
relative frequencies or the angles.
You can use R to graph the pie chart. In R, the commands would be:
pie(variable,labels=c("type in name of 1st category", "type in name of 2nd
category",…,"type in name of last category"),main="type in title",
col=rainbow(number of categories)) – creates a pie chart with a title and
rainbow of colors for each category.
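Filled in for the car data, the pie chart command might look like the sketch below. The title text reuses the bar graph title from this example, and the name categories is introduced here for illustration.

```r
# Frequencies and category names from the car example.
car <- c(5, 12, 6, 12, 10, 5)
categories <- c("Ford", "Chevy", "Honda", "Toyota", "Nissan", "Other")

# R converts the frequencies to slice angles itself, so no angles are needed.
pie(car, labels = categories,
    main = "Type of Car Driven by College Students",
    col = rainbow(length(car)))
```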
As you can see from the graph, Toyota and Chevy are more popular, while the
cars in the other category are liked the least. Of the cars that you can determine
from the graph, Ford is liked less than the others.
Pie charts are useful for comparing sizes of categories. Bar charts show similar
information. Which one you use is a matter of personal preference
and of what information you are trying to address. However, pie charts are best when
you only have a few categories and the data can be expressed as a percentage. The data
doesn’t have to be percentages to draw the pie chart, but if a data value can fit into
multiple categories, you cannot use a pie chart. As an example, if you are asking people
about what their favorite national park is, and you say to pick the top three choices, then
the total number of answers can add up to more than 100% of the people involved. So
you cannot use a pie chart to display the favorite national park.
A third type of qualitative data graph is a Pareto chart, which is just a bar chart with the
bars sorted with the highest frequencies on the left. Here is the Pareto chart for the data
in Example #2.1.1.
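One way to draw a Pareto chart in R is to sort the frequencies from highest to lowest before calling barplot. This is a minimal sketch that uses only base R; the name pareto is chosen here for illustration.

```r
# Frequencies from the car example, labeled by category.
car <- c(Ford = 5, Chevy = 12, Honda = 6, Toyota = 12, Nissan = 10, Other = 5)

# Reorder so the highest frequencies come first (they are drawn on the left).
pareto <- car[order(-car)]
print(pareto)

# barplot uses the names of the vector as the bar labels.
barplot(pareto, xlab = "Type of Car", ylab = "Frequency", ylim = c(0, 12),
        main = "Type of Car Driven by College Students", col = "blue")
```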
The advantage of Pareto charts is that you can see at a glance the ordering from the most
popular answer to the least popular. This is especially useful in business applications,
where you want to know what services your customers like the most, what processes
result in more injuries, which issues employees find more important, and other types of
questions like these.
There are many other types of graphs that can be used on qualitative data. There are
spreadsheet software packages that will create most of them, and it is better to look at
them to see what can be done. It depends on your data as to which may be useful. The
next example illustrates one of these types known as a multiple bar graph.
Solution:
It appears that Dylan spends more time on balance exercises than on any other
exercises on any given day. He seems to spend less time on strength exercises on
a given day. There are several days when the amount of exercise in the different
categories is almost equal.
The usefulness of a multiple bar graph is the ability to compare several different
categories over another variable. In example #2.1.4 the variable would be time. This
allows a person to interpret the data with a little more ease.
2.) To analyze how Arizona workers ages 16 or older travel to work the percentage of
workers using carpool, private vehicle (alone), and public transportation was
collected. Create a bar chart and pie chart of the data in table #2.1.5. State any
findings you can see from the graphs.
3.) The number of deaths in the US due to carbon monoxide (CO) poisoning from
generators from the years 1999 to 2011 are in table #2.1.6 (Hinatov, 2012).
Create a bar chart and pie chart of this data. State any findings you see from the
graphs.
4.) In Connecticut, households use gas, fuel oil, or electricity as a heating source.
Table #2.1.7 shows the percentage of households that use one of these as their
principal heating source ("Electricity usage," 2013), ("Fuel oil usage," 2013),
("Gas usage," 2013). Create a bar chart and pie chart of this data. State any
findings you see from the graphs.
5.) Eyeglassomatic manufactures eyeglasses for different retailers. They test to see
how many defective lenses they made during the time period of January 1 to
March 31. Table #2.1.8 gives the defect and the number of defects. Create a
Pareto chart of the data and then describe what this tells you about what causes
the most defects.
6.) People in Bangladesh were asked to state what type of birth control method they
use. The percentages are given in table #2.1.9 ("Contraceptive use," 2013).
Create a Pareto chart of the data and then state any findings you can from the
graph.
7.) The percentages of people who use certain contraceptives in Central American
countries are displayed in graph #2.1.6 ("Contraceptive use," 2013). State any
findings you can from the graph.
This leads to the second difference from bar graphs. In a bar graph, the categories that
you made in the frequency table were determined by you. In quantitative data, the
categories are numerical categories, and the numbers are determined by how many
categories (or what are called classes) you choose. If two people use the same number
of categories, then they will have the same frequency distribution. In qualitative
data, by contrast, there can be many different categories depending on the point of view
of the author.
The third difference is that the categories touch with quantitative data, and there will be
no gaps in the graph. The reason that bar graphs have gaps is to show that the categories
do not continue on, like they do in quantitative data. Since the graph for quantitative data
is different from qualitative data, it is given a new name. The name of the graph is a
histogram. To create a histogram, you must first create the frequency distribution. The
idea of a frequency distribution is to take the interval that the data spans and divide it up
into equal subintervals called classes.
boundaries, subtract 0.5 from the lower class limit and add 0.5 to the upper
class limit.
6. Sometimes it is useful to find the class midpoint. The process is
class midpoint = (lower limit + upper limit) / 2
7. To figure out the number of data points that fall in each class, go through each
data value and see which class boundaries it is between. Utilizing tally marks
may be helpful in counting the data values. The frequency for a class is the
number of data values that fall in the class.
Note: the above description is for data values that are whole numbers. If your data values
have decimal places, then your class width should be rounded up to the nearest value with
the same number of decimal places as the original data. In addition, your class
boundaries should have one more decimal place than the original data. As an example, if
your data have one decimal place, then the class width would have one decimal place,
and the class boundaries are formed by adding and subtracting 0.05 from each class limit.
Solution:
1) Find the range:
Round up to 315.
Always round up to the next integer even if the width is already an integer.
Start at the smallest value. This is the lower class limit for the first class.
Add the width to get the lower limit of the next class. Keep adding the
width to get all the lower limits.
The upper limit is one less than the next lower limit: so for the first class
the upper class limit would be .
When you have all 7 classes, make sure the last number, in this case the 2550, is
at least as large as the largest value in the data. If not, you made a mistake
somewhere.
Subtract 0.5 from the lower class limit to get the class boundaries. Add
0.5 to the upper class limit for the last class’s boundary.
Every value in the data should fall into exactly one of the classes. No data
values should fall right on the boundary of two classes.
Go through the data and put a tally mark in the appropriate class for each
piece of data by looking to see which class boundaries the data value is
between. Fill in the frequency by changing each of the tallies into a
number.
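The steps for building classes can be sketched in R. The data below are hypothetical whole numbers (they are not the data from this example), and the choice of 4 classes is made only to keep the output short. The sketch uses the rule stated above of always rounding the width up to the next integer.

```r
# Hypothetical whole-number data, for illustration only.
x <- c(12, 15, 21, 22, 27, 30, 34, 38, 41, 47)
k <- 4  # number of classes

# Range and class width; round up to the next integer even if already an integer.
data_range <- max(x) - min(x)
width <- floor(data_range / k) + 1

# Lower limits start at the smallest value; each upper limit is one less
# than the next lower limit.
lower <- min(x) + (0:(k - 1)) * width
upper <- lower + width - 1

# Class boundaries: subtract 0.5 from each lower limit and add 0.5 to the
# last upper limit.
boundaries <- c(lower - 0.5, upper[k] + 0.5)

# Tally the data values that fall between each pair of boundaries.
freq <- as.vector(table(cut(x, breaks = boundaries)))
print(data.frame(lower, upper, freq))
```

Here the last upper limit (47) is at least as large as the largest data value, and the frequencies add up to the number of data values, which are the two checks described above.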