Chapter 1
Learning Objectives
• Definition of Statistics
• Real-world Applications of Statistics
• Statistical Terminologies
• Population versus Sample Data
• Descriptive Statistics and Inferential Statistics
• Data Collection Methods
What is statistics? Statistics is used in many aspects of life and has a wide spectrum of applications. It is a significant science
and has a meticulous methodology that is used in almost all applied sciences such as medicine, psychology, economics, and
actuarial science.
Statistics is the science of data. This involves collecting, classifying, summarizing, organizing, analyzing, presenting, and
interpreting data.
Often the data are selected from some larger set of data, the population, whose characteristics we wish to estimate. We call
this selection process sampling.
For example, you might collect the ages of a sample of customers who shop for a particular product online to estimate the
average age of all customers who shop online for the product. Then you could use your estimate to target the Web site’s
advertisements to the appropriate age group. Notice that statistics involves two different processes: (1) describing sets of data
and (2) drawing conclusions (making estimates, decisions, predictions, etc.) about the sets of data on the basis of sampling.
So, the applications of statistics can be divided into two broad areas: descriptive statistics and inferential statistics.
Descriptive statistics utilizes numerical and graphical methods to look for patterns in a data set, to summarize the
information revealed in a data set, and to present that information in a convenient form.
CHAPTER 1. STATISTICS, DATA, AND STATISTICAL THINKING 3
Inferential statistics utilizes sample data to make estimates, decisions, predictions, or other generalizations about a larger
set of data.
Study 1.1 “Best-Selling Girl Scout Cookies” (Source: girlscouts.org.) Since 1917, the Girl Scouts of America have been
selling boxes of cookies. Currently, there are 12 varieties for sale: Thin Mints, Samoas, Lemonades, Tagalongs, Do-si-dos,
Trefoils, Savannah Smiles, Thanks-A-Lot, Dulce de Leche, Cranberry Citrus Crisps, Chocolate Chip, and Thank U Berry
Much. Each of the approximately 150 million boxes of Girl Scout cookies sold each year is classified by variety. The results
are summarized in Figure 1.1. From the graph, you can clearly see that the best-selling variety is Thin Mints (25%), followed
by Samoas (19%) and Tagalongs (13%). Since the figure describes the various categories of boxes of Girl Scout cookies sold,
it is an example of descriptive statistics.
Figure 1.1: MINITAB graph of best-selling Girl Scout cookies (Based on girlscouts.org, 2011–12 sales.)
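A descriptive summary like the one in Figure 1.1 comes down to counting how often each category occurs and converting the counts to percentages. The following is a minimal Python sketch of that idea. The list of boxes is made up for illustration (it is not the girlscouts.org sales data), and the names boxes, counts, and percentages are hypothetical.

```python
from collections import Counter

# Hypothetical data: the variety recorded for each of 20 boxes sold.
# These values are made up for illustration; they are not girlscouts.org figures.
boxes = (
    ["Thin Mints"] * 5
    + ["Samoas"] * 4
    + ["Tagalongs"] * 3
    + ["Do-si-dos"] * 2
    + ["Trefoils"] * 2
    + ["Other"] * 4
)

counts = Counter(boxes)          # how many boxes fall in each category
total = sum(counts.values())     # number of boxes observed

# Relative frequency (percentage) of each variety, largest count first.
percentages = {
    variety: 100 * count / total
    for variety, count in counts.most_common()
}

for variety, pct in percentages.items():
    print(f"{variety:<12}{pct:5.1f}%")
```

The printed table plays the same role as the bar graph: it describes the data set at hand without generalizing beyond it.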
An experimental (or observational) unit is an object (e.g., person, thing, transaction, or event) about which we collect
data.
A population is a set of all units (usually people, objects, transactions, or events) that we are interested in studying.
For example, populations may include all Canadians who were aged 65 or older, all UFV students, and all MacBook Air
models. Notice also that each set includes all the units in the population.
In studying a population, we focus on one or more characteristics or properties of the units in the population. We call such
characteristics variables. For example, we may be interested in the variables age, gender, and number of children.
A variable is a characteristic or property of an individual experimental (or observational) unit in the population.
The name variable is derived from the fact that any particular characteristic may vary among the units in a population.
Often, numerical representations for variables are not readily available, so measurement plays an important supporting role
in statistical studies.
Measurement is the process we use to assign numbers to variables of individual population units.
Data are values obtained after the process of measurement. A collection of data is called a data set. When we measure a
variable for every unit of a population, it is called a census of the population. If the population you wish to study is large,
conducting a census would be prohibitively time consuming or costly. A reasonable alternative would be to select and study
a sample, that is, a subset of the units in the population.
For example, instead of polling all 145 million registered voters in the United States during a presidential election year,
a pollster might select and question a sample of just 1,500 voters. (See Figure 1.2.) If he is interested in the variable
“presidential preference,” he would record (measure) the preference of each voter sampled.
Figure 1.2: A sample of voter registration cards for all registered voters
A statistical inference is an estimate, prediction, or some other generalization about a population based on information
contained in a sample.
Example 1.1 According to Variety (Jan. 10, 2014), the average age of Broadway ticketbuyers is 42.5 years. Suppose a
Broadway theatre executive hypothesizes that the average age of ticketbuyers to her theatre’s plays is less than 42.5 years.
To test her hypothesis, she samples 200 ticketbuyers to her theatre’s plays and determines the age of each.
After making the inference, we also need to know its reliability, that is, how good the inference is. The only way we
can be certain that an inference about a population is correct is to include the entire population in our sample. However,
because of resource constraints (i.e., insufficient time or money), we usually can’t work with whole populations, so we base
our inferences on just a portion of the population (a sample). Thus, we introduce an element of uncertainty into our
inferences. Consequently, whenever possible, it is important to determine and report the reliability of each inference made.
A measure of reliability is a statement (usually quantitative) about the degree of uncertainty associated with a statistical
inference.
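The uncertainty that sampling introduces can be seen in a small simulation. The sketch below assumes a made-up population of 10,000 ticketbuyer ages (not real Broadway data) and compares the census mean with the means of several samples of 200. All names and values in it are hypothetical.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is repeatable

# Hypothetical population: ages of 10,000 ticketbuyers (made-up values).
population = [random.randint(18, 80) for _ in range(10_000)]

# A census measures every unit, so this mean carries no sampling uncertainty.
census_mean = statistics.mean(population)

# Five independent samples of 200 units each, drawn without replacement.
sample_means = [
    statistics.mean(random.sample(population, 200))
    for _ in range(5)
]

print(f"census mean: {census_mean:.2f}")
for i, m in enumerate(sample_means, start=1):
    print(f"sample {i} mean: {m:.2f} (off by {m - census_mean:+.2f})")
```

Each sample gives a slightly different estimate of the same population mean. A measure of reliability is a statement about how large such discrepancies are likely to be.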
2. One or more variables (characteristics of the population or sample units) that are to be investigated
2. One or more variables (characteristics of the population units) that are to be investigated
4. The inference about the population based on information contained in the sample
All data (and hence the variables we measure) can be classified as one of two general types: quantitative data and
qualitative data.
Quantitative data are measurements that are recorded on a naturally occurring numerical scale.
Often, we assign arbitrary numerical values to qualitative data for ease of computer entry and analysis. But these assigned
numerical values are simply codes: They cannot be meaningfully added, subtracted, multiplied, or divided. For example, we
might code Democrat = 1, Republican = 2, and Independent = 3. Similarly, a taste tester might rank the barbecue sauces
from 1 (best) to 4 (worst). These are simply arbitrarily selected numerical codes for the categories and have no utility beyond
that.
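The point about codes can be checked directly. The sketch below uses the party codes from the text and a made-up list of nine coded responses (hypothetical data). Counting categories and finding the most frequent one are meaningful operations, whereas averaging the codes is not.

```python
from collections import Counter
import statistics

# Arbitrary numerical codes for a qualitative variable, as in the text.
codes = {1: "Democrat", 2: "Republican", 3: "Independent"}

# Hypothetical coded responses from nine voters (made-up data).
responses = [1, 2, 2, 3, 1, 2, 3, 2, 3]

# Meaningful: count how many responses fall in each category.
counts = Counter(codes[r] for r in responses)
print(counts)

# Meaningful: the most frequent category (the mode).
most_common_party = codes[statistics.mode(responses)]
print(most_common_party)

# Not meaningful: the arithmetic mean of the codes. It runs, but the result
# corresponds to no category and would change if the codes were relabeled.
print(statistics.mean(responses))
```

The software computes a mean without complaint, so it is up to the analyst to recognize that the variable is qualitative and the number has no interpretation.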
Qualitative (or categorical) data are measurements that cannot be measured on a natural numerical scale; they can only
be classified into one of a group of categories.
A designed experiment is a data collection method where the researcher exerts full control over the characteristics of
the experimental units sampled. These experiments typically involve a group of experimental units that are assigned the
treatment and an untreated (or control) group.
An observational study is a data collection method where the experimental units sampled are observed in their natural
setting. No attempt is made to control the characteristics of the experimental units sampled. (Examples include opinion
polls and surveys.)
Regardless of which data collection method is employed, it is likely that the data will be a sample from some population.
A representative sample exhibits characteristics typical of those possessed by the target population.
The most common way to satisfy the representative sample requirement is to select a random sample. A simple random
sample ensures that every subset of fixed size in the population has the same chance of being included in the sample.
A simple random sample of n experimental units is a sample selected from the population in such a way that every
different sample of size n has an equal chance of selection.
The procedure for selecting a simple random sample typically relies on a random number generator. Random number
generators are available in table form online, and they are built into most statistical software packages. In addition to simple
random samples, there are more complex random sampling designs that can be employed. These include (but are not limited
to) stratified random sampling, cluster sampling, systematic sampling, and randomized response sampling. No matter what
type of sampling design you employ to collect the data for your study, be careful to avoid selection bias.
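As a minimal sketch of drawing a simple random sample with a software random number generator, the Python example below assumes a hypothetical population of 1,000 units labeled by ID number. The population, the sample size, and the seed are illustrative assumptions, not values from the text.

```python
import random

# Hypothetical population: 1,000 experimental units identified by ID number.
population = list(range(1, 1001))
n = 10  # desired sample size (assumed for illustration)

rng = random.Random(2024)  # seeded generator so the draw can be reproduced

# Random.sample draws n distinct units without replacement, treating every
# subset of size n as equally likely, which is the defining property of a
# simple random sample.
sample = rng.sample(population, n)
print(sorted(sample))
```

Sampling from a complete list of units is what gives every unit a chance of selection. If the list itself omits part of the population, the draw is random but the sample can still suffer from selection bias.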
Selection bias results when a subset of experimental units in the population has little or no chance of being selected for
the sample.
Nonresponse bias is a type of selection bias that results when data on all experimental units in a sample are not obtained.
Finally, even if your sample is representative of the population, the data collected may suffer from measurement error.
Measurement error refers to inaccuracies in the values of the data collected. In surveys, the error may be due to ambiguous
or leading questions.
The growth in data collection associated with scientific phenomena, business operations, and government activities (quality
control, statistical auditing, forecasting, etc.) has been remarkable in the past several decades. Consequently, each of us has to
develop a discerning sense—an ability to use rational thought to interpret and understand the meaning of data. Quantitative
literacy can help you make intelligent decisions, inferences, and generalizations; that is, it helps you think critically using
statistics.
Statistical thinking involves applying rational thought and the science of statistics to critically assess data and inferences.
Exercises
1- Extinct birds. Biologists at the University of California (Riverside) are studying the patterns of extinction in the New
Zealand bird population. (Evolutionary Ecology Research, July 2003.) At the time of the Maori colonization of New Zealand
(prior to European contact), the following variables were measured for each bird species:
c. Nesting site (ground, cavity within ground, tree, cavity above ground)
2- Insomnia and education. Is insomnia related to education status? Researchers at the Universities of Memphis, Alabama
at Birmingham, and Tennessee investigated this question in the Journal of Abnormal Psychology (Feb. 2005). Adults living
in Tennessee were selected to participate in the study, which used a random-digit telephone dialing procedure. Two of the
many variables measured for each of the 575 study participants were number of years of education and insomnia status
(normal sleeper or chronic insomniac). The researchers discovered that the fewer the years of education, the more likely the
b. Identify the data collection method. Are there any potential biases in the method used?
3- Drafting NFL quarterbacks. The Journal of Productivity Analysis (Vol. 35, 2011) published a study of how successful
National Football League (NFL) teams are in drafting productive quarterbacks. Data were collected for all 331 quarterbacks
drafted over a 38-year period. Several variables were measured for each QB, including draft position (one of the top 10
players picked, selection between picks 11–50, or selected after pick 50), NFL winning ratio (percentage of games won), and
QB production score (higher scores indicate more productive QBs). The researchers discovered that draft position was only
weakly related to a quarterback’s performance in the NFL. They concluded that “quarterbacks taken higher [in the draft] do
Explain.
Acknowledgement
The core content of the slides is from the textbook of this course;
by