Chapter 1 2023
Defining and
Collecting Data
Objectives
In this chapter you learn:
To understand issues that arise when defining
variables.
How to define variables.
To understand the different measurement scales.
How to collect data.
To identify different ways to collect a sample.
To understand the issues involved in data
preparation.
To understand the types of survey errors.
Why Use Statistics?
Imagine you are a promoter of a theatrical production in
the 1900s. How do you promote and price tickets?
Print flyers and place advertisements in local papers.
Set a price based on past experience.
If the tickets sell out quickly, increase the price next time.
If the tickets don’t sell, decrease the price next time.
Jump ahead about 85 years. Start using computer
systems to:
Sell more categories of tickets, such as premium-priced seats.
Monitor sales as customers buy tickets, using schedule
reports to add or remove performance dates and customise
seat pricing.
Why Use Statistics?
Jump ahead to today. Your fully online ticketing system
allows you to:
Update seat inventory automatically.
Use dynamic pricing to automatically alter seat prices based on
factors like peak demand.
Gain insights into your customers based on sales data, such as
where they live or what other tickets they buy.
Knowing your customers allows you to aim advertising and
publicity at the correct target market.
Using social media, you can determine who is viewing or
reacting to your advertising.
So how effective is this modern approach to
advertising?
Why Use Statistics?
The early 2014 financial results showed that The Lion King
from Disney Theatrical Productions, which opened in 1997, was
the top-grossing show on Broadway for 2013.
This came after grosses declined by 25% in 2009.
Four years later, grosses were up 67% and weekly grosses
typically exceeded those of the show’s opening weeks,
after adjusting for inflation!
How did Disney increase sales by 67%? By combining
business domain knowledge with business statistics and
analytics to sell tickets.
As a musical producer on Broadway said: “We make
educated predictions on price. Disney, on the other hand,
has turned it into a science.”
Why Use Statistics?
Disney followed the plan of action presented in this
course. Disney:
Collected and summarised daily and weekly data.
Performed tests and experiments on its data to analyse
it.
Used the insights from these analyses to develop a new
interactive seating map that allowed customers to buy
tickets for specific seats and permitted Disney to adjust
ticket pricing for each seat and each performance.
As a result, The Lion King still achieves weekly grosses
of around $2 million after more than 20 years.
Why Use Statistics?
Figure: types of variables, categorical and numerical.
Categorical example: “Do you have a Facebook profile?” (Yes, No).
POPULATION
A population contains all the items or
individuals of interest that you seek to
study.
SAMPLE
A sample contains only a portion of a
population of interest.
Population vs. Sample DCOVA
Samples:
Nonprobability samples: Judgment, Convenience
Probability samples: Simple Random, Systematic, Stratified, Cluster
Types of Samples:
Nonprobability Sample DCOVA
Probability samples: Simple Random, Systematic, Stratified, Cluster
Probability Sample:
Simple Random Sample DCOVA
Figure: cluster sample. The population is divided into
16 clusters, and randomly selected clusters form the sample.
Probability Sample:
Comparing Sampling Methods
DCOVA
Simple random sample and Systematic sample:
Simple to use.
May not be a good representation of the
population’s underlying characteristics.
Stratified sample:
Ensures representation of individuals across the
entire population.
Cluster sample:
More cost-effective.
Less efficient (needs a larger sample to achieve the
same level of precision).
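The four probability sampling methods compared above can be sketched in Python as follows. This is a minimal illustration that uses the standard definitions of each method, not material from the slides. The function names, the frame of 40 numbered items, and the strata are all hypothetical.

```python
import random

def simple_random_sample(frame, n, rng):
    """Every item has the same chance of selection (without replacement)."""
    return rng.sample(frame, n)

def systematic_sample(frame, n, rng):
    """Pick a random start in the first k items, then take every k-th item.
    Assumes 1 <= n <= len(frame)."""
    k = len(frame) // n
    start = rng.randrange(k)
    return frame[start::k][:n]

def stratified_sample(strata, fraction, rng):
    """Draw a simple random sample of the same fraction from every stratum."""
    sample = []
    for items in strata.values():
        size = max(1, round(len(items) * fraction))
        sample.extend(rng.sample(items, size))
    return sample

def cluster_sample(clusters, n_clusters, rng):
    """Randomly select whole clusters and keep every item inside them."""
    chosen = rng.sample(clusters, n_clusters)
    return [item for cluster in chosen for item in cluster]

# Hypothetical frame of 40 numbered items.
rng = random.Random(0)
frame = list(range(1, 41))
strata = {"A": frame[:30], "B": frame[30:]}
clusters = [frame[i:i + 5] for i in range(0, 40, 5)]

print(simple_random_sample(frame, 8, rng))
print(systematic_sample(frame, 8, rng))
print(stratified_sample(strata, 0.2, rng))
print(cluster_sample(clusters, 2, rng))
```

Note how the stratified version guarantees that both strata appear in the sample, while the cluster version keeps every item of the chosen clusters and none from the others.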
Probability Sample:
Selection with Probability
Proportionate To Size DCOVA
Copyright reserved 45
Selection with PPS: Example
DCOVA
Suppose the total monetary value of sales on N =
1500 invoices is R225 000 and that n = 20 invoices
must be selected according to a PPS selection
process. First, generate twenty 6-digit random
numbers between 000001 and 225000 and arrange
these numbers in ascending order.
Then use the following table to determine which
invoices (which could be fewer than 20 if some invoices
were selected more than once) to associate with the
random numbers:
Invoice  Monetary Value (R)  Accumulated Value (R)  Interval Values  6-Digit Random Numbers
1        10
2        112
3        78
4        150
5        5872
6        613
7        114
8        14
etc      etc                 etc                    etc              etc
Selection with PPS
DCOVA
The number 000972 also indicates invoice 5, but note
that no invoice may be selected twice.
Invoice  Monetary Value (R)  Accumulated Value (R)  Interval Values  6-Digit Random Numbers
1        10                  10                     01-10
2        112                 122                    11-122           000098
3        78                  200                    123-200
4        150                 350                    201-350
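The lookup in the table can be sketched in Python as follows. This is a minimal illustration under stated assumptions: it uses only the eight invoice values listed in the example, so the random numbers must fall between 1 and 6963 rather than 225000, and the function names are hypothetical.

```python
from bisect import bisect_left
from itertools import accumulate

# Monetary values (R) of the first eight invoices in the example.
values = [10, 112, 78, 150, 5872, 613, 114, 14]

# Accumulated values: 10, 122, 200, 350, 6222, 6835, 6949, 6963.
cumulative = list(accumulate(values))

def invoice_for(random_number, cumulative):
    """Return the 1-based invoice whose interval contains random_number."""
    if not 1 <= random_number <= cumulative[-1]:
        raise ValueError("random number falls outside the accumulated total")
    # The first accumulated value >= random_number marks the interval.
    return bisect_left(cumulative, random_number) + 1

def pps_select(random_numbers, cumulative):
    """Map sorted random numbers to invoices, keeping each invoice only once."""
    selected = []
    for r in sorted(random_numbers):
        invoice = invoice_for(r, cumulative)
        if invoice not in selected:
            selected.append(invoice)
    return selected

print(invoice_for(98, cumulative))    # 000098 lies in the interval 11-122
print(invoice_for(972, cumulative))   # 000972 lies in the interval 351-6222
print(pps_select([972, 98, 5000], cumulative))
```

Because invoice 5 has by far the largest monetary value, its interval is the widest and it is the most likely to be hit. That is the sense in which selection is proportionate to size.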
Sampling error:
Variation from sample to sample will always exist.
Measurement error:
Due to weaknesses in question design and/or respondent
error.
Types of Survey Errors (continued)
DCOVA
Coverage error: Excluded from frame.
Sampling error: Random differences from sample to sample.
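Sampling error can be made concrete with a small simulation. The sketch below is illustrative only. The population of the numbers 1 to 1000, the sample size of 30, and the seed are all assumptions chosen for the demonstration.

```python
import random
from statistics import mean

rng = random.Random(2023)            # fixed seed so the run is repeatable
population = list(range(1, 1001))    # hypothetical population, mean 500.5
population_mean = mean(population)

# Draw five simple random samples of 30 items and record each sample mean.
sample_means = [mean(rng.sample(population, 30)) for _ in range(5)]

for m in sample_means:
    # The gap between each sample mean and the population mean
    # is the sampling error of that sample.
    print(f"sample mean = {m:.1f}, sampling error = {m - population_mean:+.1f}")
```

Each sample is drawn by the same procedure from the same population, yet the sample means differ. This is the random sample-to-sample variation described above.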
Excel formulas:
Excel: Sampling
The Sampling tool under Data Analysis
can be used for a systematic sample.
Go to Data -> Data Analysis -> Sampling and
click OK.
Choose k = 2 for a systematic sample by
selecting Periodic under Sampling Method
and setting the Period equal to 2.
In the Output Range box, select cell A10.
This produces a systematic sample of the row
indices of the population data frame.
As before, we can return the corresponding
values of the two X variables using INDEX().
Excel: Sampling
Note that we can use INDEX(array, row_num) or
VLOOKUP(lookup_value, table_array, col_index_num)
to return the values of the two X variables.
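For readers who want to check the Excel steps outside a spreadsheet, the same logic can be sketched in Python. This is a rough equivalent, not the Excel implementation. It assumes the periodic method starts at the k-th row and then takes every k-th row. The column of X values and the function names are hypothetical.

```python
def periodic_row_indices(n_rows, period):
    """Row indices picked by a periodic rule: period, 2 * period, ..."""
    return list(range(period, n_rows + 1, period))

def index_lookup(column, row_num):
    """Mimic INDEX(array, row_num) on one column, using 1-based rows."""
    return column[row_num - 1]

# Hypothetical column of X values for a population of eight rows.
x_column = [5.1, 4.9, 6.3, 5.8, 7.0, 6.1, 5.5, 6.7]

rows = periodic_row_indices(len(x_column), period=2)
sample = [index_lookup(x_column, r) for r in rows]

print(rows)    # sampled row indices
print(sample)  # X values returned for those rows
```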
Chapter Summary
In this chapter we have discussed:
Understanding issues that arise when defining
variables.
How to define variables.
Understanding the different measurement scales.
How to collect data.
Identifying different ways to collect a sample.
Understanding the issues involved in data
preparation.
Understanding the types of survey errors.