1 Data and Statistics
Introduction to Statistics
Scope, function and misuses
Measurement Scales
Selecting Units of Analysis: random / non-random sampling
Collecting statistical data
1.1 What is Statistics?
Statistics is a science that helps us make
better decisions in business and
economics as well as in other fields.
Statistics teaches us how to summarize,
analyze, and draw meaningful inferences
from data that then lead to improved
decisions.
These decisions that we make help us
improve the running of, for example, a
department, a company, the entire
economy, etc.
Scope
Practiced in almost all fields of human endeavor
Used by almost all people in their daily lives
Uses
Presents facts in a definite and precise form
Helps with data reduction
Measures the magnitude of variation in data
Furnishes a technique of comparison
Estimates unknown population characteristics
Formulates and tests hypotheses
Studies the relationship between two or more variables
Forecasts future events
Limitations
Deals only with quantitative information
Deals only with aggregates of facts, not with individual data items
Statistical results are only approximately, not mathematically, exact
Statistics can easily be misused and should therefore be used by experts
Descriptive Statistics
Collect
Organize
Summarize
Display
Analyze
Inferential Statistics
Predict and forecast values of population parameters
Test hypotheses about values of population parameters
Make decisions
Measurement scales
• Nominal Scale - groups or classes
Gender, color, ethnicity, etc.
• Ordinal Scale - order matters
Ranks (letter grades, income status)
• Interval Scale - difference or distance matters; has an arbitrary zero value.
Temperatures (°F, °C)
• Ratio Scale - ratio matters; has a natural zero value.
Salary, weight, age, rainfall (mm), etc.
Samples and populations
Population: the set of all measurements of interest
Sample: a subset of the measurements selected from the population
Census: a complete enumeration of the population
Population size is denoted N; sample size is denoted n
Why Sample?
Census of a population may be:
Impossible
Impractical
Too costly
Random vs non-random samples
Random samples: units are selected by randomization (chance)
Advantages:
• Reliable estimates
• Generalization to the population
Disadvantages
• More complex
• More time-consuming
• Usually more costly
Non-random samples: selection is not left to chance
Advantages:
• Quick & Cheaper
• Used when sampling frame is not
available
• Often used in exploratory studies, e.g.
for hypothesis generation
Disadvantages:
• Estimates are less reliable
• Results cannot be generalized to the population with confidence
Non-Random sampling
Purposive sampling
Convenience sampling
Quota sampling
Snowball sampling
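The contrast between a random and a non-random sample can be sketched in a few lines of Python. This is only an illustration: the sampling frame of 100 numbered units, the sample size of 10, and the seed are all assumptions made for the example, not part of the course material.

```python
import random

# Hypothetical sampling frame: ID numbers for a population of N = 100 units.
population = list(range(1, 101))

# Fixed seed (an arbitrary choice) so the draw can be reproduced.
rng = random.Random(42)

# Simple random sample of n = 10 units, drawn without replacement:
# every unit in the frame has the same chance of selection.
random_sample = rng.sample(population, 10)

# Convenience sample, for contrast: take whichever units come first.
convenience_sample = population[:10]

print("Random sample:     ", sorted(random_sample))
print("Convenience sample:", convenience_sample)
```

The convenience sample always contains units 1 to 10, so any pattern tied to position in the frame would bias it; the random sample has no such systematic preference.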
Observation: basic unit or experimental Unit
For example:
• One person replying to a questionnaire in a survey
• A single animal in agricultural experiment
• A household in a community survey
• An employee in a given institution
Data: a set of observations on experimental units
1. Qualitative or Categorical data
When the characteristic under study is a qualitative trait that is only
classified into categories and not numerically measured, the
resulting data are categorical data.
Qualitative data may be:
Nominal
Example:
Color-white/red/black
Marital status- married/ unmarried/divorced/ widowed
Sex- male/female
Ordinal
For example:
Severity of disease- 1=mild, 2= moderate, 3= severe
Degree of satisfaction- dissatisfied, satisfied, delighted
2. Quantitative data
If the characteristic is measured on a numerical scale, the
resulting data consist of a set of numbers
called measurement data.
For Example:
Height of a plant
Strength of the bridge
Weight of a cow after being fed a treatment diet
Quantitative variables can be classified into discrete and continuous.
Discrete variables: can take only integer values.
Example:
◦ Number of people in a household
◦ Parity
◦ Number of car accidents
Continuous variables: can take any value in an interval.
◦ Age, height, weight
Primary data: directly collected data
Methods for collecting such data include:
Interviews, focus group discussions (FGD), observation
Secondary data: data already collected and stored
Examples: reports, data banks, etc.
Questionnaire: a tool for data collection
Objectives
◦ To maximize the proportion of subjects answering our questionnaire, that is, to maximize the response rate
◦ To obtain accurate, relevant information for our survey
Two types of questions
◦ Closed (including Likert scale and semantic differential scale items)
Sex: Male [ ] Female [ ]
Did you watch television last night? Yes [ ] No [ ]
My visit has been good value for money. Strongly agree, Agree, Disagree, Strongly disagree
Useless 1 2 3 4 5 6 7 Useful
Interesting 1 2 3 4 5 6 7 Boring
◦ Open-ended
How was Christmas?
The task of data collection begins after a
research problem has been defined and the
research design has been laid out
1. Primary data collection
Collection of fresh data for the first time
Experiments and surveys (sample or census)
2. Secondary data collection
◦ The collection work is merely that of compilation
1. Survey: conducting a field inquiry
Interview (face-to-face, telephone, postal)
2. Experiment
Units are randomly assigned to groups
Two groups: the treatment group and the control group
Example: students taught with and without a tutor;
employees given an incentive and employees given none
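Random assignment to a treatment group and a control group can be sketched as follows. The function name `assign_groups`, the ten student labels, and the seed are hypothetical choices for illustration; an equal split is also an assumption, since the slides do not say how large each group should be.

```python
import random

def assign_groups(units, seed=None):
    """Randomly split units into a treatment group and a control group."""
    rng = random.Random(seed)
    shuffled = list(units)      # copy, so the caller's list is not modified
    rng.shuffle(shuffled)       # chance alone decides the order
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

# Hypothetical experimental units.
students = [f"student_{i}" for i in range(1, 11)]

treatment, control = assign_groups(students, seed=1)
print("Treatment (tutor):  ", treatment)
print("Control (no tutor): ", control)
```

Because the shuffle, not the investigator, decides who receives the treatment, differences between the groups at the outset are due to chance only.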
3. Focus group discussions (FGD)
Trained moderator
Small group of respondents
4. Observation: watching and recording
(especially for subjects who are not capable of
giving verbal reports)
Example:
In a study of consumer behavior, the investigator,
instead of asking which brand of wristwatch the
respondent uses, may simply look at the watch.
5. Case studies: a fairly intensive examination
of a single unit, such as a person, a small group
of people, or a single company
Generalizing from one case is not viable (several
case studies are needed to represent certain features)
Secondary data: data that have already been collected by someone
else for a different purpose.
For example: annual company reports, government statistics, and
health care records
Questions to ask:
◦ Where have the data come from?
◦ Do they cover the correct geographical location?
◦ Are they current (not too out of date)?
◦ If you are going to combine them with other data, are the data
comparable (for example, same units, same time period)?
◦ If you are going to compare them with other data, are you
comparing like with like?
Why do we need a pilot survey?
◦ To refine the questionnaire
◦ To make possible amendments to the questions
When do we apply it?
◦ As a small-scale trial prior to the main survey that
tests all your question planning
Caution: before using secondary data, check that
they possess:
◦ Reliability: who collected the data, from what sources, when, and
at what level of accuracy?
◦ Suitability
◦ Adequacy
Introduction
Descriptive statistics are graphical or numerical indices
that describe or summarize characteristics of data.
These include:
Numerical description of a single variable
Graphical description of a single variable
Measures of shape
Median: the middle value of data sorted in order of magnitude. It is the 50th percentile.
Mean: the average.

Sales   Sorted Sales
  9        6
  6        9
 12       10
 10       12
 13       13
 15       14
 16       14
 14       15
 14       16
 16       16
 17       16
 16       17
 24       17
 21       18
 22       18
 18       19
 19       20
 18       21
 20       22
 17       24

Position of the 50th percentile: (20 + 1)(50/100) = 10.5
Median = 16 + (0.5)(0) = 16
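The position rule used above, (n + 1)P/100, can be written as a short Python function. This is a sketch: the function name `percentile` and the clamping of positions that fall outside 1..n are assumptions added for the example, while the sales figures are the ones in the table.

```python
def percentile(data, p):
    """P-th percentile using the position rule (n + 1) * P / 100."""
    values = sorted(data)
    n = len(values)
    position = (n + 1) * p / 100          # 1-based and possibly fractional
    position = min(max(position, 1), n)   # clamp to 1..n (an assumption)
    whole = int(position)                 # whole part of the position
    fraction = position - whole           # fractional part
    if whole == n:
        return float(values[-1])
    # Move the fractional distance from the whole-position value to the next one.
    return values[whole - 1] + fraction * (values[whole] - values[whole - 1])

sales = [9, 6, 12, 10, 13, 15, 16, 14, 14, 16,
         17, 16, 24, 21, 22, 18, 19, 18, 20, 17]

# Position 10.5: halfway between the 10th and 11th sorted values (both 16).
print(percentile(sales, 50))  # 16.0
```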
Dot plot of the sales data:
                      .
 .  .  .  .  .  :  .  :  :  :  .  .  .  .  .
---------------------------------------------
 6  9 10 12 13 14 15 16 17 18 19 20 21 22 24
Mode = 16
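Population mean and sample mean:

$$\mu = \frac{\sum_{i=1}^{N} x_i}{N} \qquad \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$

Mean is a computed average. For the sales data
(9, 6, 12, 10, 13, 15, 16, 14, 14, 16, 17, 16, 24, 21, 22, 18, 19, 18, 20, 17; total 317):

$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{317}{20} = 15.85$$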
Dot plot of the sales data:
                      .
 .  .  .  .  .  :  .  :  :  :  .  .  .  .  .
---------------------------------------------
 6  9 10 12 13 14 15 16 17 18 19 20 21 22 24
Mean = 15.85
Median and Mode = 16
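As a quick check of the three measures of central tendency, the sales data can be passed to Python's standard `statistics` module. This is a minimal sketch; the variable names are illustrative.

```python
import statistics

sales = [9, 6, 12, 10, 13, 15, 16, 14, 14, 16,
         17, 16, 24, 21, 22, 18, 19, 18, 20, 17]

mean = statistics.mean(sales)      # sum of the values divided by n: 317 / 20
median = statistics.median(sales)  # average of the two middle sorted values
mode = statistics.mode(sales)      # most frequent value

print("Mean:  ", mean)    # 15.85
print("Median:", median)  # 16.0
print("Mode:  ", mode)    # 16
```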
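Population variance and standard deviation:

$$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N} = \frac{\sum_{i=1}^{N} x_i^2 - \frac{\left(\sum_{i=1}^{N} x_i\right)^2}{N}}{N}, \qquad \sigma = \sqrt{\sigma^2}$$

Sample variance and standard deviation:

$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} = \frac{\sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}}{n - 1}, \qquad s = \sqrt{s^2}$$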
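Sample variance of the sales data (n = 20, x̄ = 15.85):

   x     x - x̄    (x - x̄)²      x²
   6     -9.85     97.0225      36
   9     -6.85     46.9225      81
  10     -5.85     34.2225     100
  12     -3.85     14.8225     144
  13     -2.85      8.1225     169
  14     -1.85      3.4225     196
  14     -1.85      3.4225     196
  15     -0.85      0.7225     225
  16      0.15      0.0225     256
  16      0.15      0.0225     256
  16      0.15      0.0225     256
  17      1.15      1.3225     289
  17      1.15      1.3225     289
  18      2.15      4.6225     324
  18      2.15      4.6225     324
  19      3.15      9.9225     361
  20      4.15     17.2225     400
  21      5.15     26.5225     441
  22      6.15     37.8225     484
  24      8.15     66.4225     576
 317      0       378.5500    5403   (totals)

$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} = \frac{\sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}}{n - 1} = \frac{5403 - \frac{317^2}{20}}{20 - 1} = \frac{5403 - \frac{100489}{20}}{19} = \frac{5403 - 5024.45}{19} = \frac{378.55}{19} = 19.923684$$

$$s = \sqrt{s^2} = \sqrt{19.923684} = 4.46$$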
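The worked example above can be reproduced in Python with both the definitional formula (squared deviations from the mean) and the computational shortcut (sum of squares minus squared sum over n). This is a sketch with illustrative variable names; the data are the sales figures used throughout.

```python
import math

sales = [9, 6, 12, 10, 13, 15, 16, 14, 14, 16,
         17, 16, 24, 21, 22, 18, 19, 18, 20, 17]
n = len(sales)
mean = sum(sales) / n

# Definitional formula: sum of squared deviations divided by (n - 1).
sum_sq_dev = sum((x - mean) ** 2 for x in sales)
var_definitional = sum_sq_dev / (n - 1)

# Computational formula: (sum of x^2 - (sum of x)^2 / n) divided by (n - 1).
sum_x = sum(sales)
sum_x2 = sum(x * x for x in sales)
var_computational = (sum_x2 - sum_x ** 2 / n) / (n - 1)

std_dev = math.sqrt(var_computational)

print(f"Sum of x = {sum_x}, sum of x^2 = {sum_x2}")
print(f"s^2 (definitional)  = {var_definitional:.6f}")
print(f"s^2 (computational) = {var_computational:.6f}")
print(f"s = {std_dev:.2f}")
```

Both formulas give the same variance up to floating-point rounding; the computational form only needs the two running totals, which is why it was preferred for hand calculation.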
Frequency distribution table columns: Spending Class ($) [x], Frequency (number of customers) [f(x)], Relative Frequency [f(x)/n]
[Figure: two panels, each with Sales (0 to 50) on the horizontal axis]
(Cumulative frequency or relative frequency graph)
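A frequency distribution with relative and cumulative relative frequencies can be tabulated as in the sketch below. It uses the sales data from the earlier examples rather than the spending data, whose values are not reproduced here, and the class width of 5 is an assumption made purely for illustration.

```python
from collections import Counter

sales = [9, 6, 12, 10, 13, 15, 16, 14, 14, 16,
         17, 16, 24, 21, 22, 18, 19, 18, 20, 17]
class_width = 5  # assumed class width; chosen only for this example
n = len(sales)

# Map each value to the lower limit of its class, e.g. 12 -> 10 (class 10-14).
counts = Counter((x // class_width) * class_width for x in sales)

cumulative = 0
print(f"{'Class':<8}{'f(x)':>6}{'f(x)/n':>9}{'Cum. rel.':>11}")
for lower in sorted(counts):
    freq = counts[lower]
    cumulative += freq
    label = f"{lower}-{lower + class_width - 1}"
    print(f"{label:<8}{freq:>6}{freq / n:>9.2f}{cumulative / n:>11.2f}")
```

The relative frequencies sum to 1, and the last cumulative relative frequency is 1.00; plotting the cumulative column against the class limits gives the cumulative frequency graph.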
[Figure: time plot of Monthly Steel Production; vertical axis: Millions of Tons (5.5 to 8.5); horizontal axis: Month]
Pie Charts
Categories represented as percentages of total
Bar Graphs
Heights of rectangles represent group
frequencies
Figure 1-10: Twentysomethings split on job satisfaction
Category
Don't like my job but it is on my career path
Job is OK, but it is not on my career path
Enjoy job, but it is not on my career path
My job just pays the bills
Happy with career
19.0%
Enjoy job, but it is not on my career path
23.0%
My job just pays the bills
Figure 1-11: SHIFTING GEARS
Quarterly net income for General Motors (in billions)
[Figure: vertical axis from 0.0 to 1.5; horizontal axis: quarters 1Q 2003 through 1Q 2004]
Skewness and Kurtosis: measures of the shape of a
distribution
Skewness
◦ Measure of asymmetry of a frequency distribution
Skewed to left
Symmetric or unskewed
Skewed to right
Kurtosis
◦ Measure of flatness or peakedness of a frequency
distribution
Platykurtic - relatively flat distribution
Mesokurtic - normal; not too flat and not too peaked
Leptokurtic - relatively peaked distribution
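The slides do not give formulas for skewness and kurtosis. One common choice, assumed here, is the moment-based definition: skewness is the third central moment divided by the 1.5 power of the second, and kurtosis is the fourth central moment divided by the square of the second. Under that definition a normal (mesokurtic) distribution has kurtosis 3. The function names below are illustrative.

```python
def central_moment(data, k):
    """k-th central moment: the average of (x - mean)^k."""
    mean = sum(data) / len(data)
    return sum((x - mean) ** k for x in data) / len(data)

def skewness(data):
    """Moment-based skewness: negative = skewed left, positive = skewed right."""
    return central_moment(data, 3) / central_moment(data, 2) ** 1.5

def kurtosis(data):
    """Moment-based kurtosis: below 3 = platykurtic, above 3 = leptokurtic."""
    return central_moment(data, 4) / central_moment(data, 2) ** 2

sales = [9, 6, 12, 10, 13, 15, 16, 14, 14, 16,
         17, 16, 24, 21, 22, 18, 19, 18, 20, 17]

print("Skewness:", skewness(sales))
print("Kurtosis:", kurtosis(sales))

# A symmetric data set has zero skewness.
print("Skewness of [1, 2, 3, 4, 5]:", skewness([1, 2, 3, 4, 5]))
```

Statistical packages often apply sample-size corrections or report kurtosis minus 3, so values from other software may differ from this sketch.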
Techniques to determine relationships and trends,
identify outliers and influential observations, and
quickly describe or summarize data sets.
• Stem-and-Leaf Displays
Quick-and-dirty listing of all observations
Conveys some of the same information as a histogram
• Box Plots
Median
Lower and upper quartiles
Maximum and minimum
Stem | Leaves
1 | 122355567
2 | 0111222346777899
3 | 012457
4 | 11257
5 | 0236
6 | 02
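A stem-and-leaf display is simple to build by hand or in code: split each value into a stem (here the tens digit) and a leaf (the units digit), then list the leaves in order beside each stem. The sketch below rebuilds the stems and leaves shown above from the 42 two-digit values that the display encodes; the function name and the `|` separator are illustrative choices.

```python
from collections import defaultdict

def stem_and_leaf(data):
    """Return display lines with the tens digit as stem and units digit as leaf."""
    leaves = defaultdict(list)
    for value in sorted(data):
        stem, leaf = divmod(value, 10)
        leaves[stem].append(str(leaf))
    return [f"{stem} | {''.join(leaves[stem])}" for stem in sorted(leaves)]

# Values read back from the display (stem = tens, leaf = units).
data = [11, 12, 12, 13, 15, 15, 15, 16, 17,
        20, 21, 21, 21, 22, 22, 22, 23, 24, 26, 27, 27, 27, 28, 29, 29,
        30, 31, 32, 34, 35, 37,
        41, 41, 42, 45, 47,
        50, 52, 53, 56,
        60, 62]

for line in stem_and_leaf(data):
    print(line)
```

Unlike a histogram, the display keeps every individual observation, which is why the original values can be read back from it.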
Elements of a box plot:
Box: from Q1 to Q3 (the interquartile range, IQR), with the median marked inside
Inner fences: Q1 - 1.5(IQR) and Q3 + 1.5(IQR)
Outer fences: Q1 - 3(IQR) and Q3 + 3(IQR)
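The fences can be computed directly from the quartiles. The sketch below applies the (n + 1)P/100 position rule to the sales data from the earlier examples; other quartile conventions give slightly different values. Treating points beyond the inner fences as possible outliers and points beyond the outer fences as extreme outliers is a common convention, assumed here for illustration.

```python
def percentile(data, p):
    """P-th percentile using the position rule (n + 1) * P / 100."""
    values = sorted(data)
    position = (len(values) + 1) * p / 100   # 1-based position
    whole = int(position)
    fraction = position - whole
    # Assumes 1 <= position < n, which holds for the quartiles of this data set.
    return values[whole - 1] + fraction * (values[whole] - values[whole - 1])

sales = [9, 6, 12, 10, 13, 15, 16, 14, 14, 16,
         17, 16, 24, 21, 22, 18, 19, 18, 20, 17]

q1 = percentile(sales, 25)
q3 = percentile(sales, 75)
iqr = q3 - q1

inner_fences = (q1 - 1.5 * iqr, q3 + 1.5 * iqr)
outer_fences = (q1 - 3 * iqr, q3 + 3 * iqr)

# Points outside the inner fences are flagged as possible outliers.
possible_outliers = [x for x in sales
                     if x < inner_fences[0] or x > inner_fences[1]]

print("Q1, Q3, IQR: ", q1, q3, iqr)
print("Inner fences:", inner_fences)
print("Outer fences:", outer_fences)
print("Possible outliers:", possible_outliers)
```

For these data every value lies inside the inner fences, so the whiskers of the box plot would simply run to the minimum (6) and the maximum (24).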
• Scatter Plots are used to identify and report
any underlying relationships among pairs of
data sets.
• The plot consists of a scatter of points, each
point representing an observation.
• Scatter plot with trend line.
• This type of relationship is known as a positive correlation.
Correlation will be discussed in later chapters.