1 Data and Statistics

The document provides an introduction to statistics, covering its definition, scope, functions, and limitations. It discusses various measurement scales, sampling methods, and data collection techniques, emphasizing the importance of statistics in decision-making across different fields. Additionally, it outlines descriptive and inferential statistics, as well as the significance of qualitative and quantitative data.

Uploaded by

gizawtade11

May 2020

 Introduction to Statistics
 Scope, function and misuses
 Measurement Scales
 Selecting units of analysis: random/non-random sampling
 Collecting statistical data
1.1 What is Statistics?
 Statistics is a science that helps us make
better decisions in business and
economics as well as in other fields.
 Statistics teaches us how to summarize,
analyze, and draw meaningful inferences
from data that then lead to improved
decisions.
 These decisions that we make help us
improve the running of, for example, a
department, a company, the entire
economy, etc.
Scope
 A practice in almost all fields of human endeavor
 Almost all human beings, in their daily lives, need to obtain numerical facts, e.g. about prices or salary scales
 Applicable in processes such as the invention of certain drugs or assessing the extent of environmental pollution
 In industry, especially in the quality control area
Uses
 Presents facts in a definite and precise form
 Helps with data reduction
 Measures the magnitude of variation in data
 Furnishes a technique of comparison
 Estimates unknown population characteristics
 Formulates and tests hypotheses
 Studies the relationship between two or more variables
 Forecasts future events
Limitations
 Deals only with quantitative information
 Deals only with aggregates of facts, not with individual data items
 Statistical data are only approximately, not mathematically, exact
 Statistics can easily be misused and therefore should be used by experts
 Descriptive Statistics
Collect
Organize
Summarize
Display
Analyze
 Inferential Statistics
Predict and forecast values of population parameters
Test hypotheses about values of population parameters
Make decisions
•Nominal Scale - groups or classes
Gender, color, ethnicity, etc.
•Ordinal Scale - order matters
Ranks (letter grades, income status)
•Interval Scale - difference or distance matters; has an arbitrary zero value.
Temperatures (°F, °C)
•Ratio Scale - ratio matters; has a natural zero value.
Salary, weight, age, rainfall (mm), etc.
Samples and populations
 Population- the set of all measurements
 Sample- the subset of the measurements
selected from the population.
 Census- is a complete enumeration
Population size is denoted N; sample size is denoted n.
Why Sample?
 Census of a population may be:
 Impossible
 Impractical
 Too costly
Random vs non-random Sample
 Sampling- the process of taking samples
 Random sample- allows chance to determine its elements
Random Sampling

 Advantages:
• randomization or chance
• Reliable estimates
• Generalization
 Disadvantages
• More complex
• More time-consuming
• Usually more costly
Non-random samples- do not allow chance to determine their elements
 Advantages:
• Quicker and cheaper
• Can be used when a sampling frame is not available
• Often used in exploratory studies, e.g. for hypothesis generation
 Disadvantages:
• No randomization and hence no generalization (inference)
• May not be representative
 Random Sampling
 Simple random sampling
 Systematic random sampling
 Stratified random sampling
 Cluster sampling
 Multistage sampling

 Non-Random sampling
 Purposive sampling
 Convenience sampling
 Quota sampling
 Snowball sampling
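
As a rough illustration of three of the random sampling designs listed above, the sketch below draws samples from a hypothetical population of 100 numbered units using Python's standard `random` module. The population, the "urban"/"rural" strata, the sample sizes, and the function names are all assumptions made for this example; none of them come from the original slides.

```python
import random

def simple_random_sample(units, n, rng):
    """Every subset of size n has the same chance of being selected."""
    return rng.sample(units, n)

def systematic_sample(units, n, rng):
    """Pick a random start, then take every k-th unit (k = N // n)."""
    k = len(units) // n
    start = rng.randrange(k)
    return [units[start + i * k] for i in range(n)]

def stratified_sample(strata, n_per_stratum, rng):
    """Draw a simple random sample separately within each stratum."""
    return {name: rng.sample(members, n_per_stratum)
            for name, members in strata.items()}

rng = random.Random(2020)          # fixed seed so the run is repeatable
population = list(range(1, 101))   # hypothetical units numbered 1..100
strata = {"urban": population[:60], "rural": population[60:]}  # assumed strata

srs = simple_random_sample(population, 10, rng)
sys_sample = systematic_sample(population, 10, rng)
strat = stratified_sample(strata, 5, rng)
```

In each design chance decides which units enter the sample, which is what permits generalization to the population. A non-random design such as convenience sampling would simply take whichever units are easiest to reach.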
Observation: basic unit or experimental Unit
For example:
• One person replying to a questionnaire in a survey
• A single animal in agricultural experiment
• A household in a community survey
• An employee in a given institution
Data: a set of observations or experimental units
1. Qualitative or Categorical data
When the characteristic under study is a qualitative trait that is only classified into categories and not numerically measured, the resulting data are categorical data.
Qualitative data may be:
 Nominal
Example:
 Color-white/red/black
 Marital status- married/ unmarried/divorced/ widowed
 Sex- male/female
 Ordinal
For example:
 Severity of disease- 1=mild, 2= moderate, 3= severe
 Degree of satisfaction- dissatisfied, satisfied, delighted
2. Quantitative data
 If the characteristic is measured on a numerical scale, the resulting data consist of a set of numbers called measurement data.
For example:
Height of a plant
Strength of a bridge
Weight of a cow after being treated with a diet
 Quantitative variables can be classified into discrete and continuous.
 Discrete variables: variables that can only take integer values.
 Example:
◦ Number of people in a household
◦ Parity
◦ Number of car accidents
 Continuous variables: variables that can take any value in an interval.
◦ Age, height, weight
Primary data- data collected directly
Methods for collecting such data include interviews, focus group discussions (FGD), and observation.
Secondary data- data that are already stored
Examples: reports, data banks, etc.
 A questionnaire is a tool for data collection
 Objectives
◦ To maximize the proportion of subjects answering our questionnaire, that is, to maximize the response rate
◦ To obtain accurate, relevant information for our survey
 Two types of questions
◦ Closed (Likert scale, semantic differential scale)
 Sex: Male [ ] Female [ ]
 Did you watch television last night? Yes [ ] No [ ]
 My visit has been good value for money. Strongly agree, Agree, Disagree, Strongly disagree
 Useless 1 2 3 4 5 6 7 Useful
 Interesting 1 2 3 4 5 6 7 Boring
◦ Open-ended
 How was Christmas?
 The task of data collection begins after a research problem has been defined and the research design has been prepared
1. Primary data collection
 Collection of fresh data for the first time
 Experiments and surveys (sample or census)
2. Secondary data collection
◦ The collection work is merely one of compilation
1. Survey- conducting a field inquiry
 Interview (face-to-face, telephone, postal)
2. Experiment
 Units are randomly assigned to groups
 Two groups: the treatment group and the control group
Example: students with and without a tutor; employees given an incentive and employees given no incentive
3. Focus group discussions (FGD)
 Trained moderator
 Small group of respondents
4. Observation- watching and recording (especially for subjects who are not capable of giving verbal reports)
Example: In a study of consumer behavior, the investigator, instead of asking which brand of wristwatch the respondent uses, may look at the watch directly.
5. Case studies- a fairly intensive examination of a single unit such as a person, a small group of people, or a single company
 Generalizing is not viable (several case studies are needed to represent certain features)
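
The random assignment described under Experiment can be sketched in a few lines. This is only an illustration: the twenty student labels, the seed, and the function name `randomly_assign` are hypothetical, and real experiments often use more elaborate designs (blocking, unequal group sizes).

```python
import random

def randomly_assign(units, rng):
    """Shuffle the units and split them into two groups of (nearly) equal size."""
    shuffled = list(units)          # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

rng = random.Random(1)                               # fixed seed for repeatability
students = [f"student_{i}" for i in range(1, 21)]    # hypothetical units
treatment, control = randomly_assign(students, rng)  # e.g. tutor vs. no tutor
```

Because chance alone decides who lands in which group, systematic differences between the treatment and control groups other than the treatment itself become unlikely.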
Secondary data are data that have already been collected by someone else for a different purpose.
For example, this could mean using annual company reports, government statistics, and health care records.
FAQ:
◦ Where has the data come from?
◦ Does it cover the correct geographical location?
◦ Is it current (not too out of date)?
◦ If you are going to combine it with other data, are the data the same (for example, units, time, etc.)?
◦ If you are going to compare it with other data, are you comparing like with like?
 Why do we need a pilot survey?
◦ To refine the questionnaire
◦ To make possible amendments to the questions
 Where do we apply it?
◦ In a small-scale trial prior to the main survey that tests all your question planning
 Caution: before using secondary data, check that they possess:
◦ Reliability of data: who collected them, from where, when, and at what level of accuracy
◦ Suitability of data
◦ Adequacy of data
Introduction
Descriptive statistics are graphical or numerical indices that describe or summarize some characteristics of data. These include:
 Numerical description of a single variable
 Graphical description of a single variable
 Measures of shape
 Median: middle value when sorted in order of magnitude; the 50th percentile
 Mode: most frequently occurring value
 Mean: average
Sales    Sorted Sales
9        6
6        9
12       10
10       12
13       13
15       14
16       14
14       15
14       16
16       16
17       16
16       17
24       17
21       18
22       18
18       19
19       20
18       21
20       22
17       24

Position of the 50th percentile: (20 + 1)(50/100) = 10.5
Median = 16 + (.5)(0) = 16
The median is the middle value of data sorted in order of magnitude. It is the 50th percentile.
[Dot plot of the sales data on an axis from 6 to 24; the tallest column of dots is at 16]

Mode = 16

The mode is the most frequently occurring value. It is the value with the highest frequency.
Arithmetic Mean or Average
The mean of a set of observations is their average - the
sum of the observed values divided by the number of
observations.
Population Mean: $\mu = \frac{\sum_{i=1}^{N} x_i}{N}$
Sample Mean: $\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$
Sales: 9, 6, 12, 10, 13, 15, 16, 14, 14, 16, 17, 16, 24, 21, 22, 18, 19, 18, 20, 17 (total = 317)
$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{317}{20} = 15.85$
Mean is a computed average
[Dot plot of the sales data on an axis from 6 to 24]

Mean = 15.85
Median and Mode = 16

Mean < Median
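
The three measures of central tendency for the sales data can be checked with Python's standard `statistics` module. This is a minimal sketch; the variable names are chosen for the example only.

```python
import statistics

# The 20 sales values from the slides, in their original order.
sales = [9, 6, 12, 10, 13, 15, 16, 14, 14, 16,
         17, 16, 24, 21, 22, 18, 19, 18, 20, 17]

mean = statistics.mean(sales)      # sum of the values divided by their count
median = statistics.median(sales)  # middle of the sorted values
mode = statistics.mode(sales)      # most frequently occurring value

print(mean, median, mode)
```

With an even number of observations, `statistics.median` averages the two middle values, which here are both 16.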


 Range
Difference between maximum and minimum values
 Interquartile Range
Difference between third and first quartile (Q3 - Q1)
 Variance
Average* of the squared deviations from the mean
 Standard Deviation
Square root of the variance
*Definitions of population variance and sample variance differ slightly.
Sales    Sorted Sales    Rank
9        6               1     (Minimum)
6        9               2
12       10              3
10       12              4
13       13              5
15       14              6
16       14              7
14       15              8
14       16              9
16       16              10
17       16              11
16       17              12
24       17              13
21       18              14
22       18              15
18       19              16
19       20              17
18       21              18
20       22              19
17       24              20    (Maximum)

Range: Maximum - Minimum = 24 - 6 = 18
First Quartile: Q1 = 13 + (.25)(1) = 13.25
Q2 = Median = P50 = D5
Third Quartile: Q3 = 18 + (.75)(1) = 18.75
Interquartile Range: Q3 - Q1 = 18.75 - 13.25 = 5.5
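
The quartiles above follow the position rule (n + 1)p/100 with linear interpolation between neighboring ranks. The sketch below implements that rule directly; the function name `percentile` and its handling of positions outside the data (clamping to the minimum or maximum) are assumptions made for this example.

```python
def percentile(values, p):
    """p-th percentile using the position rule (n + 1) * p / 100."""
    ordered = sorted(values)
    position = (len(ordered) + 1) * p / 100   # 1-based rank, may be fractional
    whole = int(position)
    fraction = position - whole
    if whole < 1:                  # position falls before the first value
        return ordered[0]
    if whole >= len(ordered):      # position falls at or after the last value
        return ordered[-1]
    lower, upper = ordered[whole - 1], ordered[whole]
    return lower + fraction * (upper - lower)

sales = [9, 6, 12, 10, 13, 15, 16, 14, 14, 16,
         17, 16, 24, 21, 22, 18, 19, 18, 20, 17]

data_range = max(sales) - min(sales)
q1 = percentile(sales, 25)   # position 5.25
q2 = percentile(sales, 50)   # position 10.5
q3 = percentile(sales, 75)   # position 15.75
iqr = q3 - q1
```

Other software may use a different interpolation rule and give slightly different quartiles. In Python's standard library, `statistics.quantiles(sales, n=4, method="exclusive")` follows the same (n + 1) rule as this sketch.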
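Population Variance:
$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N} = \frac{\sum_{i=1}^{N} x_i^2 - \frac{\left(\sum_{i=1}^{N} x_i\right)^2}{N}}{N}$
$\sigma = \sqrt{\sigma^2}$

Sample Variance:
$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} = \frac{\sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}}{n - 1}$
$s = \sqrt{s^2}$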
x      x - x̄     (x - x̄)^2    x^2
6      -9.85     97.0225      36
9      -6.85     46.9225      81
10     -5.85     34.2225      100
12     -3.85     14.8225      144
13     -2.85     8.1225       169
14     -1.85     3.4225       196
14     -1.85     3.4225       196
15     -0.85     0.7225       225
16     0.15      0.0225       256
16     0.15      0.0225       256
16     0.15      0.0225       256
17     1.15      1.3225       289
17     1.15      1.3225       289
18     2.15      4.6225       324
18     2.15      4.6225       324
19     3.15      9.9225       361
20     4.15      17.2225      400
21     5.15      26.5225      441
22     6.15      37.8225      484
24     8.15      66.4225      576
317    0         378.5500     5403    (column totals)

Using the definition:
$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} = \frac{378.55}{20 - 1} = \frac{378.55}{19} = 19.923684$

Using the computational formula:
$s^2 = \frac{\sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}}{n - 1} = \frac{5403 - \frac{317^2}{20}}{20 - 1} = \frac{5403 - \frac{100489}{20}}{19} = \frac{5403 - 5024.45}{19} = \frac{378.55}{19} = 19.923684$

$s = \sqrt{s^2} = \sqrt{19.923684} = 4.46$
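
Both routes to the sample variance can be reproduced numerically. The sketch below is an illustration only, with variable names chosen for the example; it computes the variance of the sales data from the definition and from the computational formula, then cross-checks against the standard library.

```python
import math
import statistics

sales = [9, 6, 12, 10, 13, 15, 16, 14, 14, 16,
         17, 16, 24, 21, 22, 18, 19, 18, 20, 17]
n = len(sales)
mean = sum(sales) / n

# Definition: sum of squared deviations from the mean, divided by n - 1.
ss_dev = sum((x - mean) ** 2 for x in sales)
var_definition = ss_dev / (n - 1)

# Computational formula: sum of squares minus (sum)^2 / n, divided by n - 1.
var_shortcut = (sum(x * x for x in sales) - sum(sales) ** 2 / n) / (n - 1)

std_dev = math.sqrt(var_definition)

# statistics.variance also divides by n - 1 (sample variance).
assert math.isclose(var_definition, statistics.variance(sales))
```

Dividing by `n` instead of `n - 1` would give the population variance, which is what `statistics.pvariance` computes.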
x = Spending Class ($); f(x) = Frequency (number of customers); f(x)/n = Relative Frequency

Spending Class ($)      Frequency    Relative Frequency
0 to less than 100      30           0.163
100 to less than 200    38           0.207
200 to less than 300    50           0.272
300 to less than 400    31           0.168
400 to less than 500    22           0.120
500 to less than 600    13           0.070
Total                   184          1.000

• Example of relative frequency: 30/184 = 0.163
• Sum of relative frequencies = 1
x = Spending Class ($); F(x) = Cumulative Frequency; F(x)/n = Cumulative Relative Frequency

Spending Class ($)      Cumulative Frequency    Cumulative Relative Frequency
0 to less than 100      30                      0.163
100 to less than 200    68                      0.370
200 to less than 300    118                     0.641
300 to less than 400    149                     0.810
400 to less than 500    171                     0.929
500 to less than 600    184                     1.000

The cumulative frequency of each group is the sum of the frequencies of that and all preceding groups.
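
The relative, cumulative, and cumulative relative frequency columns can all be derived from the class frequencies alone. A minimal sketch, with variable names chosen for this example:

```python
from itertools import accumulate

classes = ["0 to less than 100", "100 to less than 200", "200 to less than 300",
           "300 to less than 400", "400 to less than 500", "500 to less than 600"]
frequencies = [30, 38, 50, 31, 22, 13]   # number of customers per class

n = sum(frequencies)                               # total number of customers
relative = [f / n for f in frequencies]            # f(x) / n
cumulative = list(accumulate(frequencies))         # running total F(x)
cumulative_relative = [c / n for c in cumulative]  # F(x) / n

for label, f, r, c, cr in zip(classes, frequencies, relative,
                              cumulative, cumulative_relative):
    print(f"{label:<22}{f:>5}{r:>8.3f}{c:>6}{cr:>8.3f}")
```

Note that 13/184 rounds to 0.071 rather than the 0.070 shown in the table; the tabulated value appears to have been adjusted so that the rounded column sums to exactly 1.000.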
Quantitative data
 A histogram is a chart made of bars of different heights.
Widths and locations of bars correspond to widths and locations of data groupings
Heights of bars correspond to frequencies or relative frequencies of data groupings
Frequency Histogram
Relative Frequency Histogram
 Frequency Polygons
Height of line represents frequency
 Ogives
Height of line represents cumulative frequency
 Time Plots
Represents values over time
[Figures: Relative Frequency Polygon (vertical axis 0.0 to 0.3) and Ogive, a cumulative frequency or relative frequency graph (vertical axis 0.0 to 1.0); horizontal axis in both: Sales, 0 to 50]
[Time plot: Monthly Steel Production; vertical axis: Millions of Tons (5.5 to 8.5); horizontal axis: Month]
 Pie Charts
Categories represented as percentages of total
 Bar Graphs
Heights of rectangles represent group
frequencies
Figure 1-10: Twentysomethings split on job satisfaction
[Pie chart. Categories: Happy with career; My job just pays the bills; Enjoy job, but it is not on my career path; Job is OK, but it is not on my career path; Do not like my job, but it is on my career path (6.0%). Remaining slice labels: 33.0%, 23.0%, 19.0%, 19.0%]
Figure 1-11: SHIFTING GEARS
Quarterly net income for General Motors (in billions)
[Bar graph; vertical axis from 0.0 to 1.5; horizontal axis: quarters 1Q 2003 through 1Q 2004]
Skewness and Kurtosis: to determine the shape of a distribution
 Skewness
◦ Measure of asymmetry of a frequency distribution
 Skewed to left
 Symmetric or unskewed
 Skewed to right
 Kurtosis
◦ Measure of flatness or peakedness of a frequency
distribution
 Platykurtic (relatively flat)
 Mesokurtic (normal)
 Leptokurtic (relatively peaked)
Skewed to left
Symmetric
Skewed to right
Platykurtic - flat distribution
Mesokurtic - not too flat and not too peaked
Leptokurtic - peaked distribution
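
The slides describe skewness and kurtosis qualitatively and give no formulas. One common way to quantify them, assumed here purely for illustration, uses central moments: skewness as m3 / m2^1.5 and excess kurtosis as m4 / m2^2 - 3, where m_k is the k-th central moment. Textbooks and software packages differ in the exact definitions and small-sample corrections, so treat this as a sketch rather than the method intended by the slides.

```python
def central_moment(values, k):
    """k-th central moment: average of (x - mean)^k over all values."""
    mean = sum(values) / len(values)
    return sum((x - mean) ** k for x in values) / len(values)

def skewness(values):
    """Negative: skewed to left; near zero: symmetric; positive: skewed to right."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5

def excess_kurtosis(values):
    """Negative: platykurtic; near zero: mesokurtic; positive: leptokurtic."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3

# Sales data used earlier in the slides.
sales = [9, 6, 12, 10, 13, 15, 16, 14, 14, 16,
         17, 16, 24, 21, 22, 18, 19, 18, 20, 17]

print(skewness(sales), excess_kurtosis(sales))
```

For the sales data this measure of skewness comes out negative, which agrees with the earlier observation that the mean (15.85) lies below the median (16).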
Techniques to determine relationships and trends,
identify outliers and influential observations, and
quickly describe or summarize data sets.
• Stem-and-Leaf Displays
 Quick-and-dirty listing of all observations
 Conveys some of the same information as a histogram
• Box Plots
 Median
 Lower and upper quartiles
 Maximum and minimum
1 122355567
2 0111222346777899
3 012457
4 11257
5 0236
6 02

Figure 1-17: Task Performance Times
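
A stem-and-leaf display like the one above can be built by splitting each value into a stem (here the tens digit) and a leaf (the units digit). In the sketch below the data values are read back from the display itself, on the assumption that stems are tens and leaves are units; the function name is hypothetical.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Return display lines: stem = tens digit, leaves = sorted units digits."""
    leaves = defaultdict(list)
    for v in sorted(values):
        leaves[v // 10].append(v % 10)
    return [f"{stem} {''.join(str(d) for d in leaves[stem])}"
            for stem in sorted(leaves)]

# Task performance times recovered from the display (assumed two-digit values).
times = [11, 12, 12, 13, 15, 15, 15, 16, 17,
         20, 21, 21, 21, 22, 22, 22, 23, 24, 26, 27, 27, 27, 28, 29, 29,
         30, 31, 32, 34, 35, 37,
         41, 41, 42, 45, 47,
         50, 52, 53, 56,
         60, 62]

for line in stem_and_leaf(times):
    print(line)
```

Unlike a histogram, the display keeps every individual observation while still showing the overall shape of the distribution.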


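Elements of a Box Plot
 Box: drawn from Q1 to Q3 (the interquartile range, IQR), with the median marked inside
 Whiskers: extend to the smallest data point not below the inner fence and to the largest data point not exceeding the inner fence (marked X)
 Inner fences: Q1 - 1.5(IQR) and Q3 + 1.5(IQR)
 Outer fences: Q1 - 3(IQR) and Q3 + 3(IQR)
 Suspected outlier (*): a point between the inner and outer fences
 Outlier (o): a point beyond the outer fences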
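
The fence formulas can be applied to the sales data used earlier (Q1 = 13.25, Q3 = 18.75). The sketch below is illustrative; it takes the quartiles from `statistics.quantiles` with `method="exclusive"`, which uses the same (n + 1) position rule as the slides, and the classification rule in the comments (suspected outliers between the fences, outliers beyond the outer fences) is the usual box plot convention, stated here as an assumption.

```python
import statistics

sales = [9, 6, 12, 10, 13, 15, 16, 14, 14, 16,
         17, 16, 24, 21, 22, 18, 19, 18, 20, 17]

q1, _, q3 = statistics.quantiles(sales, n=4, method="exclusive")
iqr = q3 - q1

inner = (q1 - 1.5 * iqr, q3 + 1.5 * iqr)   # inner fences
outer = (q1 - 3 * iqr, q3 + 3 * iqr)       # outer fences

# Suspected outliers lie between an inner and an outer fence.
suspected = [x for x in sales
             if outer[0] <= x < inner[0] or inner[1] < x <= outer[1]]
# Outliers lie beyond the outer fences.
outliers = [x for x in sales if x < outer[0] or x > outer[1]]

# Whiskers end at the most extreme points still inside the inner fences.
whisker_low = min(x for x in sales if x >= inner[0])
whisker_high = max(x for x in sales if x <= inner[1])
```

For these data every value lies inside the inner fences, so the whiskers run from the minimum (6) to the maximum (24) and no points are flagged.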
• Scatter Plots are used to identify and report
any underlying relationships among pairs of
data sets.
• The plot consists of a scatter of points, each
point representing an observation.
• Scatter plot with trend line.
• This type of relationship is known as a positive correlation.

Correlation will be discussed in later chapters.
