L2-Types of Data, Central Tendency and Dispersion-2

The document discusses the merits and demerits of statistics, emphasizing its role in data summarization and decision-making while noting its limitations regarding individual data and emotional factors. It categorizes data sources into primary and secondary, outlines types of variables (quantitative and qualitative), and explains measures of central tendency (mean, median, mode) and variation (range, quartiles). Additionally, it covers data visualization techniques such as bar charts, histograms, and pie charts for effective data presentation.

Uploaded by

NAHLA ELKHOLY
Copyright
© All Rights Reserved

Looking at Data

Descriptive Statistics
MERITS OF
STATISTICS
 Summarization of data
 Grouping and presentation

 Facilitates comparison

 Planning and decision making

 Evaluation
DEMERITS OF
STATISTICS
 Concerned only with groups and
neglecting individuals

 Does not consider feelings or emotions

 Can be used to derive false conclusions


SOURCES OF DATA
 Surveillance system
 Planned surveys
 Experiments
 Health organizations
 Governmental sources
 Private sources
 Internet
CLASSIFICATION OF DATA
ACCORDING TO SOURCE

A) Primary data, i.e. data collected by the investigator himself

B) Secondary data, i.e. data not collected by the investigator
TYPES OF VARIABLES
1- Quantitative variables:
* CONTINUOUS e.g. weight and height
* DISCRETE e.g. number of persons, pulse rate

2- Qualitative variables:
* NOMINAL e.g. sex, blood group
* ORDINAL e.g. social class, disease severity or grade
SOURCES OF ERROR
DURING DATA COLLECTION

1- Faulty technique
2- Defective instrument
3- Inter-observer error
4- Intra-observer error
5- Typographical error
Types of Variables:
Overview
Categorical:
  binary (2 categories)
  nominal (more categories)
  ordinal (order matters)
Quantitative:
  discrete (numerical)
  continuous (uninterrupted)
Categorical Variables
 Also known as “qualitative.”

 Categories.


treatment groups

exposure groups

disease status
Categorical Variables
 Dichotomous (binary) – two levels


Dead/alive

Treatment/placebo

Disease/no disease

Exposed/Unexposed

Heads/Tails

Pulmonary Embolism (yes/no)

Male/female
Categorical Variables
 Nominal variables – Named categories.
Order doesn’t matter!


The blood type of a patient (O, A, B, AB)

Marital status

Occupation
Categorical Variables
 Ordinal variable – Ordered categories.
Order matters!


Staging in breast cancer as I, II, III, or IV

Birth order—1st, 2nd, 3rd, etc.

Level of education

Ratings on a scale from 1-5

Likert scale

Age in categories (10-20, 20-30, etc.)

Level of socio economic standard
Quantitative Variables
 Numerical variables; may be
arithmetically manipulated.

 Counts
 Time
 Age
 Height
Quantitative Variables
 Discrete Numbers – a limited set of
distinct values, such as whole numbers.


Number of new AIDS cases in CA in a year
(counts)

Years of school completed

The number of children in the family (cannot have
a half a child!)

The number of deaths in a defined time period
(cannot have a partial death!)

Roll of a die
Quantitative Variables
 Continuous Variables - Can take on any
number within a defined range.

Time-to-event (survival time)

Age

Blood pressure

Serum insulin

Speed of a car

Income

Shock index (Kline et al.)
Looking at Data
  How are the data distributed?
 Where is the center?
 What is the range?
 What’s the shape of the distribution (e.g.,
Gaussian, binomial, exponential, skewed)?

 Are there “outliers”?

 Are there data points that don’t make


sense?
Frequency Plots
(univariate)
Categorical variables
 Bar Chart

Continuous variables
 Box Plot

 Histogram
Types of graphs
 Line graph
 Cumulative frequency curve
 Bar graph
 Histogram
 Frequency polygon
 Pie chart
 Squares
 Figures and shapes
Linear graph
 It is used when plotting
quantitative data against a time
factor
 Another ex. Fever and fluid chart
[Figure: line graph showing how the incidence rate per 100,000 (y-axis) differs across the years 1996 to 2000 (x-axis)]
Bar Chart
 Used for categorical variables to
show frequency or proportion in
each category.
 Translate the data from frequency
tables into a pictorial
representation…
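As a rough illustration of turning a frequency table into a bar chart, the sketch below counts categories and prints one text bar per category. The blood group sample and the helper name `frequency_table` are made up for this example and are not taken from the lecture data.

```python
from collections import Counter

# Hypothetical sample of patient blood groups (illustration only).
blood_groups = ["O", "A", "A", "B", "O", "AB", "O", "A", "O", "B"]


def frequency_table(values):
    """Return {category: (count, percent)} for a list of categorical values."""
    counts = Counter(values)
    n = len(values)
    return {
        category: (count, 100.0 * count / n)
        for category, count in sorted(counts.items())
    }


# Print one text "bar" per category: its length is the frequency.
for category, (count, percent) in frequency_table(blood_groups).items():
    print(f"{category:<3}{'#' * count} {count} ({percent:.0f}%)")
```

The same table can feed either the frequency axis or the proportion axis of a bar chart, since both are stored per category.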
Bar Chart: categorical
variables

[Figure: bar chart with two categories, "no" and "yes"]
Bar Chart for SI categories
Much easier to extract information from a bar chart than from a table!
[Figure: bar chart; x-axis: Shock Index Category (1 to 10); y-axis: Number of Patients (0 to 200)]


Distribution of infant death
according to birth weight
[Figure: bar chart; x-axis: birth weight (Normal B.W., Low B.W., Very Low B.W.); y-axis: number of deaths (0 to 45)]
Distribution of tetanus neonatorum according to
age and sex
[Figure: bar chart with separate bars for males and females; x-axis: age groups in days (<=3, 4-6, 7-9, 10-12, 13-15, 16-18, 19-21, 22-24); y-axis: number of cases (0 to 8)]
Another presentation of bar chart
[Figure: the same data with the male and female bars combined; x-axis: age groups in days; y-axis: number of cases (0 to 12)]
Pie chart
 Used for qualitative data when the
distribution is expressed in percentages
 The pie (circle) shows parts of a whole
[Figure: pie chart, "Distribution of viral hepatitis cases at the Fever Hospital, 1998"; slices for HCV, HBV, HAV, HEV and Non A-E; slice labels 11%, 21%, 22%, 27%, 19%]
Histogram of SI
Bins of size 0.1
Note the “right skew”
[Figure: histogram; x-axis: SI (0.0 to 2.0); y-axis: Percent]
Histogram
100 bins (too much detail)
[Figure: histogram; x-axis: SI (0.0 to 2.0); y-axis: Percent]
Histogram
2 bins (too little detail)
[Figure: histogram; x-axis: SI (0.0 to 2.0); y-axis: Percent]
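The effect of bin count can be tried out with a small sketch like the one below. It uses randomly generated right-skewed values as a stand-in for SI (the lecture's actual data set is not available here), and `histogram_percent` is a helper written for this example only.

```python
import random


def histogram_percent(values, n_bins, lo, hi):
    """Percent of values in each of n_bins equal-width bins on [lo, hi]."""
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        if lo <= v <= hi:
            # A value equal to the top edge is folded into the last bin.
            idx = min(int((v - lo) / width), n_bins - 1)
            counts[idx] += 1
    total = sum(counts)
    return [100.0 * c / total for c in counts]


# Hypothetical right-skewed data standing in for SI; not the lecture's data.
random.seed(0)
si_like = [min(random.lognormvariate(-0.4, 0.3), 2.0) for _ in range(900)]

for n_bins in (2, 20, 100):
    pct = histogram_percent(si_like, n_bins, 0.0, 2.0)
    print(n_bins, "bins -> tallest bar:", round(max(pct), 1), "%")
```

With 2 bins almost everything lands in one bar, and with 100 bins each bar holds only a few percent of the data, which is the trade-off the two slides above point out.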
Some histograms from our
class data (n=18 so far…)
Starting with politics…
Feelings about math and
writing…
Measures of Central
Tendency

Basic Measures:
Arithmetic Mean
Median
Mode
Arithmetic Mean

Definition:

The sum of the values
divided by the number of values
Arithmetic Mean
Example:
The monthly incomes of 5 employees (in L.E.) are:
100, 300, 400, 200, 500
Calculate their arithmetic mean:
Arithmetic mean = sum of values / n
= (100 + 300 + 400 + 200 + 500) / 5
= 1500 / 5 = 300 L.E.
Arithmetic Mean
Example
The monthly incomes of 5 employees (in L.E.) are:
100, 200, 300, 400, 1500
Calculate their mean:
Arithmetic mean = sum of values / n
= (100 + 200 + 300 + 400 + 1500) / 5
= 2500 / 5 = 500 L.E.
N.B. Extreme values affect the arithmetic
mean; in such cases the median can be used instead.
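A quick way to check the two income examples is with Python's standard `statistics` module. This is only a sketch; the variable names are chosen for this illustration.

```python
from statistics import mean, median

# The two income examples from the slides (values in L.E.).
incomes_a = [100, 300, 400, 200, 500]
incomes_b = [100, 200, 300, 400, 1500]  # contains one extreme value

for label, incomes in (("first example", incomes_a), ("second example", incomes_b)):
    print(label, "mean =", mean(incomes), "median =", median(incomes))
```

In the second example the single income of 1500 pulls the mean up to 500, while the median stays at 300.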
Median
Definition:
The value that divides the
data into two equal sets after
arrangement in descending or
ascending order.
Median
To calculate the median you need to:
1. Arrange the values in ascending or
descending order.
2. Determine the location of the median:
(n+1)/2
* Odd number of values: the location is a whole number
* Even number of values: the location is midway between
two positions
3. Determine the value of the median
* Odd number of values: the value at that location
* Even number of values: the sum of the two middle
values divided by 2.
Median
Example

Number of children of some
families were 6, 4, 5, 0, 1, 3, 2
Calculate the median
Median
1. Values arranged in descending
order
6, 5, 4, 3, 2, 1, 0
2. The median location =
(n+1)/2 = (7+1)/2 = 4
3. So the fourth value in the
arranged set is the location of the
median
4. Value of the median = 3
Median
Example
Number of children of some
families were 6, 4, 5, 1, 3, 2
Calculate the median
Median
1. Values arranged in descending
order
6, 5, 4, 3, 2, 1
2. Location = (6+1)/2 = 3.5

3. Value of the median =
(value at 3rd location + value at 4th location)/2
= (4 + 3)/2 = 3.5
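The location rule can be sketched in a few lines of Python. The function name `median_by_location` is invented for this illustration, and it sorts in ascending order, which gives the same median as the descending order used on the slides.

```python
def median_by_location(values):
    """Return (location, median) using the (n + 1) / 2 location rule."""
    ordered = sorted(values)
    n = len(ordered)
    location = (n + 1) / 2  # 1-based position in the ordered list
    if n % 2 == 1:
        # Odd n: the location is a whole number, take that value directly.
        return location, ordered[int(location) - 1]
    # Even n: the location falls between two positions, average the two values.
    lower = ordered[n // 2 - 1]
    upper = ordered[n // 2]
    return location, (lower + upper) / 2


print(median_by_location([6, 4, 5, 0, 1, 3, 2]))  # seven families
print(median_by_location([6, 4, 5, 1, 3, 2]))     # six families
```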
Mode

Definition
The most frequent
value
Mode
Example
Number of children of some
families were 2, 4, 3, 0, 1, 3
Calculate the mode
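One possible way to work this out is to count how often each value occurs. The sketch below uses `collections.Counter`; the helper name `modes` is an assumption for this example, and it returns every value tied for the highest count, since a data set can have more than one mode.

```python
from collections import Counter


def modes(values):
    """Return (list of most frequent values, how often they occur)."""
    counts = Counter(values)
    top = max(counts.values())
    return sorted(v for v, c in counts.items() if c == top), top


children = [2, 4, 3, 0, 1, 3]
print(modes(children))
```

For these families the value 3 occurs twice and every other value once, so the mode is 3.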
THANK YOU
Measures of central
tendency
 Mean
 Median
 Mode
Central Tendency
 Mean – the average; the balancing
point

calculation: the sum of values divided by the sample size

In math shorthand:
$\bar{X} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{x_1 + x_2 + \cdots + x_n}{n}$
Mean: example
Some data:
Age of participants: 17 19 21 22 23 23 23 38

$\bar{X} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{17 + 19 + 21 + 22 + 23 + 23 + 23 + 38}{8} = \frac{186}{8} = 23.25$
Mean
 The mean is affected by extreme values
(outliers)

Data 1, 2, 3, 4, 5: Mean = (1 + 2 + 3 + 4 + 5)/5 = 15/5 = 3
Data 1, 2, 3, 4, 10: Mean = (1 + 2 + 3 + 4 + 10)/5 = 20/5 = 4
[Figure: the two data sets shown as dot plots on a 0 to 10 axis]
 Slide from: Statistics for Managers Using Microsoft® Excel 4th Edition, 2004 Prentice-Hall
Central Tendency
 Median – the exact middle value

Calculation:
 If there are an odd number of

observations, find the middle value


 If there are an even number of

observations, find the middle two values


and average them.
Median: example
Some data:
Age of participants: 17 19 21 22 23 23 23 38

Median = (22+23)/2 = 22.5


Median
 The median is not affected by
extreme values (outliers).

[Figure: two dot plots on a 0 to 10 axis, the second containing an extreme value]
Median = 3 for the first data set; Median = 3 for the second
Central Tendency
 Mode – the value that occurs most
frequently
Mode: example
Some data:
Age of participants: 17 19 21 22 23 23 23 38

Mode = 23 (occurs 3
times)
Measures of
Variation/Dispersion
 Range
 Percentiles/quartiles
 Interquartile range
 Standard deviation/Variance
Range

 Difference between the largest and


the smallest observations.
Range of age: 94 years-15 years = 79 years
[Figure: histogram of age; x-axis: AGE (Years), 0 to 100; y-axis: Percent]
Quartiles
25% 25% 25% 25%

Q1 Q2 Q3

 The first quartile, Q1, is the value for which 25% of


the observations are smaller and 75% are larger
 Q2 is the same as the median (50% are smaller,
50% are larger)
 Only 25% of the observations are greater than the
third quartile
Interquartile Range

 Interquartile range = 3rd quartile –


1st quartile = Q3 – Q1
Interquartile Range:
age

minimum = 15, Q1 = 35, Median (Q2) = 49, Q3 = 65, maximum = 94
(25% of the observations fall in each of the four intervals)

Interquartile range
= 65 – 35 = 30
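Range, quartiles and interquartile range can be sketched with the standard library as shown below. The raw ages behind the 15 to 94 summary are not given in the slides, so this example reuses the eight participant ages from the mean example instead. Software packages use slightly different quartile conventions; `statistics.quantiles` with its default "exclusive" method is only one of them, so Q1 and Q3 may differ a little from hand calculations.

```python
from statistics import quantiles

# Ages of participants from the mean/median example in the slides.
ages = [17, 19, 21, 22, 23, 23, 23, 38]

data_range = max(ages) - min(ages)

# n=4 asks for the three cut points that split the data into quarters.
q1, q2, q3 = quantiles(ages, n=4)
iqr = q3 - q1

print("range =", data_range)
print("Q1 =", q1, "Q2 (median) =", q2, "Q3 =", q3)
print("IQR =", iqr)
```

Note how the single age of 38 stretches the range, while the interquartile range only reflects the middle 50% of the values.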
Variance
 Average (roughly) of squared
deviations of values from the mean

$S^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{X})^2}{n - 1}$
Why squared deviations?
 Adding deviations will yield a sum of
0.
 Absolute values are tricky!
 Squares eliminate the negatives.

 Result:
 Increasing contribution to the variance
as you go farther from the mean.
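The points above can be checked numerically. The sketch below reuses the eight participant ages from the mean example; the variable names are chosen for this illustration.

```python
# Ages of participants from the mean example in the slides.
ages = [17, 19, 21, 22, 23, 23, 23, 38]
mean_age = sum(ages) / len(ages)

# Raw deviations from the mean cancel out exactly.
deviations = [x - mean_age for x in ages]
print("sum of deviations:", sum(deviations))

# Squared deviations are all non-negative, so they do not cancel.
squared = [d ** 2 for d in deviations]
print("sum of squared deviations:", sum(squared))

# The value farthest from the mean dominates the total.
share = max(squared) / sum(squared)
print("share contributed by the farthest value:", round(share, 2))
```

Here the age of 38 alone accounts for roughly three quarters of the sum of squared deviations, which shows how strongly squaring weights values far from the mean.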
Standard Deviation
 Most commonly used measure of
variation
 Shows variation about the mean
 Has the same units as the original
data

$S = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{X})^2}{n - 1}}$
Calculation Example:
Sample Standard
Deviation
Age data (n=8) : 17 19 21 22 23 23 23 38

n = 8, Mean = $\bar{X}$ = 23.25

$S = \sqrt{\frac{(17 - 23.25)^2 + (19 - 23.25)^2 + \cdots + (38 - 23.25)^2}{8 - 1}} = \sqrt{\frac{281.5}{7}} \approx 6.3$
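The same calculation can be reproduced in Python, either by following the formula step by step or by calling `statistics.stdev`, which also divides by n - 1. This is a sketch for checking the arithmetic, with variable names chosen for the example.

```python
from math import sqrt
from statistics import stdev, variance

ages = [17, 19, 21, 22, 23, 23, 23, 38]
n = len(ages)
mean_age = sum(ages) / n

# Follow the formula: sum of squared deviations divided by n - 1.
manual_variance = sum((x - mean_age) ** 2 for x in ages) / (n - 1)
manual_sd = sqrt(manual_variance)

print("manual: variance =", round(manual_variance, 2), "SD =", round(manual_sd, 1))
print("statistics module: variance =", round(variance(ages), 2), "SD =", round(stdev(ages), 1))
```

Both routes give a sample standard deviation of about 6.3 years.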
Std. dev is a measure of
the “average” scatter
around the mean.

Estimation method: if
the distribution is bell
shaped, the range is
around 6 SD, so here a
rough guess for SD is
79/6 ≈ 13
[Figure: histogram of age; x-axis: AGE (Years), 0 to 100; y-axis: Percent]
Std. Deviation age

Variation Section of AGE

Parameter   Variance   Standard Deviation
Value       333.1884   18.25345
Std. Deviation SI

Variation Section of SI

Parameter   Variance       Standard Deviation   Std Error of Mean   Interquartile Range   Range
Value       4.155749E-02   0.2038566            6.681129E-03        0.2460432             1.430856
Std. Deviation PE
Variation Section of PE

Parameter   Variance   Standard Deviation
Value       0.156786   0.3959621
Comparing Standard
Deviations
Data A: Mean = 15.5, S = 3.338
Data B: Mean = 15.5, S = 0.926
Data C: Mean = 15.5, S = 4.570
[Figure: three dot plots on a common axis from 11 to 21]
Symbol Clarification
 S = Sample standard deviation
(example of a “sample statistic”)
 σ = Standard deviation of the
entire population (example of a
“population parameter”) or from a
theoretical probability distribution
 X̄ = Sample mean
 µ = Population or theoretical mean
**The beauty of the normal curve:

No matter what µ and σ are, the area between µ-σ and
µ+σ is about 68%; the area between µ-2σ and µ+2σ is
about 95%; and the area between µ-3σ and µ+3σ is
about 99.7%. Almost all values fall within 3 standard
deviations of the mean.
68-95-99.7 Rule

68% of
the data

95% of the data

99.7% of the data
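These percentages can be checked with `statistics.NormalDist` (Python 3.8 or later). The helper `area_within` and the example values of µ and σ are assumptions made for this sketch.

```python
from statistics import NormalDist


def area_within(k, mu=0.0, sigma=1.0):
    """Area under a normal curve between mu - k*sigma and mu + k*sigma."""
    dist = NormalDist(mu, sigma)
    return dist.cdf(mu + k * sigma) - dist.cdf(mu - k * sigma)


# Try two different (hypothetical) normal curves: the areas do not change.
for mu, sigma in ((0.0, 1.0), (50.0, 18.0)):
    areas = [round(100 * area_within(k, mu, sigma), 1) for k in (1, 2, 3)]
    print(f"mu={mu}, sigma={sigma}: {areas} percent")
```

The areas come out the same for both curves, which is why the rule can be stated without reference to a particular mean or standard deviation.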


Summary of Symbols
 S² = Sample variance
 S = Sample standard dev
 σ² = Population (true or theoretical)
variance
 σ = Population standard dev.
 X̄ = Sample mean
 µ = Population mean
 IQR = interquartile range (middle 50%)
References
 http://www.math.yorku.ca/SCS/Gallery/
 Kline et al. Annals of Emergency Medicine 2002; 39: 144-152.
 Statistics for Managers Using Microsoft® Excel, 4th Edition, 2004, Prentice-Hall.
 Tappin, L. (1994). "Analyzing data relating to the Challenger disaster". Mathematics Teacher, 87, 423-426.
 Tufte. The Visual Display of Quantitative Information. Graphics Press, Cheshire, Connecticut, 1983.
 Wainer, H. (1997). Visual Revelations: Graphical Tales of Fate and Deception from Napoleon Bonaparte to Ross Perot.
