Chap 3. Data presentation

The document discusses methods of data organization and presentation in biostatistics, emphasizing the importance of summarizing raw data to reveal patterns. It covers frequency distributions, tables, graphs, and various statistical measures, including central tendency and dispersion. Additionally, it provides examples and exercises to illustrate the application of these methods in analyzing data.

Uploaded by

edison
Copyright
© © All Rights Reserved

Biostatistics

Chap 3. Methods of data organization and presentation
• The data collected in a survey is called raw data.
• In most cases, useful information is not immediately
evident from the mass of unsorted data.

• Collected data need to be organized so as to condense the information they contain and show patterns of variation clearly.

• Even small data sets are difficult to comprehend without some summarization.
Note!!
• Precise methods of analysis can be decided upon only when the characteristics of the data are understood.
3.1. Tables
Frequency Distributions
• When analysing voluminous collected data, it is quite useful to put them into compact tables.

• Data are presented in a meaningful way by preparing a frequency distribution of the variable.

• A frequency distribution tells how often a variable takes on each of its possible values.
• For two qualitative variables, a contingency table (cross-tabulation) is useful.
Frequency Distributions
• A relative frequency distribution presents the corresponding proportions of observations within the classes, i.e. frequency / n (n = size of the sample), or (frequency / n) x 100 when expressed as a percentage.

Gender    Absolute frequency   Relative frequency (proportion)   Relative frequency (%)
Females   10                   10/18                             (10/18) x 100
Males     8                    8/18                              (8/18) x 100
Total     18
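The gender table above can be reproduced with a few lines of Python. This is only a minimal sketch: the raw list `gender` and the helper name `frequency_table` are hypothetical and chosen for illustration; only the counts (10 females, 8 males) come from the slide.

```python
from collections import Counter

def frequency_table(values):
    """Return {category: (absolute, proportion, percent)} for a qualitative variable."""
    counts = Counter(values)
    n = sum(counts.values())  # sample size
    return {cat: (k, k / n, 100 * k / n) for cat, k in counts.items()}

# Hypothetical raw data matching the table: 10 females and 8 males.
gender = ["F"] * 10 + ["M"] * 8
table = frequency_table(gender)

for cat, (absolute, proportion, percent) in table.items():
    print(f"{cat}: n={absolute}, proportion={proportion:.3f}, percent={percent:.1f}%")
```

The proportions always sum to 1 and the percentages to 100, which is a quick check on any frequency table.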

Frequency distribution for continuous variables
• Frequency distributions present data in a relatively compact form, give a good overall picture, and contain information that is adequate for many purposes, but some things can usually be determined only from the original data.

• The construction of a grouped frequency distribution consists essentially of four steps:
• (1) choosing the classes,
• (2) sorting (or tallying) the data into these classes,
• (3) counting the number of items in each class, and
• (4) displaying the results in the form of a table.
Frequency distribution for continuous variables

• First we group observations into classes by choosing a set of contiguous, non-overlapping intervals, called class intervals (in this way the continuous variable is grouped into a discrete one).
Cumulative and Relative Frequencies:

• When the frequencies of two or more classes are added up, the resulting totals are called cumulative frequencies.

• These frequencies help us to find the total number of items whose values are less than or greater than some value.

• Relative frequencies express the frequency of each value or class as a percentage of the total frequency.
Mid-Point of a class interval and the determination of Class Boundaries
• The mid-point or class mark (Xc) of an interval is the value that lies midway between the lower true limit (LTL) and the upper true limit (UTL) of a class.
• It is calculated as:
$$X_c = \frac{\text{Upper Class Limit} + \text{Lower Class Limit}}{2}$$
• The true limits are what the tabulated limits would correspond to if one could measure exactly.
Frequency Distribution (continuous variable)
Age of 6209 persons chosen randomly in Rwanda

Classes    Centres   Absolute      Relative          Cumulative
                     frequencies   frequencies       frequencies
ci         xi        ni            fi (%)            Fi (%)
[0-10[     5         311           5.0               5.0
[10-20[    15        120           1.9               6.9
[20-30[    25        2255          36.3              43.3
[30-40[    35        2090          33.7              76.9
[40-50[    45        870           14.0              90.9
[50-60[    55        399           6.4               97.4
[60-70[    65        127           2.0               99.4
[70-80[    75        37            0.6               100.0
Total                6209          100
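The relative and cumulative columns of the age table can be recomputed from the absolute frequencies alone. The sketch below does this; the function name `relative_and_cumulative` is an illustrative choice, while the counts are copied from the table.

```python
from itertools import accumulate

# Class intervals and absolute frequencies copied from the age table.
classes = ["[0-10[", "[10-20[", "[20-30[", "[30-40[",
           "[40-50[", "[50-60[", "[60-70[", "[70-80["]
counts = [311, 120, 2255, 2090, 870, 399, 127, 37]

def relative_and_cumulative(counts):
    """Return relative and cumulative frequencies, both in percent."""
    n = sum(counts)
    relative = [100 * c / n for c in counts]
    # Accumulate the counts first, then convert, to avoid adding rounded values.
    cumulative = [100 * c / n for c in accumulate(counts)]
    return relative, cumulative

relative, cumulative = relative_and_cumulative(counts)
for label, ni, fi, Fi in zip(classes, counts, relative, cumulative):
    print(f"{label:8} {ni:5d} {fi:5.1f} {Fi:6.1f}")
```

Cumulating the raw counts before converting to percentages explains why the table shows 43.3% for [20-30[ although the rounded relative frequencies 5.0 + 1.9 + 36.3 add up to 43.2.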
[Figure: Histogram of age (horizontal axis: Age, 0 to 80; vertical axes: absolute frequency and relative frequency in %). Each bar is labelled with its absolute and relative frequency, e.g. 2255 or 36.3% for [20-30[ and 2090 or 33.7% for [30-40[; total 6209 or 100%. Mean = 33.03, Std. Dev. = 12.348, N = 6209.]
Density of relative frequency

Classes    Mid-points   Absolute      Relative          Density of relative
                        frequencies   frequencies (%)   frequencies
ci         xi           ni
[0-10[     5            311           5.0               0.5
[10-20[    15           120           1.9               0.19
[20-30[    25           2255          36.3              3.63
[30-40[    35           2090          33.7              3.37
[40-50[    45           870           14.0              1.4
[50-60[    55           399           6.4               0.64
[60-70[    65           127           2.0               0.2
[70-80[    75           37            0.6               0.06
Total                   6209          100               10

Relative frequency (%) = (ni / 6209) x 100, e.g. (311/6209) x 100 = 5.0.
Density of relative frequency = relative frequency divided by the class interval (here 10), e.g. 5.0 / 10 = 0.5.
N.B.
When classes have the same width, we can use
• Absolute frequencies
• Relative frequencies
• Density of frequency

When classes have different widths, we use the density of frequency.
• Why? To avoid overestimating the area of classes with long intervals
=> the total area always stays equal to 100% (or 1).
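The point of the N.B. can be checked numerically. In this sketch the equal-width classes are those of the age table; the regrouping of the last four classes into one wide class [40-80[ is a hypothetical example added only to illustrate unequal widths, and the function names are illustrative.

```python
def densities(bounds, counts):
    """Density of relative frequency (% per unit of x) for classes [bounds[i], bounds[i+1][."""
    n = sum(counts)
    return [100 * c / n / (hi - lo)
            for c, lo, hi in zip(counts, bounds, bounds[1:])]

def total_area(bounds, counts):
    """Sum of (density x class width); should equal 100% for any grouping."""
    widths = [hi - lo for lo, hi in zip(bounds, bounds[1:])]
    return sum(d * w for d, w in zip(densities(bounds, counts), widths))

# Age classes from the table, all of width 10.
equal_bounds = [0, 10, 20, 30, 40, 50, 60, 70, 80]
equal_counts = [311, 120, 2255, 2090, 870, 399, 127, 37]

# Hypothetical regrouping: the last four classes merged into [40-80[ (width 40).
unequal_bounds = [0, 10, 20, 30, 40, 80]
unequal_counts = [311, 120, 2255, 2090, 870 + 399 + 127 + 37]

print(densities(equal_bounds, equal_counts))
print(densities(unequal_bounds, unequal_counts))
print(total_area(unequal_bounds, unequal_counts))
```

With densities as bar heights, the wide class [40-80[ gets a low bar (about 0.58% per year) instead of a tall one, and the total area stays at 100%.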
3.2. Graphs
• Frequency distributions can often be displayed effectively using graphs or diagrams.
• Diagrams give a very clear picture of data.
• The relationship between numbers of various magnitudes can usually be seen more quickly and easily from a graph than from a table.
• They are more attractive and facilitate comparison.
• However, a diagram should not be used when comparison is either not possible or not necessary.
• Diagrammatic representation is not an alternative to tabulation.
• It gives only an approximate idea, so diagrams are not suitable where greater accuracy is needed.
Histogram
• For quantitative continuous data.
• Put the observations in ascending order
• Choose a number of classes close to √N tot (N tot = total number of observations)
• Define classes [1-2] [3-4] or [1-2[ [2-3[...
• Calculate the frequency (absolute, relative, cumulative) or
the frequency density for each class
• Draw a rectangle for each class.
• The base of the rectangle = the interval of the class
• The height of each bar gives the frequency (or frequency density) in each interval.
• The area of the rectangle is proportional (not necessarily equal) to the number of observations in that class
• The total area represents 100% of all observations
[Figure: Histogram of age for a continuous variable (horizontal axis: Age, 0 to 80; vertical axes: frequency, 0 to 2500, and density of relative frequency, 0 to 3.5). Mean = 33.03, Std. Dev. = 12.348, N = 6209.]
Exercise no. 1
The table shows the age distribution of 1st year Law students in 2003.

Draw the histogram of absolute frequencies and relative frequencies (= percentages)
(one figure => 2 legends).
Exercise no. 2
In a class, the students' marks (out of 20) in an exam are:

9 15 15 7 11 12 14 10 11 8
8 11 11 14 8 10 11 11 10 11
7 15 12 6 14 9 15 8 8 14
15 10 11 13 11 11 15 12 15 10
11 9 8 13 9 8 13 14 15 15
10 10 7 15 15 7 14 9 3 10
15 10 15 8 15 8 14 9 6 13
12 11 9 9 13 14 8 13 8 5

Make a table of 10 classes of equal width (0-2; 2-4; 4-6; … 18-20) giving the absolute, relative and cumulative frequencies and the density of relative frequencies.
Exercise no. 3

Classes    Absolute   Relative   Cumulative   Density of
           freq       freq (%)   freq (%)     rel freq
[0-2[      0          0.00       0.00         0.00
[2-4[      1          1.25       1.25         0.63
[4-6[      1          1.25       2.50         0.63
[6-8[      6          7.50       10.00        3.75
[8-10[     19         23.75      33.75        11.88
[10-12[    21         26.25      60.00        13.13
[12-14[    10         12.50      72.50        6.25
[14-16[    22         27.50      100.00       13.75
[16-18[    0          0.00       100.00       0.00
[18-20[    0          0.00       100.00       0.00
Total      80         100                     50

Draw the histogram of the densities of relative frequencies.
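The four steps of building a grouped frequency distribution (choose classes, tally, count, tabulate) can be sketched in Python on the 80 exam marks. The function name `grouped_table` and its parameters are illustrative; classes are taken as half-open intervals [lo, hi[, so a mark of exactly 20 would need separate handling (none occurs in these data).

```python
# The 80 exam marks (out of 20), row by row as listed in the exercise.
marks = [
    9, 15, 15, 7, 11, 12, 14, 10, 11, 8,
    8, 11, 11, 14, 8, 10, 11, 11, 10, 11,
    7, 15, 12, 6, 14, 9, 15, 8, 8, 14,
    15, 10, 11, 13, 11, 11, 15, 12, 15, 10,
    11, 9, 8, 13, 9, 8, 13, 14, 15, 15,
    10, 10, 7, 15, 15, 7, 14, 9, 3, 10,
    15, 10, 15, 8, 15, 8, 14, 9, 6, 13,
    12, 11, 9, 9, 13, 14, 8, 13, 8, 5,
]

def grouped_table(values, width=2, upper=20):
    """Rows of (class, absolute, relative %, cumulative %, density of relative freq)."""
    n = len(values)
    rows, cumulative = [], 0.0
    for lo in range(0, upper, width):
        hi = lo + width
        count = sum(1 for v in values if lo <= v < hi)  # tally into [lo, hi[
        rel = 100 * count / n
        cumulative += rel
        rows.append((f"[{lo}-{hi}[", count, rel, cumulative, rel / width))
    return rows

rows = grouped_table(marks)
for label, count, rel, cum, dens in rows:
    print(f"{label:8} {count:3d} {rel:6.2f} {cum:7.2f} {dens:6.2f}")
```

Here the density is the relative frequency divided by the class width of 2, so the densities sum to 50 rather than 100.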
Frequency Polygon

• If we join the midpoints of the tops of adjacent rectangles of the histogram with line segments, a frequency polygon is obtained.
• When the polygon is continued to the X-axis just outside the range of the observed values, the total area under the polygon is equal to the total area under the histogram.
• Note that it is not essential to draw the histogram in order to obtain the frequency polygon.
Frequency polygon
Ogive or Cumulative Frequency Curve

• When the cumulative frequencies of a distribution are graphed, the resulting curve is called an ogive.
• To construct an ogive:
• i) Compute the cumulative frequencies of the distribution.
• ii) Prepare a graph with the cumulative frequency on the vertical axis and the true upper class limits (class boundaries) of the intervals scaled along the horizontal axis (X-axis).
• The true lower limit of the lowest class interval is also included in the X-axis scale; it is the true upper limit of the next lower interval, which has a cumulative frequency of 0.
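The points of an ogive can be computed before any plotting. The sketch below uses the age classes from the earlier frequency table; `ogive_points` and `cumulative_below` are hypothetical helper names, and the linear interpolation inside a class assumes that observations are spread evenly within it, which the slides do not state.

```python
from itertools import accumulate

def ogive_points(bounds, counts):
    """(class boundary, cumulative frequency) pairs, starting at the lowest boundary with 0."""
    return list(zip(bounds, [0, *accumulate(counts)]))

def cumulative_below(x, points):
    """Estimate the number of observations below x by reading the ogive (linear interpolation)."""
    for (x0, c0), (x1, c1) in zip(points, points[1:]):
        if x0 <= x <= x1:
            return c0 + (c1 - c0) * (x - x0) / (x1 - x0)
    raise ValueError("x lies outside the class boundaries")

# Age data from the frequency table (class boundaries and absolute frequencies).
bounds = [0, 10, 20, 30, 40, 50, 60, 70, 80]
counts = [311, 120, 2255, 2090, 870, 399, 127, 37]

points = ogive_points(bounds, counts)
print(points)
print(cumulative_below(25, points))  # estimated number of persons younger than 25
```

Plotting `points` with any charting tool and joining them with straight segments gives the ogive described above.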
The line diagram
• The line graph is especially useful when a variable is measured at each of many consecutive points in time.

• Time, in weeks, months or years, is marked along the horizontal axis, and the value of the quantity being studied is marked on the vertical axis.

• The distance of each plotted point above the baseline indicates its numerical value.

• The line graph is suitable for depicting the trend of a series over a long period.
The line diagram
2. Bar Chart
• Bar diagrams are used to display absolute or relative frequency distributions, or to compare the frequency distributions of categorical variables (ordinal or nominal).

• When we represent data using a bar diagram, all the bars must have equal width and the distance between bars must be equal.
A. Simple bar chart
• It is a one-dimensional diagram in which the
bar represents the whole of the magnitude.
The height or length of each bar indicates the
size (frequency) of the figure represented.
[Figure: Simple bar chart of marital status (Single, Married, Divorced, Widowed); vertical axis in %, 0 to 45.]
B. Multiple bar chart

• In this type of chart the component figures are shown as separate bars adjoining each other.
• The height of each bar represents the actual value of the component figure.
• It depicts the distributional pattern of more than one variable.
[Figure: Multiple bar chart of marital status (Single, Married, Divorced, Widowed) with separate bars for Male and Female; vertical axis in %, 0 to 50.]
C. Component (or sub-divided) Bar Diagram
• Bars are sub-divided into the component parts of the figure.
• These diagrams are constructed when each total is built up from two or more component figures.
Component bar diagram
4. Pie-chart
• Used to display the relative frequency distribution of qualitative or discrete quantitative data.
• It is a circle divided into sectors so that the areas of the sectors are proportional to the frequencies.
3.3. Summarizing data

• The data must be summarized as succinctly (concisely, briefly) as possible, since the number of sample points is frequently large and it is easy to lose track of the overall picture by looking at all the data at once.
3.3.1. Descriptive Statistics
• Quantities and techniques used to describe a sample characteristic or illustrate the sample data.

• For numeric variables, there are two commonly reported types of descriptive measures: location and dispersion.
1. Measures of Central Tendency (Location)
• The tendency of statistical data to concentrate at certain values is called the “central tendency”, and the various methods of determining the value at which the data tend to concentrate are called measures of central tendency, or averages.
• Common measures of location are:
(i) the Mean, which represents the arithmetic average of all measurements in the population;
(ii) the Median, which represents the point where half the measurements fall above it and half fall below it;
(iii) the Mode, which represents the value or class with the highest frequency in the sample / population.
a) The Mean

• Let x1, x2, x3, …, xn be the realised values of a random variable X, from a sample of size n. The sample arithmetic mean is defined as:
$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$$
Example
Example 1: The systolic blood pressures of seven patients were as follows:
151, 124, 132, 170, 146, 124 and 113.

The mean is
$$\bar{x} = \frac{151 + 124 + 132 + 170 + 146 + 124 + 113}{7} = 137.14$$

In short, $\bar{x} = \frac{\sum x}{n}$, where $\sum$ means "the sum of".
Example 2. Marks out of 20 for 20 students

15  7 12 10  8
11 14 10 11 11
15  6  9  8 14
16 13 11 12 10

Mean = 223/20 = 11.15
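Both mean examples can be verified with a direct translation of the formula. This is a minimal sketch; the function name `mean` is an illustrative choice and the data are copied from the two examples.

```python
def mean(values):
    """Sample arithmetic mean: the sum of the values divided by their number."""
    return sum(values) / len(values)

# Example 1: systolic blood pressure of seven patients.
blood_pressure = [151, 124, 132, 170, 146, 124, 113]

# Example 2: marks out of 20 for 20 students.
marks = [15, 7, 12, 10, 8,
         11, 14, 10, 11, 11,
         15, 6, 9, 8, 14,
         16, 13, 11, 12, 10]

print(round(mean(blood_pressure), 2))  # 960 / 7
print(mean(marks))                     # 223 / 20
```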
b) The Median and Mode
• If the sample data are arranged in increasing order, the median is
(i) the middle value if n is an odd number, or
(ii) midway between the two middle values if n is an even number.
• The mode is the most commonly occurring value.
2. Median:

Position of the median once the values are rearranged in order of magnitude (smallest first):
$$\frac{2n + 2}{4}$$
where n is the sample size.

Example: marks out of 20 for 5 students: 15 7 12 10 8

Position   1   2   3    4    5
Value      7   8   10   12   15

n = 5
Position of the median = 3
Value of the median = 10
N.B.: Median = 50th percentile = P50
Example 1 – n is odd
The reordered systolic blood pressure data seen earlier are:

113, 124, 124, 132, 146, 151 and 170.

The Median is the middle value of the ordered data, i.e. 132.

Two individuals have systolic blood pressure = 124 mm Hg, so the Mode is 124.
Example 2 – n is even
Six men with high cholesterol participated in a study to
investigate the effects of diet on cholesterol level. At the
beginning of the study, their cholesterol levels (mg/dL) were as
follows:
366, 327, 274, 292, 274 and 230.
Rearrange the data in numerical order as follows:

230, 274, 274, 292, 327 and 366.

The Median is halfway between the middle two readings, i.e.
(274 + 292) / 2 = 283.

Two men have the same cholesterol level, so the Mode is 274.
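The odd and even cases can be checked with Python's standard `statistics` module, which sorts the data itself. This is a sketch for verification only; the variable names are illustrative.

```python
import statistics

blood_pressure = [151, 124, 132, 170, 146, 124, 113]  # n = 7 (odd)
cholesterol = [366, 327, 274, 292, 274, 230]          # n = 6 (even)

# Odd n: the middle value of the ordered data.
bp_median = statistics.median(blood_pressure)
# Even n: midway between the two middle values.
chol_median = statistics.median(cholesterol)

# Most frequent value; multimode returns every value tied for the highest frequency.
bp_mode = statistics.mode(blood_pressure)
chol_modes = statistics.multimode(cholesterol)

print(bp_median, bp_mode)
print(chol_median, chol_modes)
```

`statistics.multimode` is convenient for bimodal or plurimodal data because it returns a list of all modes rather than a single value.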
1. Mode: the value or class with the highest frequency in the sample / population

Marks out of 20 of 20 students (MCQ):
15  7 12 10  8
11 14 10 11 11
15  6  9  8 14
16 13 11 12 10

Mode = 11
For a continuous variable: modal class
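Examples of unimodal distributions (one mode)

Rem.: Bimodal distribution: 2 modes
Plurimodal distribution: several modes

Symmetric and unimodal distribution
[Figure: symmetric curve; the mean and the median coincide]

Unimodal distribution with negative skewness
[Figure: curve with a long left tail; the mean lies to the left of the median]

Unimodal distribution with positive skewness
[Figure: curve with a long right tail; the median lies to the left of the mean]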
Skewness
• If extremely low or extremely high
observations are present in a distribution,
then the mean tends to shift towards those
scores.
• Based on the type of skewness, distributions
can be:
a) Negatively skewed distribution: occurs when
majority of scores are at the right end of the curve
and a few small scores are scattered at the left end.
b) Positively skewed distribution: Occurs when the
majority of scores are at the left end of the curve
and a few extreme large scores are scattered at the
right end.
c) Symmetrical distribution: It is neither positively nor negatively skewed. A curve is symmetrical if one half of the curve is the mirror image of the other half.
c) Geometric mean

• The geometric mean (GM) is a type of mean or average, which indicates the central tendency or typical value of a set of numbers by using the product of their values (as opposed to the arithmetic mean, which uses their sum).
• It is obtained by taking the nth root of the product of “n” values, i.e. if the values of the observations are denoted by $x_1, x_2, \dots, x_n$, then
$$GM = \sqrt[n]{x_1 \times x_2 \times \dots \times x_n}$$
GM
• For instance, the geometric mean of two numbers, say 2 and 8, is just the square root of their product; that is $\sqrt{2 \times 8} = 4$.
• As another example, the geometric mean of the three numbers 4, 1, and 1/32 is the cube root of their product (1/8), which is 1/2; that is $\sqrt[3]{4 \times 1 \times 1/32} = 1/2$.
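The two geometric mean examples can be reproduced as follows. This is a minimal sketch of the nth-root definition; the function name `geometric_mean` is illustrative, and the results are floating-point approximations, so they are compared with a tolerance rather than exactly.

```python
import math

def geometric_mean(values):
    """nth root of the product of n positive values."""
    return math.prod(values) ** (1 / len(values))

gm_two = geometric_mean([2, 8])          # square root of 16
gm_three = geometric_mean([4, 1, 1/32])  # cube root of 1/8

print(gm_two, gm_three)
print(math.isclose(gm_two, 4.0), math.isclose(gm_three, 0.5))
```

For long lists the product can overflow or underflow; averaging the logarithms and exponentiating the result is the usual workaround, and the standard library's `statistics.geometric_mean` can be used instead of this hand-written version.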
2. Measures of Dispersion

• Measures of dispersion characterise how spread out the distribution is, i.e. how variable the data are.
• Commonly used measures of dispersion include:
1. Range
2. Variance & Standard deviation
3. Coefficient of Variation (or relative standard deviation)
4. Inter-quartile range
1. Range
The difference between the maximum and the minimum value in the data set
Range = Max – Min
E.g. data: -4 -3 -1 1 3 5
Range = 5 – (-4) = 9
• easy to calculate;
• useful for “best” or “worst” case scenarios
• sensitive to extreme values
2. Variance
The variance is the mean of the squared deviations from the mean. It is always positive (or zero).

Population:
$$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$$
where N is the population size.

Sample:
$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$

E.g. data: 4 3 1 2 3 5
Mean: 18/6 = 3
Squared deviations from the mean: 1 0 4 1 0 4
Sum of squared deviations from the mean: 10
Variance: s² = 10/5 = 2
3. Standard Deviation
• The sample standard deviation, s, is the square root of the variance:
$$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$$
• s has the advantage of being in the same units as the original variable x.
Example
Data          Deviation     Deviation²
151           13.86         192.02
124           -13.14        172.73
132           -5.14         26.45
170           32.86         1079.59
146           8.86          78.45
124           -13.14        172.73
113           -24.14        582.88
Sum = 960.0   Sum = 0.00    Sum = 2304.86

$\bar{x} = 137.14$
Example (contd.)
$$\sum_{i=1}^{7} (x_i - \bar{x})^2 = 2304.86$$
Therefore,
$$s = \sqrt{\frac{2304.86}{7 - 1}} = 19.6$$
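The variance and standard deviation examples can be recomputed step by step. The sketch below follows the sample formulas (divisor n − 1); the function names are illustrative, and the standard library's `statistics.variance` is used only as a cross-check.

```python
import math
import statistics

def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)

def sample_sd(values):
    """Sample standard deviation: square root of the sample variance."""
    return math.sqrt(sample_variance(values))

small = [4, 3, 1, 2, 3, 5]                            # variance example
blood_pressure = [151, 124, 132, 170, 146, 124, 113]  # standard deviation example

print(sample_variance(small))               # 10 / 5
print(round(sample_sd(blood_pressure), 1))  # sqrt(2304.86 / 6)
print(math.isclose(sample_variance(blood_pressure),
                   statistics.variance(blood_pressure)))
```

For population data (divisor N), `statistics.pvariance` and `statistics.pstdev` are the corresponding library functions.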
4. Coefficient of Variation
• In some cases the variance of a variable changes with its mean.
• The coefficient of variation (CV), or relative standard deviation (RSD), is a measure of relative variability.
• It is the ratio of the data dispersion (standard deviation) to the mean and shows the extent of variability in relation to the mean:
$$CV = \frac{s}{\bar{x}} \times 100\%$$
• The CV is not affected by multiplicative changes in scale.
• Consequently, it is a useful way of comparing the dispersion of variables independently of the units in which the measurements were taken.

• Generally, small values of CV are considered best, since they mean that the variability in the measurements is small relative to their mean (the measurements are consistent in magnitude).
• I.e. the higher the CV, the greater the dispersion.
Example
The CV of the blood pressure data is:
$$CV = 100 \times \frac{19.6}{137.1}\,\% = 14.3\%$$
i.e., the standard deviation is 14.3% as large as the mean.
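The CV of the blood pressure data, and its invariance under a multiplicative change of scale, can be illustrated as follows. This is only a sketch: the function name `coefficient_of_variation` is illustrative and the factor 10 is an arbitrary, hypothetical change of units.

```python
import statistics

def coefficient_of_variation(values):
    """CV in percent: sample standard deviation divided by the mean, times 100."""
    return 100 * statistics.stdev(values) / statistics.mean(values)

blood_pressure = [151, 124, 132, 170, 146, 124, 113]
cv = coefficient_of_variation(blood_pressure)

# Hypothetical change of units: every measurement multiplied by the same factor.
rescaled = [10 * x for x in blood_pressure]
cv_rescaled = coefficient_of_variation(rescaled)

print(round(cv, 1), round(cv_rescaled, 1))
```

Both the standard deviation and the mean are multiplied by the same factor, so their ratio does not change. Adding a constant to every value, by contrast, would change the CV.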
5. Inter-quartile range
• The Median divides a distribution into two halves.

• The first and third quartiles (denoted Q1 and Q3) are defined as follows:
– 25% of the data lie below Q1 (and 75% lie above Q1),
– 25% of the data lie above Q3 (and 75% lie below Q3)

• The inter-quartile range (IQR) is the difference between the third and first quartiles, i.e.
IQR = Q3 - Q1
4. Quartiles and interquartile range
• The quartiles: the points below which 25%, 50% and 75% of the scores lie. Their positions in the ordered data (N = number of observations) are:

1st Quartile / P25: $\frac{N + 1}{4}$
2nd Quartile / P50 / median: $\frac{2N + 2}{4}$
3rd Quartile / P75: $\frac{3N + 3}{4}$
• Interquartile range: P75 – P25
Example
The ordered blood pressure data are:
113 124 124 132 146 151 170

Q1 = 124, Q2 = 132, Q3 = 151

Inter Quartile Range (IQR) is 151 - 124 = 27
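The position rule k(N + 1)/4 can be turned into a small function. In this sketch, `quartile` is a hypothetical helper; interpolating linearly between neighbouring values when the position is not a whole number is an assumption, since the slides only show a case where the positions are integers. The standard library call is included as a cross-check, because its default "exclusive" method uses the same (N + 1) positions.

```python
import statistics

def quartile(sorted_values, k):
    """k-th quartile (k = 1, 2, 3) at 1-based position k(N + 1)/4 in the ordered data."""
    n = len(sorted_values)
    position = k * (n + 1) / 4
    lower = int(position)          # whole part of the position
    fraction = position - lower    # fractional part, used for interpolation
    if lower < 1:
        return sorted_values[0]
    if lower >= n:
        return sorted_values[-1]
    below, above = sorted_values[lower - 1], sorted_values[lower]
    return below + fraction * (above - below)

blood_pressure = sorted([151, 124, 132, 170, 146, 124, 113])
q1, q2, q3 = (quartile(blood_pressure, k) for k in (1, 2, 3))
iqr = q3 - q1

# Cross-check with the standard library (default method="exclusive").
reference = statistics.quantiles(blood_pressure, n=4)

print(q1, q2, q3, iqr)
print(reference)
```

Other software may use different quartile conventions and give slightly different values on small samples.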
Exercise
In one class, the marks (out of 20) obtained in biostatistics by a sample of students are as follows:

9, 13, 14, 18, 20, 12, 14, 10, 11, 19

Calculate measures of central tendency (mean, mode, median) and dispersion (variance, standard deviation, CV, range, IQR).
Mean 14
Mode 14
Median 13.5

Variance 14.67
Std dev 3.83
Range 11
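The answers above can be checked, and the two measures not listed (CV and IQR) obtained, with Python's standard `statistics` module. This is a sketch: the quartiles use the library's default "exclusive" method, which places them at positions k(N + 1)/4 and interpolates between neighbours, an assumption where the slides give no rule for non-integer positions.

```python
import statistics

marks = [9, 13, 14, 18, 20, 12, 14, 10, 11, 19]

mean = statistics.mean(marks)
mode = statistics.mode(marks)
median = statistics.median(marks)

variance = statistics.variance(marks)  # sample variance, divisor n - 1
sd = statistics.stdev(marks)           # sample standard deviation
cv = 100 * sd / mean                   # coefficient of variation in %
data_range = max(marks) - min(marks)

q1, _, q3 = statistics.quantiles(marks, n=4)  # positions k(N + 1)/4
iqr = q3 - q1

print(mean, mode, median)
print(round(variance, 2), round(sd, 2), round(cv, 1), data_range, iqr)
```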
3.3.2. Box-plots
• A box-plot is a visual description of the distribution based on
– Minimum
– Q1
– Median
– Q3
– Maximum
• Useful for comparing large sets of data
Building a box plot
1. Calculate important values
• 1st quartile: Q1/4
• Median: Q1/2
• 3rd quartile: Q3/4
[Figure: box plot marked, from left to right, LSV, P25, Med, P75, HSV]
2. Calculate limit values
• Calculate the interquartile range (IQR): Q3/4 – Q1/4
• Low limit value: Q1/4 – 1.5 × IQR
• High limit value: Q3/4 + 1.5 × IQR
3. Look for subsequent values
• Low fence: the smallest observed value >= low limit value
• High fence: the largest observed value <= high limit value
4. Look for outliers (observed values below the low limit value or above the high limit value)
Example 1
The heights of 12 individuals, arranged in increasing order, are:
62, 64, 68, 70, 70, 74, 74, 76, 76, 78, 78, 80
Calculate Q1, Q3 and the median.

Example 1: Box-plot
[Figure: box plot of the 12 heights]
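The four steps for building a box plot can be sketched on the 12 heights. The function name `box_plot_summary` and the dictionary keys are illustrative. Quartiles are taken at positions k(N + 1)/4 with linear interpolation (the default "exclusive" method of `statistics.quantiles`), which is an assumption; a different quartile convention would shift the numbers slightly.

```python
import statistics

def box_plot_summary(values):
    """Quartiles, limit values, whisker ends and outliers for a box plot."""
    data = sorted(values)
    # Step 1: important values.
    q1, median, q3 = statistics.quantiles(data, n=4)
    # Step 2: limit values.
    iqr = q3 - q1
    low_limit = q1 - 1.5 * iqr
    high_limit = q3 + 1.5 * iqr
    # Step 3: observed values closest to, but still inside, the limits.
    inside = [x for x in data if low_limit <= x <= high_limit]
    # Step 4: outliers lie outside the limits.
    outliers = [x for x in data if x < low_limit or x > high_limit]
    return {
        "q1": q1, "median": median, "q3": q3, "iqr": iqr,
        "low_limit": low_limit, "high_limit": high_limit,
        "low_fence": inside[0], "high_fence": inside[-1],
        "outliers": outliers,
    }

heights = [62, 64, 68, 70, 70, 74, 74, 76, 76, 78, 78, 80]
summary = box_plot_summary(heights)
print(summary)
```

For these heights no value lies outside the limits, so the whiskers simply run from the minimum to the maximum.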
Remarks
• The box is always limited by Q1 and Q3
• But the whiskers can represent several things, according to different authors/programs:
– the minimum and the maximum
– the low and high subsequent values
– one standard deviation above and below the mean
http://en.wikipedia.org/wiki/Box_plot

• Importance of the box plot
– examination of the symmetry of the distribution
– visualisation of outliers
QUIZ 1/ 2 marks

Determine the type of data that we have
