Bio Statistics 3

This document provides an overview of descriptive statistics. It discusses measures of central tendency including the mean, median, and mode. It defines these terms and provides examples of calculating each measure. The document also covers measures of dispersion such as range, interquartile range, standard deviation, and variance. It defines these statistical concepts and illustrates how to compute them using example data sets. Overall, the document serves as an introductory guide to foundational descriptive statistics techniques for summarizing and analyzing sample data.

Uploaded by

Moos Light

Biostatistics

Lecture
Prepared by
Baneen Ahmed
DESCRIPTIVE STATISTICS:
Descriptive statistics are methods that summarize the properties of a
numerical variable. The resulting measures are calculated either as
sample statistics or as population parameters.
-A descriptive measure computed from the data of a sample is called
a statistic.
-A descriptive measure computed from the data of a population is
called a parameter.
Descriptive measures are divided into three groups:
1- Measures of central tendency (measures of location)
2- Measures of dispersion
3- Skewness and kurtosis

MEASURES OF CENTRAL TENDENCY:


Three commonly used measures are:
1-the arithmetic mean (also known simply as the mean or average),
2-the median,
3-the mode.

-Definition of the mean


The mean is a number obtained by adding all the values in a
population or sample and dividing by the number of values that are
added.
General Formula for the Mean:
$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$
where $x_1, \dots, x_n$ are the n values being added.
Properties of the Mean
1. Uniqueness. For a given set of data there is one and only one
arithmetic mean.
2. Since each and every value in a set of data enters into the
computation of the mean, it is affected by each value. Extreme values,
therefore, have an influence on the mean and, in some cases, can so
distort it that it becomes undesirable as a measure of central
tendency.
Example:
Suppose five physicians are surveyed to determine their charges
for a certain procedure. Assume that they report these charges:
$75, $75, $80, $80, and $280.
The mean charge for the five physicians is found to be $118,
a value that is not very representative of the set of data as a whole.
The single atypical value ($280) had the effect of inflating the mean.
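
As a quick check of the example above, the following Python sketch recomputes the mean of the five reported charges. The comparison with the median is an addition for illustration, and the variable names are assumptions.

```python
from statistics import mean, median

# Charges reported by the five physicians (dollars), from the example above
charges = [75, 75, 80, 80, 280]

mean_charge = mean(charges)      # (75 + 75 + 80 + 80 + 280) / 5
median_charge = median(charges)  # middle value of the sorted charges

print(mean_charge)    # 118
print(median_charge)  # 80
```

The single charge of $280 pulls the mean well above four of the five values, while the middle value stays at $80.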

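If the values occur in frequencies, then the mean can be
calculated using the following formula:

$\bar{x} = \frac{\sum f_i x_i}{\sum f_i}$

where $x_i$ is each distinct value and $f_i$ is its frequency.

Solving Steps:
First, arrange the data in ascending order.
Second, multiply each value by its frequency.
Third, substitute the sum of these products and the sum of the
frequencies into the mean formula.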
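
The steps for data given with frequencies can be sketched in Python as follows. The function name `frequency_mean` and the ages and counts are hypothetical, chosen only to illustrate the calculation of the sum of (frequency × value) divided by the sum of the frequencies.

```python
def frequency_mean(values, frequencies):
    """Mean of data given as distinct values with their frequencies."""
    total = sum(f * x for x, f in zip(values, frequencies))  # sum of f_i * x_i
    n = sum(frequencies)                                     # sum of f_i
    return total / n

# Hypothetical data: ages (years) and the number of patients at each age
ages = [20, 25, 30, 35]
counts = [2, 3, 4, 1]

print(frequency_mean(ages, counts))  # 27.0
```

Here the products are 40, 75, 120 and 35, which sum to 270, and the frequencies sum to 10.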
2. Median
An alternative measure of central location, perhaps second in
popularity to the arithmetic mean, is the median.
Suppose there are n observations in a sample. If these observations
are ordered from smallest to largest, then the median is defined as
follows:
Definition: The sample median is

(1) The ((n+1)/2)th largest observation if n is odd.

(2) The average of the (n/2)th and (n/2 + 1)th largest observations if n is even.

The rationale for these definitions is to ensure an equal number of
sample points on both sides of the sample median.

The median is defined differently when n is even and when n is odd
because it is impossible to achieve this goal with one uniform
definition. For samples with an odd sample size, there is a unique
central point; for example, for a sample of size 7, the fourth largest
point is the central point in the sense that 3 points are smaller than
it and 3 points are larger. For samples with an even size, there is no
unique central point and the middle 2 values must be averaged. Thus,
for a sample of size 8, the fourth and the fifth largest points would be
averaged to obtain the median, since neither is the central point.

Example: Compute the sample median for the birth weight data.
Solution: First arrange the sample in ascending order.
Since n = 20 is even,
Median = average of the 10th and 11th largest observations =
(3245 + 3248)/2 = 3246.5 g

Example: Consider the following data, which consist of white blood
counts taken on admission of all patients entering a small hospital on
a given day. Compute the median white blood count (×10^3).

Solution: First, order the sample as follows: 3, 5, 7, 8, 8, 9, 10, 12, 35.
Since n = 9 is odd, the sample median is given by the ((9+1)/2)th = 5th
largest point, which is equal to 8 (that is, 8,000).
The principal strength of the sample median is that it is insensitive to
very large or very small values.
In particular, if the patient with the highest count in the above data
had a white blood count of 65,000 rather than 35,000, the sample median
would remain unchanged, since the fifth largest value is still 8,000.
In contrast, the arithmetic mean would increase dramatically from
10,778 in the original sample to 14,111 in the new sample.
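
The comparison above can be reproduced with a short Python sketch. The counts are the ordered white blood counts from the example (in thousands); the second list, with 35 replaced by 65, and the variable names are introduced here for illustration.

```python
from statistics import mean, median

# White blood counts (x 10^3) from the example above
wbc = [3, 5, 7, 8, 8, 9, 10, 12, 35]
# Same sample with the largest count changed from 35 to 65
wbc_changed = [3, 5, 7, 8, 8, 9, 10, 12, 65]

print(median(wbc), median(wbc_changed))                  # 8 8
print(round(mean(wbc), 3), round(mean(wbc_changed), 3))  # 10.778 14.111
```

The median stays at 8 (8,000), while the mean moves from about 10.778 to about 14.111 (that is, from 10,778 to 14,111).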

The principal weakness of the sample median is that it is determined
mainly by the middle points in a sample and is less sensitive to the
actual numerical values of the remaining data points.

3. Mode:
The mode is the value of the observation that occurs with the greatest
frequency. A particular disadvantage is that, with a small number of
observations, there may be no mode. In addition, there may sometimes
be more than one mode, as in a bimodal (two-peak) distribution. The
mode is even less amenable (responsive) to mathematical treatment
than the median, and it is not often used with biological or medical
data.
Example: Find the modal values for the following data.
a) 22, 66, 69, 70, 73 (no modal value)
b) 1.8, 3.0, 3.3, 2.8, 2.9, 3.6, 3.0, 1.9, 3.2, 3.5 (modal value = 3.0 kg)
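
A minimal Python sketch of finding modal values is given below. The helper name `modal_values` is hypothetical, and it follows the convention used in part (a): when no value occurs more than once, the data are treated as having no mode. The sketch assumes a non-empty data set.

```python
from collections import Counter

def modal_values(data):
    """Return the values that occur most often, or [] if no value repeats."""
    counts = Counter(data)
    highest = max(counts.values())
    if highest == 1:          # every value occurs only once: no mode
        return []
    return [value for value, count in counts.items() if count == highest]

print(modal_values([22, 66, 69, 70, 73]))                                # []
print(modal_values([1.8, 3.0, 3.3, 2.8, 2.9, 3.6, 3.0, 1.9, 3.2, 3.5]))  # [3.0]
```

Because the function returns a list, a bimodal data set would yield two values.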
Skewness: If extremely low or extremely high observations are
present in a distribution, then the mean tends to shift towards those
scores. Based on the type of skewness, distributions can be:
a) Negatively skewed distribution: occurs when the majority of
scores are at the right end of the curve and a few small scores are
scattered at the left end.
b) Positively skewed distribution: occurs when the majority of
scores are at the left end of the curve and a few extremely large
scores are scattered at the right end.
c) Symmetrical distribution: it is neither positively nor negatively
skewed. A curve is symmetrical if one half of the curve is the mirror
image of the other half.
In unimodal (one-peak) symmetrical distributions, the mean, median
and mode are identical. On the other hand, in unimodal skewed
distributions, it is important to remember that the mean, median and
mode occur in alphabetical order (mean, median, mode) when the
longer tail is at the left of the distribution, or in reverse alphabetical
order (mode, median, mean) when the longer tail is at the right of the
distribution.
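
The ordering rule for skewed distributions can be illustrated with a small Python sketch. The sample below is hypothetical, constructed to have its longer tail at the right (positively skewed).

```python
from statistics import mean, median, mode

# Hypothetical positively skewed sample: most values are small, a few are large
sample = [1, 2, 2, 2, 3, 3, 4, 5, 9, 14]

print(mode(sample), median(sample), mean(sample))  # 2 3.0 4.5
```

With the longer tail at the right, the three measures appear in reverse alphabetical order from smallest to largest: mode, median, mean.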

Measures of dispersion (variation):

1. Range
The range is defined as the difference between the highest and
lowest observations in the data. It is the crudest measure of
dispersion. The range is a measure of absolute dispersion and as such
cannot be usefully employed for comparing the variability of two
distributions expressed in different units.

Range = x_max - x_min

where x_max = highest (maximum) value in the given distribution, and
x_min = lowest (minimum) value in the given distribution.
In our example given above ( the two data sets)
* The range of data in set 1 is 70-
* The range of data in set 2 is 53-
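
As a small illustration of the range formula, the Python sketch below uses the white blood count data (×10^3) from the median example rather than the two data sets mentioned here. The function name `data_range` and the modified sample are assumptions.

```python
def data_range(data):
    """Range = highest observation minus lowest observation."""
    return max(data) - min(data)

wbc = [3, 5, 7, 8, 8, 9, 10, 12, 35]           # white blood counts (x 10^3)
wbc_changed = [3, 5, 7, 8, 8, 9, 10, 12, 65]   # largest value changed to 65

print(data_range(wbc))          # 32
print(data_range(wbc_changed))  # 62
```

Changing a single extreme value almost doubles the range, which is why it is described as the crudest measure of dispersion.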

- The pth percentile is defined by:
(1) The (k+1)th largest sample point if np/100 is not an integer
(where k is the largest integer less than np/100).
(2) The average of the (np/100)th and (np/100 + 1)th largest
observations if np/100 is an integer.

The spread of a distribution can be characterized by specifying
several percentiles. For example, the 10th and 90th percentiles are
often used to characterize spread. Percentiles have the advantage
over the range of being less sensitive to outliers and of not being
much affected by the sample size (n).

Example: Compute the 10th and 90th percentiles for the birth weight
data.
Solution: Since 20×0.1=2 and 20×0.9=18 are integers, the 10th and
90th percentiles are defined by
10th percentile = the average of the 2nd and 3rd largest values =
(2581+2759)/2 = 2670 g
90th percentile = the average of the 18th and 19th largest values =
(3609+3649)/2 = 3629 g.
We would estimate that 80 percent of birth weights would fall
between 2670 g and 3629 g, which gives us an overall feel for the
spread of the distribution.
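
The two-case definition of the pth percentile can be sketched in Python as below. The function name `percentile` is hypothetical, p is assumed to be a whole number strictly between 0 and 100, and the 20-value sample is invented for illustration (it is not the birth weight data).

```python
def percentile(data, p):
    """pth percentile (whole-number p, 0 < p < 100) by the two-case definition."""
    ordered = sorted(data)
    n = len(ordered)
    k, remainder = divmod(n * p, 100)   # k is the whole part of np/100
    if remainder != 0:
        # np/100 is not an integer: take the (k+1)th ordered point
        return ordered[k]               # list index k is the (k+1)th point
    # np/100 is an integer: average the (np/100)th and (np/100 + 1)th points
    return (ordered[k - 1] + ordered[k]) / 2

sample = list(range(1, 21))             # hypothetical sample of size 20
print(percentile(sample, 10))           # 2.5  (average of 2nd and 3rd values)
print(percentile(sample, 90))           # 18.5 (average of 18th and 19th values)
```

With n = 20, both np/100 = 2 and np/100 = 18 are integers, so the averaging case applies, exactly as in the birth weight example.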
Other quantiles that are particularly useful are the quartiles of the
distribution. The quartiles divide the distribution into four equal
parts.
The second quartile is the median. The interquartile range is the
difference between the third and the first quartiles.
To compute it, we first sort the data in ascending order, then find
the data value that marks off the first quarter of the numbers
(first quartile) and the value that marks off the first three quarters
(third quartile). The interquartile range (IQR) is the distance
(difference) between these quartiles.

Example: Given the following data set (age of patients):-

find the interquartile range.

1. Sort the data from lowest to highest.
2. Find the bottom and the top quarters of the data.
3. Find the difference (interquartile range) between the two quartiles.

1st quartile = the {(n+1)/4}th observation = (2.25)th observation
= 21 + (23-21) × 0.25 = 21.5

3rd quartile = the {3/4 (n+1)}th observation = (6.75)th observation
= 32 + (42-32) × 0.75 = 39.5

Hence, IQR = 39.5 - 21.5 = 18
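
The quartile calculation above can be sketched in Python, assuming the (n+1)/4 and 3(n+1)/4 position rule with linear interpolation between neighbouring observations. The function names are hypothetical, and the ages are invented: they were chosen only so that the 2nd, 3rd, 6th and 7th ordered values (21, 23, 32, 42) agree with the worked example.

```python
def value_at_position(ordered, position):
    """Value at a fractional 1-based position, by linear interpolation."""
    lower = int(position)            # whole part, e.g. 2 for 2.25
    fraction = position - lower      # fractional part, e.g. 0.25
    if lower < 1:
        return ordered[0]
    if lower >= len(ordered):
        return ordered[-1]
    return ordered[lower - 1] + fraction * (ordered[lower] - ordered[lower - 1])

def interquartile_range(data):
    """Return (first quartile, third quartile, IQR)."""
    ordered = sorted(data)
    n = len(ordered)
    q1 = value_at_position(ordered, (n + 1) / 4)
    q3 = value_at_position(ordered, 3 * (n + 1) / 4)
    return q1, q3, q3 - q1

# Hypothetical ages (n = 8), not the lecture's data set
ages = [18, 21, 23, 24, 27, 32, 42, 59]
print(interquartile_range(ages))  # (21.5, 39.5, 18.0)
```

Other textbooks and software packages use slightly different position rules, so quartiles computed elsewhere may differ a little from these values.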

The interquartile range is a preferable measure to the range because
it is less prone to distortion by a single large or small value. That is,
outliers in the data have little effect on the interquartile range. Also,
it can be computed when the distribution has open-end classes.

-Standard Deviation and Variance:

Definition: The sample and population standard deviations, denoted
by S and σ (by convention) respectively, are defined as follows:

$S = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}} = \sqrt{\text{sample variance}}$ = sample standard deviation

$\sigma = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}}$ = population standard deviation

This measure of variation is widely used to show the scatter of
the individual measurements around the mean of all the
measurements in a given distribution.

Note that the sum of the deviations of the individual observations of a
sample about the sample mean is always 0.
The square of the standard deviation is called the variance. The
variance is a very useful measure of variability because it uses the
information provided by every observation in the sample and also it
is very easy to handle mathematically. Its main disadvantage is that
the units of variance are the square of the units of the original
observations.

Thus if the original observations were, for example, heights in cm,
then the units of variance of the heights are cm². The easiest way
around this difficulty is to use the square root of the variance (i.e.,
standard deviation) as a measure of variability.
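
These points can be checked with a short Python sketch, assuming the usual sample definition with n - 1 in the denominator. The heights are hypothetical values in cm, and the variable names are illustrative.

```python
from math import sqrt

heights = [160, 165, 170, 175, 180]   # hypothetical heights in cm
n = len(heights)

mean_height = sum(heights) / n                    # 170.0 cm
deviations = [x - mean_height for x in heights]   # -10, -5, 0, 5, 10
sum_of_deviations = sum(deviations)               # always 0 about the mean

# Sample variance: sum of squared deviations divided by n - 1 (units: cm^2)
sample_variance = sum(d ** 2 for d in deviations) / (n - 1)
# Standard deviation: square root of the variance (units: cm)
sample_sd = sqrt(sample_variance)

print(sum_of_deviations)      # 0.0
print(sample_variance)        # 62.5
print(round(sample_sd, 2))    # 7.91
```

The variance of 62.5 is in cm², while the standard deviation of about 7.91 is back in cm, the units of the original observations.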

Example: Areas of sprayable surfaces with DDT from a sample of 15
houses are as follows (m²):

Find the variance and standard deviation of the above distribution.

The mean of the sample is 125 m².

Variance (sample) = $s^2 = \frac{\sum (x_i - \bar{x})^2}{n-1}$

= $\frac{(x_1 - 125)^2 + (x_2 - 125)^2 + \dots + (x_{15} - 125)^2}{14}$

= 178.71 (square metres)²

Hence, the standard deviation = $\sqrt{178.71}$ = 13.37 m².

- The coefficient of variation:

The standard deviation is an absolute measure of deviation of
observations around their mean and is expressed in the same units
as the data. Because of this, the standard deviation cannot be used
directly to compare the variability of different data sets.
Therefore, it is useful to relate the arithmetic mean and SD together,
since, for example, a standard deviation of 10 would mean something
different conceptually if the arithmetic mean were 10 than if it were
1000. A special measure called the coefficient of variation is often
used for this purpose.
Definition: The coefficient of variation (CV) is defined by:

$CV = 100\% \times \frac{S}{\bar{x}}$

The coefficient of variation is most useful in comparing the
variability of several different samples, each with different means.
This is because a higher variability is usually expected when the
mean increases, and the CV is a measure that accounts for this
variability.
The coefficient of variation is also useful for comparing the
reproducibility of different variables. CV is a relative measure free
from units of measurement. CV remains the same regardless of what
units are used, because if the units are changed by a factor C, both
the mean and SD change by the factor C; the CV, which is the ratio
between them, remains unchanged.

Example: Compute the CV for the birth weight data when they are
expressed in either grams or ounces.

Solution: In grams, $\bar{x}$ = 3166.9 g and S = 445.3 g, so

CV = 100% × S/$\bar{x}$ = 100% × 445.3/3166.9 = 14.1%
If the data were expressed in ounces, $\bar{x}$ = 111.71 oz and S = 15.7 oz,

then CV = 100% × S/$\bar{x}$ = 100% × 15.7/111.71 = 14.1%
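
The calculation can be checked with a short Python sketch, assuming the usual definition CV = 100% × S / mean. The function name `coefficient_of_variation` is hypothetical; the means and standard deviations are the summary values quoted in the solution above.

```python
def coefficient_of_variation(mean, sd):
    """CV as a percentage: 100 * (standard deviation / mean)."""
    return 100 * sd / mean

cv_grams = coefficient_of_variation(3166.9, 445.3)    # birth weights in grams
cv_ounces = coefficient_of_variation(111.71, 15.7)    # same data in ounces

print(round(cv_grams, 1), round(cv_ounces, 1))  # 14.1 14.1
```

The two results agree to one decimal place; the small difference in later decimals comes only from rounding in the quoted ounce values.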

The third lecture has ended


I wish you all the best.
