BIO 2226 Lecture Notes 2018-1

Uploaded by

nuradeengidado
Copyright
© © All Rights Reserved

BIO 2226: Biostatistics (2 credits)

• Concept of Biostatistics
• Use of statistical methods in Biology and
Agriculture;
• Frequency distribution,
• Law of probability,
• Binomial, Poisson and normal
distributions;
• Estimations and test of hypothesis;
• Design of simple Agricultural and Biological
experiments;
• Analysis of variance and co-variance;
• Simple regression and correlation;
• Contingency Tables,
• Some non-parametric tests.


STATISTICS AND BIOSTATISTICS
The word statistics is derived from the Latin word
‘status’ (state), indicating the historical importance of
governmental data gathering / demographic
information.
• Statistics, then, comprises systematic ways of collecting,
organizing, summarizing and describing quantifiable
data, and methods of drawing inferences and
generalizing from them.
• The term Biostatistics / biometry is used when
the data being analysed with statistical
tools are derived from the biological
sciences: Medicine, Pharmacy, Biochemistry,
Microbiology, Agricultural Sciences and other
biology-related areas.
USE OF STATISTICS IN BIOLOGY,
AGRICULTURE AND MEDICINE.
• Variation is regarded as a fundamental feature of
the natural sciences of biology, agriculture and
medicine.
• Biostatistics helps to explain this inherent
variation in these fields of natural sciences.
• For example, variation may occur due to the age of
the population, or may occur among individuals
of a population due to diseases or their genetic
makeup.
• Experimental design is an important aspect of
biostatistics that describes how to collect,
organize, summarize and analyze data so
that valid and objective conclusions or
decisions about the population can be drawn.
• The study of biostatistics provides
answers to all of the above-mentioned items.

Testing Hypotheses
• Hypothesis testing and estimation are used to
reach conclusions about a population by
examining a sample of that population.
• Hypothesis testing is widely used in medicine,
dentistry, health care, biology and other fields as
a means to draw conclusions about the nature of
populations.

Hypothesis testing
• The purpose of hypothesis testing is to provide
information that helps in making decisions.
• The administrative decision usually depends on a
test between two hypotheses.
• Decisions are based on the
outcome.

Definitions
• Hypothesis: A hypothesis is a statement about one or
more populations. There are research hypotheses
and statistical hypotheses.
• Estimation is the entire process of using an estimator
to produce an estimate of a parameter.
• Estimation and hypothesis testing are interrelated.
• An estimate is any specific value of a statistic, while
an estimator is any statistic used to estimate a parameter.
• For example, the sample mean x̄ is used to estimate
the population mean μ.
Definitions
• Research hypothesis: A research hypothesis
is the supposition or conjecture that
motivates the research.
• It may be proposed after numerous repeated
observations.
• Research hypotheses lead directly to statistical
hypotheses.

Definitions
• Statistical hypotheses: Statistical hypotheses
are stated in such a way that they may be
evaluated by appropriate statistical
techniques.
• There are two statistical hypotheses
involved
in hypothesis testing.

Definitions
• Hypothesis testing is a procedure to support
one of two proposed hypotheses.
• H0 is the null hypothesis or the hypothesis of no
difference.
• H1 (otherwise known as HA) is the alternative
hypothesis. It is what we will believe is true if
we reject the null hypothesis.

• There are two types of statistical hypotheses for
each situation:
• the null hypothesis
• and the alternative / alternate hypothesis.
• 1-The null hypothesis, symbolized by Ho, is a
statistical hypothesis that states that there is no
difference between a parameter and a specific
value, or that there is no difference between
two parameters.
• It can be accepted or rejected as the case
may be.
• If the sample result conforms sufficiently closely to the
null hypothesis in a statistical sense, the null hypothesis
is accepted; if it doesn’t, it is rejected.
• If the sample results do not support the null
hypothesis, then the conclusion drawn on the
rejection of the null hypothesis is known as the
alternative hypothesis.
• 2-Alternative Hypothesis, symbolized by H1, is
the conclusion to be drawn on the rejection of
the null hypothesis, i.e. it states that there is a
difference between a parameter and a specific
value under study.

• In the process of hypothesis testing, the null hypothesis
is initially assumed to be true.
• Data are gathered and examined to determine whether the
evidence in favor of the alternative hypothesis is strong
enough to reject that assumption.
• In other words, the burden is placed on the researcher to
show, using sample information, that the null hypothesis is
false.
• If the sample information is sufficiently in favor of
the alternative hypothesis, then the null hypothesis is
rejected. This is the same as saying that if the prosecutor has
enough evidence of guilt, the “innocence” is rejected.
• Of course, erroneous conclusions are possible: Type I
and Type II errors.
• For example, suppose we want to test the
hypothesis that a population mean (μ) is equal
to 12. The null hypothesis is Ho: μ = 12.
• Where μ is the true value and 12 is the
assumed value.
• Therefore the three possible alternative
hypotheses are:
• 1. H1: μ ≠ 12 ⇒ two-tailed test
• 2. H1: μ > 12 ⇒ right-tailed test
• 3. H1: μ < 12 ⇒ left-tailed test

• A statistical test uses the data obtained from a
sample to make a decision about whether the
null hypothesis should be rejected.
• A one-tailed test is either right-tailed, when the
inequality sign is >, or left-tailed, when the
inequality sign is <.
• It indicates that the Ho should be rejected when
the test value is in the critical region on one
side of the mean.
• In a two-tailed test, the null hypothesis should
be rejected when the test value (numerical
value obtained from a statistical test) is in
either of the two critical regions.
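Rejection Regions or Critical Values Approach:
Level of significance = α
[Figure: three normal curves centred at 0, each showing the non-rejection region, the shaded rejection region, and the critical value that separates them]
• Two-tail test: H0: μ = 12, H1: μ ≠ 12. Rejection regions of area α/2 in each tail (-1.96 is marked as the lower critical value).
• Upper-tail test: H0: μ ≤ 12, H1: μ > 12. Rejection region of area α in the upper tail.
• Lower-tail test: H0: μ ≥ 12, H1: μ < 12. Rejection region of area α in the lower tail.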
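The critical values approach can be sketched in a few lines of Python. This is only an illustration, assuming the test statistic follows a standard normal distribution; the function names `critical_values` and `reject_h0` are made up for this sketch and are not part of the lecture notes.

```python
from statistics import NormalDist


def critical_values(alpha, tail):
    """Return (lower, upper) critical z-values; None means no boundary on that side."""
    z = NormalDist()  # standard normal: mean 0, standard deviation 1
    if tail == "two":
        c = z.inv_cdf(1 - alpha / 2)  # alpha is split equally between both tails
        return (-c, c)
    if tail == "upper":
        return (None, z.inv_cdf(1 - alpha))
    if tail == "lower":
        return (z.inv_cdf(alpha), None)
    raise ValueError("tail must be 'two', 'upper' or 'lower'")


def reject_h0(test_value, alpha, tail):
    """True when the test value falls in a rejection (critical) region."""
    lower, upper = critical_values(alpha, tail)
    below = lower is not None and test_value < lower
    above = upper is not None and test_value > upper
    return below or above


lo, hi = critical_values(0.05, "two")
print(round(lo, 2), round(hi, 2))  # -1.96 1.96
```

With α = 0.05 the two-tail boundaries come out at about ±1.96, while a one-tail test puts all of α on one side, so its single boundary is closer to zero (about 1.645 in absolute value).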
• THE LEVEL OF CONFIDENCE
• The level of probability associated with an
interval estimate is known as the confidence
level or degree of confidence or the
confidence coefficient.
• ‘Confidence’ is applied or used because the
probability is an indicator of the degree of
certainty that the particular method of
estimation will produce an estimate which
includes μ.
• The most frequently used confidence levels in
interval estimation are 90, 95 and 99 percent
as summarized below :
Probability   Confidence         z-value   Form of the interval estimate
level (α)     coefficient (%)

0.10          90                 1.64      x̄ - 1.64σx < μ < x̄ + 1.64σx

0.05          95                 1.96      x̄ - 1.96σx < μ < x̄ + 1.96σx

0.01          99                 2.58      x̄ - 2.58σx < μ < x̄ + 2.58σx

Where: x̄ = sample mean
μ = population mean
σx = standard error of the mean x̄
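As a rough check on the z-values in the table, the sketch below derives them from the standard normal distribution and builds the corresponding interval. The helper names and the numbers in the usage line (mean 50, standard deviation 10, n = 100) are hypothetical and chosen only for illustration.

```python
from math import sqrt
from statistics import NormalDist


def z_for_confidence(conf):
    """Two-sided z-value for a confidence coefficient such as 0.95."""
    alpha = 1 - conf
    return NormalDist().inv_cdf(1 - alpha / 2)


def z_interval(sample_mean, sd, n, conf):
    """Interval estimate: sample mean plus or minus z times the standard error."""
    se = sd / sqrt(n)  # standard error of the mean
    z = z_for_confidence(conf)
    return sample_mean - z * se, sample_mean + z * se


for conf in (0.90, 0.95, 0.99):
    print(conf, round(z_for_confidence(conf), 3))

# Hypothetical data, not from the notes: mean 50, sd 10, n = 100
print(z_interval(50, 10, 100, 0.95))
```

The computed values are about 1.645, 1.960 and 2.576, which agree with the table's 1.64, 1.96 and 2.58 once rounded to two decimals.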
• A significant difference is a degree of
difference between the sample mean (x̄) and the hypothesized
population mean (μHo) that leads to the rejection
of the null hypothesis.
• This is because, if the null hypothesis were true, such a
difference would have only a 10%, 5%, 1% or smaller chance of occurring.
• In a two-tailed test, once the Ho cannot be
accepted, it is concluded that the hypothesized
value and the true value are not the same.
• The critical region is the range of values of the
test value that indicates that there is a significant
difference and that the null hypothesis should be
rejected.
• This tells us that the degree of difference
between the two means cannot wholly be
explained by chance.
• Steps in Hypothesis Testing
• Every hypothesis testing situation begins with the
statement of a hypothesis.
• Determine the type of data, that is, whether the data are
continuous or discrete.
• State the hypotheses. Be sure to state both the null and
alternative hypotheses.
• Design the study. This step involves:
1 Selecting the correct statistical test
2 Choosing a level of significance
3 Formulating a plan to carry out
the study
• Conduct the study and collect the
data.
• Evaluate the data. Make the decision to reject or
not reject the null hypothesis.
• Summarize the results.
Steps to Hypothesis testing, continued
Make statistical decision:
• Do not reject H0: conclude H0 may be true
• Reject H0: conclude H1 is “true” (there is sufficient evidence of H1)
Then make the management/business/administrative decision

Purpose of hypothesis testing
• The purpose of hypothesis testing is to provide
information that helps in making decisions.
• The administrative decision usually depends on
the null hypothesis.
• If the null hypothesis is rejected, usually the
administrative decision will follow the alternative
hypothesis.

Purpose of hypothesis testing
• It is important to remember never to base
any decision solely on the outcome of only
one test.
• Statistical testing can be used to provide
additional support for decisions based on
other relevant information.

TYPE I AND TYPE II ERRORS
• In a hypothesis testing situation, there are four
possible outcomes:
• two possibilities of incorrect decisions,
• together with two possibilities for correct
decisions.
• A Type I error occurs if one rejects the null
hypothesis when it is true; it is the risk that a
true hypothesis will be rejected.
• A Type II error occurs if one does not reject the null
hypothesis when it is false, i.e. when a false
hypothesis is erroneously accepted as true.
TYPES OF ERRORS (TYPE I and II)
TYPE OF HYPOTHESIS    H0 ACCEPTED         H1 ACCEPTED

Ho True               Correct Decision    Type I Error

H1 True               Type II Error       Correct Decision

• Type I and Type II errors cannot happen at the same
time.
1. A Type I error can only occur if H0 is true.
2. A Type II error can only occur if H0 is false.
3. There is a tradeoff between Type I and Type II errors. If
the probability of a Type I error (α) is increased, then
the probability of a Type II error (β) declines.
4. When the difference between the hypothesized
parameter and the actual true value is small, the
probability of a Type II error (the non-rejection
region) is larger.
5. Increasing the sample size, n, for a given level of
α, reduces β.
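Points 3 to 5 can be illustrated numerically. The sketch below assumes an upper-tail z test of H0: μ = μ0 against H1: μ > μ0 with a known standard deviation; that setting, the function name `type_ii_error`, and the numbers used (σ = 10, a true shift of 2 units) are assumptions made for this illustration and do not come from the notes.

```python
from math import sqrt
from statistics import NormalDist


def type_ii_error(alpha, n, delta, sigma):
    """Beta for an upper-tail z test when the true mean is mu0 + delta (delta > 0)."""
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha)  # boundary of the rejection region
    # Probability that the test statistic stays in the non-rejection region
    return z.cdf(z_crit - delta * sqrt(n) / sigma)


base = type_ii_error(alpha=0.05, n=25, delta=2, sigma=10)
print("baseline beta:     ", round(base, 3))
print("smaller alpha:     ", round(type_ii_error(0.01, 25, 2, 10), 3))
print("larger sample:     ", round(type_ii_error(0.05, 100, 2, 10), 3))
print("smaller difference:", round(type_ii_error(0.05, 25, 1, 10), 3))
```

Compared with the baseline, lowering α raises β, quadrupling n lowers β, and halving the true difference raises β, which matches the three statements in the list.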
Risk management
• Since rejecting a null hypothesis carries a chance of
committing a Type I error, we make α small by selecting
an appropriate confidence level.
• Generally, we do not control β, even though it
generally is greater than α. However, when failing to
reject a null hypothesis, the risk of error is unknown.

• TESTING A HYPOTHESIS INVOLVING A MEAN
• 1. Find the sample mean (x̄) and standard deviation (S).
• 2. State the null hypothesis: Ho : x̄ = μ, i.e. the sample
comes from a population with mean μ.
• 3. Use the sample standard deviation (S) to estimate the
population standard deviation (σ) and compute σx
= S/√n
• 4. Find the range μ ± z σx
• 5. Check whether the value of x̄ falls within the range
or not
• 6. If it does, then accept the Ho, and if it doesn’t,
then reject the Ho.
• EXAMPLES:
Example 1. Assume the following values of a
sample are given: x̄ = 454, n = 120, standard
deviation (S) = 27, μ = 460, α = 0.05 (i.e. a 95%
confidence coefficient). Test the hypothesis and
make a decision.
• Solution:
• State the null hypothesis.
• Ho: x̄ = μ (That is, there is no significant difference
between the sample mean and the population
mean).
• σx = 27/√120 = 2.46
• Then calculate the range: μ - 1.96 x 2.46 TO μ
+ 1.96 x 2.46
• 460 - 1.96 x 2.46 TO 460 + 1.96 x 2.46
• The range is about 455.2 TO 464.8, i.e. roughly 455 TO 465
• So the sample mean, x̄ = 454, does not fall
within the range of 455 to 465.
• Therefore, the null hypothesis is rejected.
• This implies that the sample mean x̄ and the
population mean μ are significantly different.
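The six-step procedure and Example 1 can be reproduced with a short Python sketch. The function name `z_range_test` is hypothetical; the numbers are the ones given in Example 1, and z = 1.96 corresponds to α = 0.05.

```python
from math import sqrt


def z_range_test(sample_mean, mu, s, n, z=1.96):
    """Check whether the sample mean lies inside mu +/- z * standard error."""
    se = s / sqrt(n)  # step 3: estimated standard error of the mean
    lower, upper = mu - z * se, mu + z * se  # step 4: the range
    accept = lower <= sample_mean <= upper  # steps 5 and 6
    return se, lower, upper, accept


se, lower, upper, accept = z_range_test(sample_mean=454, mu=460, s=27, n=120)
print(round(se, 2), round(lower, 2), round(upper, 2), accept)
# 2.46 455.17 464.83 False
```

Since 454 lies below the lower limit, the function returns `False` for acceptance, in line with the decision to reject the null hypothesis.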
• Example 2. The mean yield of tomato following
fertilizer treatment from 10 plots was 176.10 kg
with standard deviation 3.88. Estimate the 95%
confidence limits for the mean yield of tomato.
• Solution:
• x̄ = 176.10 kg, α = 0.05, n = 10, S = 3.88
• Since the sample size (n = 10) is small, i.e. less
than 30, the t-distribution is used instead.
• x̄ ± tn-1 (S / √n)
• tn-1 = t10-1 = t9; check the value on the t-distribution
table that corresponds to 9 degrees of
freedom at the 0.05 level (two-tailed). This value is 2.262.
• Then S / √n = 3.88 / √10 = 1.23
• To fix the limits: L1 = x̄ - tn-1 (S / √n) = 176.10 –
2.262 (1.23)
• L1 = 173.318
• L2 = x̄ + tn-1 (S / √n) = 176.10 + 2.262 (1.23)
• L2 = 178.882
• At 95% confidence, the true population mean (μ)
lies between the limits 173.318 and 178.882.
• The sample mean yield of 176.10 kg is the centre of this
interval and serves as the point estimate of μ; it is not
necessarily the true mean itself.
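The confidence limits in Example 2 can be checked with the sketch below. The function name `t_interval` is hypothetical, and the critical value 2.262 is passed in by hand because it is read from a t-table (9 degrees of freedom, 0.05 level) as in the example, rather than computed.

```python
from math import sqrt


def t_interval(sample_mean, s, n, t_crit):
    """Confidence limits: sample mean +/- t * (s / sqrt(n))."""
    se = s / sqrt(n)  # standard error of the mean
    margin = t_crit * se
    return sample_mean - margin, sample_mean + margin


# t_crit = 2.262 is the table value quoted in Example 2
lower, upper = t_interval(sample_mean=176.10, s=3.88, n=10, t_crit=2.262)
print(round(lower, 2), round(upper, 2))
```

The limits come out at roughly 173.32 and 178.88. They differ slightly in the third decimal from 173.318 and 178.882 because the worked example rounds the standard error to 1.23 before multiplying.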
• TESTING A HYPOTHESIS INVOLVING TWO MEANS
Example 1: A sample of 158 obese girls of average
age 15 was analysed with respect to various
physical characteristics during early childhood. A
control group of 94 non-obese girls of similar age
and socioeconomic background was also
analysed.
• The following table gives the sample means and
standard deviations for two characteristics of the
two groups.

                   Obese group              Non-obese group

Birth weight       x̄1 = 7.04, S1 = 1.2      x̄2 = 7.19, S2 = 0.9

One year weight    x̄1 = 23.3, S1 = 2.8      x̄2 = 21.9, S2 = 3.0

(a) Use the data to test the null hypothesis Ho : μ1 = μ2, at α = 0.05
for
i. Birth weight
ii. One year weight
(b) What conclusions can be drawn from the results in (a) i and ii?
• SOLUTION:
• First we state the null hypothesis:
• There is no significant difference between obese
and non-obese girls at birth weight and one year
weight in the two populations.
• The Z formula for testing two means is given by:
$$ Z = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{S_1^2}{n_1} + \dfrac{S_2^2}{n_2}}} $$
• Where: x̄1 – mean value of the first group
• x̄2 - mean value of the second group
• S1 – standard deviation of the first group
• S2 - standard deviation of the second group
• n1 – size of the first sample (=158)
• n2 – size of the second sample (=94)
• Therefore,
• (a) i. Birth weight:
$$ Z = \frac{7.04 - 7.19}{\sqrt{\dfrac{1.2^2}{158} + \dfrac{0.9^2}{94}}} = \frac{-0.15}{0.133} = -1.13 $$
• (a) ii. One year weight:
$$ Z = \frac{23.3 - 21.9}{\sqrt{\dfrac{2.8^2}{158} + \dfrac{3.0^2}{94}}} = \frac{1.4}{\sqrt{0.145}} = 3.67 $$
• Decision:
• With α = .05, the critical values of z are -1.96 and
+1.96. We reject H0 if z < -1.96 or z > +1.96.
• 1. The Z-value (-1.13) for birth weight is within
the non-critical region (i.e. within the range of -1.96 to
+1.96).
• We therefore accept the null hypothesis. This signifies
that the difference between obese and non-obese girls
at birth weight is not statistically significant.
• 2. The z value (3.67) for one year weight falls
within the critical region (i.e. outside the range
of -1.96 to +1.96), since z > +1.96.
• Therefore, we reject the null hypothesis
because the difference is statistically
significant.
• This signifies that there was a difference
between obese and non-obese girls at
one year weight.

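Both Z values can be recomputed with a small Python sketch of the two-means formula. The function name `two_sample_z` is hypothetical; the inputs are the sample statistics from the obese and non-obese groups, and 1.96 is the two-tailed critical value at α = 0.05.

```python
from math import sqrt


def two_sample_z(x1, s1, n1, x2, s2, n2):
    """Z statistic for the difference between two sample means."""
    se = sqrt(s1**2 / n1 + s2**2 / n2)  # standard error of the difference
    return (x1 - x2) / se


z_birth = two_sample_z(7.04, 1.2, 158, 7.19, 0.9, 94)
z_year = two_sample_z(23.3, 2.8, 158, 21.9, 3.0, 94)

for label, z in (("Birth weight", z_birth), ("One year weight", z_year)):
    decision = "reject H0" if abs(z) > 1.96 else "do not reject H0"
    print(label, round(z, 2), decision)
# Birth weight -1.13 do not reject H0
# One year weight 3.67 reject H0
```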
Correlation
•A correlation is a relationship between two variables.
•The data can be represented by the ordered pairs (x, y)
where x is the independent (or explanatory) variable,
and y is the dependent (or response) variable.
•A scatter plot can be used to
determine whether a linear
(straight line) correlation exists
between two variables.
Example:
x    1    2    3    4    5
y   –4   –2   –1    0    2
[Figure: scatter plot of the example data, y against x]
Correlation

• Measures the relative strength of the linear relationship
between two variables
• Unit-less
• Ranges between –1 and 1
• The closer to –1, the stronger the negative linear
relationship
• The closer to 1, the stronger the positive linear
relationship
• The closer to 0, the weaker the linear
relationship
Linear Correlation
[Figure: four scatter plots of y against x]
• Negative Linear Correlation: as x increases, y tends to decrease.
• Positive Linear Correlation: as x increases, y tends to increase.
• No Correlation
• Nonlinear Correlation
Scatter Plots of Data with Various
Correlation Coefficients
[Figure: six scatter plots of Y against X with r = -1, r = -.6, r = 0, r = +1, r = +.3 and r = 0]
Slide from: Statistics for Managers Using Microsoft® Excel 4th Edition, 2004 Prentice-Hall
Linear Correlation
[Figure: scatter plots of Y against X contrasting linear relationships with curvilinear relationships]
Linear Correlation
[Figure: scatter plots of Y against X contrasting strong relationships with weak relationships]
Linear Correlation
[Figure: scatter plot of Y against X showing no relationship]
Correlation Coefficient
•The correlation coefficient is a measure of the strength and the
direction of a linear relationship between two variables.
•The symbol r represents the sample correlation coefficient. The
formula for r is
$$ r = \frac{n\sum xy - \left(\sum x\right)\left(\sum y\right)}{\sqrt{n\sum x^2 - \left(\sum x\right)^2}\,\sqrt{n\sum y^2 - \left(\sum y\right)^2}} $$
•The range of the correlation coefficient is –1 to 1. If x and y have
a strong positive linear correlation, r is close to 1.
•If x and y have a strong negative linear correlation, r is close to
–1.
• If there is no linear correlation or a weak linear correlation, r is
close to 0.
Linear Correlation
[Figure: four scatter plots of y against x]
• Strong positive correlation: r = 0.88
• Strong negative correlation: r = –0.91
• Weak positive correlation: r = 0.42
• Nonlinear correlation: r = 0.07
Calculating a Correlation Coefficient
In Words                                            In Symbols

1. Find the sum of the x-values.                    Σx
2. Find the sum of the y-values.                    Σy
3. Multiply each x-value by its corresponding
   y-value and find the sum.                        Σxy
4. Square each x-value and find the sum.            Σx²
5. Square each y-value and find the sum.            Σy²
6. Use these five sums to calculate
   the correlation coefficient:
$$ r = \frac{n\sum xy - \left(\sum x\right)\left(\sum y\right)}{\sqrt{n\sum x^2 - \left(\sum x\right)^2}\,\sqrt{n\sum y^2 - \left(\sum y\right)^2}} $$

Continued.
Correlation Coefficient
Example:
Calculate the correlation coefficient r for the following data.

x y
1 –3
2 –1
3 0
4 1
5 2

Correlation Coefficient
Example:
Calculate the correlation coefficient r for the following data.

x     y     xy    x²    y²
1    –3     –3     1     9
2    –1     –2     4     1
3     0      0     9     0
4     1      4    16     1
5     2     10    25     4
Σx = 15   Σy = –1   Σxy = 9   Σx² = 55   Σy² = 15

$$ r = \frac{n\sum xy - \left(\sum x\right)\left(\sum y\right)}{\sqrt{n\sum x^2 - \left(\sum x\right)^2}\,\sqrt{n\sum y^2 - \left(\sum y\right)^2}} = \frac{5(9) - (15)(-1)}{\sqrt{5(55) - 15^2}\,\sqrt{5(15) - (-1)^2}} = \frac{60}{\sqrt{50}\,\sqrt{74}} \approx 0.986 $$

There is a strong positive linear correlation between x
and y.
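The five-sums calculation can be written as a short Python function. This is a sketch rather than a library routine; the name `pearson_r` is hypothetical and the data are the five (x, y) pairs from the example.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Correlation coefficient computed from the five sums."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt(n * sum_x2 - sum_x**2) * sqrt(n * sum_y2 - sum_y**2)
    return numerator / denominator


r = pearson_r([1, 2, 3, 4, 5], [-3, -1, 0, 1, 2])
print(round(r, 3))  # 0.986
```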
Correlation Coefficient
Example:
The following data represents the number of hours 12 different students watched
television during the weekend and the scores of each student who took a test the
following Monday.
a.) Display the scatter plot.
b.) Calculate the correlation coefficient r.

Hours, x 0 1 2 3 3 5 5 5 6 7 7 10
Test score, y 96 85 82 74 95 68 76 84 58 65 75 50

Continued.
Correlation Coefficient
Example continued:

[Figure: scatter plot of test score (y, axis from 20 to 100) against hours watching TV (x, axis from 2 to 10)]
Continued.
Correlation Coefficient
Example continued:

Hours, x         0     1     2     3     3     5     5     5     6     7     7    10
Test score, y   96    85    82    74    95    68    76    84    58    65    75    50
xy               0    85   164   222   285   340   380   420   348   455   525   500
x²               0     1     4     9     9    25    25    25    36    49    49   100
y²            9216  7225  6724  5476  9025  4624  5776  7056  3364  4225  5625  2500

Σx = 54   Σy = 908   Σxy = 3724   Σx² = 332   Σy² = 70836

$$ r = \frac{n\sum xy - \left(\sum x\right)\left(\sum y\right)}{\sqrt{n\sum x^2 - \left(\sum x\right)^2}\,\sqrt{n\sum y^2 - \left(\sum y\right)^2}} = \frac{12(3724) - (54)(908)}{\sqrt{12(332) - 54^2}\,\sqrt{12(70836) - 908^2}} \approx -0.831 $$

There is a strong negative linear correlation.

As the number of hours spent watching TV increases, the test
scores tend to decrease.
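As a cross-check on the value of r for the television data, the sketch below uses the algebraically equivalent deviation form, r = Σ(x – x̄)(y – ȳ) / √(Σ(x – x̄)² Σ(y – ȳ)²). That form is not given in these notes and is brought in here only for verification; the function name is hypothetical.

```python
from math import sqrt

hours = [0, 1, 2, 3, 3, 5, 5, 5, 6, 7, 7, 10]
scores = [96, 85, 82, 74, 95, 68, 76, 84, 58, 65, 75, 50]


def pearson_r_deviations(xs, ys):
    """Correlation coefficient from deviations about the two means."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


r = pearson_r_deviations(hours, scores)
print(round(r, 3))  # -0.831
```

Both forms give the same negative value, confirming that test scores tend to fall as viewing hours rise.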
THANK YOU
