
RESEARCH METHODOLOGY

AND BIOSTATISTICS
PST 426

REV. SISTER (DR) HENRIETTA O. FAWOLE, OSF


Contents
• Summarization and presentation of data

• Probability

• Normal distribution

• Sampling methods

• Tests of hypothesis

• Measurement of health
Summarization and
presentation of data
Reduction, summarization and
presentation of data
• Objects of measurement are referred to as cases or observations.

• Quantities or objects being measured in a study are called variables, because their values vary from case to case.

• The result of measuring a variable for any given case or observation is referred to as its value, while several observations are collectively known as data.

• The aggregate of persons, objects or events is referred to as a population, although in reality the whole population usually cannot be studied or observed.

• Therefore a subgroup of the population, called a sample, is studied.

Reduction, summarization and
presentation of data (contd)

• To ensure that the sample studied is representative of the
whole population, a random sample is usually taken, in
which all population members have an equal likelihood of
being selected for inclusion in the study.
Reduction, summarization and
presentation of data (contd)
Data (types of variables)

• Qualitative variables are categorical and have non-numeric outcomes, with no natural ordering, e.g., gender, disease status, type of car, hair colour. They may be nominal or ordinal.

• Quantitative (metric) variables have numeric outcomes such as height, age, number of children etc.
  – Continuous variables can take any value, frequently within a given range which could be from zero to infinity, e.g., weight, length. They often result from measurement and have units of measurement.
  – Discrete variables have fixed values, e.g., shoe size or number of people.
Reduction, summarization and
presentation of data (contd)
• Categorical data (variables) – can be nominal, ordinal

❖ Nominal – describes data with a finite set of possible values and no particular
order. Examples include blood groups (O, A, B, AB), eye colour, marital status,
gender etc.

❖ Ordinal – describes data that are ranked and ordered. Examples include size of
container (small, medium, large); fatigue level (none, mild, moderate, severe). Also
Likert scale/responses may have the options ‘strongly agree’, ‘agree’, ‘neither agree
nor disagree’, ‘disagree’ and ‘strongly disagree’ – ordered scale

*Note that the differences between adjacent ranks in ordered data need not be the
same – we cannot quantify these differences.
Reduction, summarization and
presentation of data (contd)
Easiest way to tell whether data are metric or not:

Has the data got units? (this includes numbers of things)
• NO – Can the data be put in meaningful order?
  – NO: categorical, nominal
  – YES: categorical, ordinal
• YES – Do the data come from counting or measuring?
  – Counting: discrete metric
  – Measuring: continuous metric

An algorithm to help you identify data type.

Adapted from Bowers (2014). Medical Statistics from Scratch: An Introduction for Health
Professionals
Reduction, summarization and
presentation of data (contd)
• A set of data can be difficult to interpret because
it contains a lot of information.

• Hence data should be summarised in meaningful


ways.

• The use of graphs and summary statistics should


be the first step in understanding data and prior
to any statistical analysis.
Need for data summarization
• Multiple population groups
• Multiple risk factors
• Multiple outcomes
➢All are often simultaneously of interest
• The process of summarizing data within and
across many domains and population groups
can seem daunting unless a well defined
analysis plan is articulated and implemented.
• In order for data summarization to be effective,
the analysis plan should be framed according to
each of the following:

– Purpose of the analysis


– Audience for the analysis
– Data availability and
– Data quality
Variables, methods and
presentation

• The opportunity for data summarization occurs


in three successive phases:

• Phase I: Selection of variables. This phase


includes the selection and definition of the
primary indicator or indicators that will be the
central focus of the analysis.
• Phase II: Selection of analytic methods. This
phase involves decisions about how the
indicators and other variables selected will be
examined.
• For example, some indicators might be presented as
counts and some as rates; some might be presented
as overall averages while others might be stratified
by person, place, time, or levels of risk; some
indicators might be combined into a composite
index; some indicators might be presented in their
original form while others might be transformed
into categories, ranks, or scores.

• In addition, comparisons might be made intuitively,


or formal statistical testing might be conducted.
• Phase III: Selection of presentation format. This
phase involves designing a report that effectively
communicates the results of the analysis.

• Written narrative, tables, charts, graphs, and


maps are each effective formats depending on
the type of data being presented.
• In the variable selection phase (Phase I), data
summarization choices are made to restrict the
actual amount of data to be analysed and
reported.

• Data summarization choices are made to both


restrict the amount of data and to increase its
interpretability.
Frequency
• Frequency table. A frequency table shows a
tally of the number of data observations in
different categories.
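As a minimal sketch (hypothetical hair-colour observations, Python standard library only), a frequency table can be tallied with `collections.Counter`:

```python
from collections import Counter

# Hypothetical hair-colour observations for a small group of children (nominal data)
hair = ["brown", "dark", "brown", "blond", "red", "brown", "dark"]

# Tally the number of observations in each category
freq = Counter(hair)

# Print the categories from most to least frequent
for colour, count in freq.most_common():
    print(colour, count)
```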
Examples of frequency tables

Hair colour (nominal data), n = 95 children:

Hair colour        Frequency (number of children)
Brown              49
Dark               27
Blond              15
Red                4

Satisfaction with physiotherapy care (ordinal data), n = 475 patients:

Satisfaction        Frequency (number of patients)
Very satisfied      121
Satisfied           161
Neutral             90
Dissatisfied        51
Very dissatisfied   52
Example of frequency tables

Parity (discrete metric data), n = 100 mothers:

Parity (number of pregnancies)   Frequency (number of mothers)
0                                49
1                                18
2                                17
3                                11
4                                2
5                                1
6                                1
7                                0
8                                0
9                                0
10                               1

Birthweight (continuous metric data), n = 100 mothers:

Birthweight (g)   Frequency (number of mothers)
1500-1999         2
2000-2499         5
2500-2999         27
3000-3499         28
3500-3999         27
4000-4499         9
4500-4999         2
Example of frequency tables

Parity (number of pregnancies), n = 100 mothers:

Parity   Frequency (number of mothers)
0        49
1        18
2        17
3        11
4        2
5        1
≥6       2

Open-ended groups (here, "≥6") collect the sparse upper tail into a single class.
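Grouping a continuous variable into classes like those above can be sketched as follows (the birthweights and the 500 g class width are hypothetical, and the function name `bin_label` is illustrative):

```python
from collections import Counter

# Hypothetical birthweights in grams
weights = [1750, 2300, 2650, 2800, 3100, 3350, 3600, 4100, 4600, 3050]

def bin_label(w, width=500, start=1500):
    """Return the 500 g class that a birthweight falls into, e.g. '3000-3499'."""
    lo = start + ((w - start) // width) * width
    return f"{lo}-{lo + width - 1}"

# Grouped frequency table: class -> number of mothers
table = Counter(bin_label(w) for w in weights)
```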
Histogram
• Once the frequency table or distribution of
raw data has been tabulated, a variety of
techniques is available for graphical
presentation of given set of measurements.
• For most continuous data sets, the best
diagram to use is the histogram.
• Histograms have bars that are drawn to touch
each other – thus reflecting the continuity of
the data that make up the bars.
Histogram
Bar charts
• When the data are discrete and the
frequencies refer to individual values, they are
displayed graphically using a bar chart (bar
graph).
• A bar chart involves plotting the frequency of
each category and drawing a bar for each, with
the heights of the bars representing the
frequencies.
• Bar charts are drawn with a gap between
neighbouring bars so that they are easily
distinguished from histograms.
Frequency polygons
• For a frequency polygon, only the tops of the
bars are marked, and these points are then
joined by straight lines.
• Frequency polygons are particularly useful for
comparing two or more sets of data.
Frequency polygon
Summary measures and
measures of location
• In addition to the graphical techniques, it is often
useful to obtain quantitative summaries of
certain aspects of the data.
• Most simple summary measures can be
divided into two types: first, quantities which
are “typical” of the data; and second, quantities
which summarise the variability of the data.
• The former are known as measures of location
and the latter as measures of spread.
Sample mean:

• This is the most important and widely used


measure of location.
• This is the location measure often used when
talking about the average of a set of
observations.
Sample median:

• The sample median is the middle observation


when the data are ranked in increasing order.
• If there are an even number of observations,
there is no middle number, and so the median is
defined to be the sample mean of the middle two
observations.
• The sample median is sometimes used in
preference to the sample mean, particularly
when the data are asymmetric or contain outliers.
Sample mode:

• The mode is the value which occurs with the


greatest frequency.
• Consequently, it only really makes sense to
calculate or use it with discrete data, or for
continuous data with small grouping intervals
and large sample sizes.
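The three measures of location can be computed directly with Python's standard `statistics` module (the data values here are hypothetical):

```python
import statistics as st

# Hypothetical observations (even number, so the median averages the middle two)
data = [2, 3, 3, 5, 7, 8, 9, 10]

mean = st.mean(data)      # arithmetic average: sum of values / number of values
median = st.median(data)  # middle value after ranking; here the mean of 5 and 7
mode = st.mode(data)      # value occurring with the greatest frequency
```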
Measure of spread

• Knowing the “typical value” of the data alone


is not enough.
• It is also necessary to know how
“concentrated” or “spread out” it is.
• That is, “variability” of the data.
• Measures of spread quantify this idea
numerically.
Range:

• This is the difference between the largest and smallest


observation.
• This measure can sometimes be useful for comparing
the variability of samples of the same size, but it is not
very robust, and is affected by sample size (the larger
the sample, the bigger the range)
• It is not a fixed characteristic of the population, and
cannot be used to compare variability of different sized
samples.
Quartiles and the interquartile range

• Note that the median has half of the data less


than it,
• the lower quartile has a quarter of the data
less than it, and
• the upper quartile has a quarter of the data
above it.
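A sketch of the range and the quartiles using the standard-library `statistics.quantiles` (hypothetical data including one outlier, to show why the interquartile range is the more robust measure):

```python
import statistics as st

# Hypothetical observations, sorted, with one outlier (40)
data = [1, 3, 4, 6, 7, 8, 10, 15, 40]

data_range = max(data) - min(data)    # range: badly affected by the outlier

q1, q2, q3 = st.quantiles(data, n=4)  # lower quartile, median, upper quartile
iqr = q3 - q1                         # interquartile range: spread of the middle half
```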
Probability - The Normal
distribution - Sampling methods
Probability
• Probability is expressed as a proportion
between 0 and 1, where 0 means an event is
certain not to occur
• and 1 means an event is certain to occur
• If the probability (p) of an event is 0.01, it is
unlikely to occur frequently (the actual
chance is 1 in 100)
• If p = 0.99, then the event is highly likely to
occur (the chance is 99 times in 100).
• Chance behavior is unpredictable in the short
run, but has a regular and predictable pattern
in the long run.
• The probability of any outcome of a random
phenomenon is the proportion of times the
outcome would occur in a very long series of
repetitions.
• Sample Space - the set of all possible
outcomes of a random phenomenon
• Event - any set of outcomes of interest
• Probability of an event - the relative frequency
of this set of outcomes over an infinite
number of trials
• Pr(A) is the probability of event A
p(A) = (number of occurrences of A) / (total number of possible outcomes)

e.g., we can predict that the probability of throwing a head (H) with a fair coin is

p(H) = H / (H + T(tails)) = 1/2 = 0.5
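The idea that chance is unpredictable in the short run but regular in the long run can be illustrated with a quick simulation (standard library only; the seed and flip counts are arbitrary choices):

```python
import random

random.seed(42)  # fixed seed so the illustration is reproducible

def relative_frequency_of_heads(n_flips):
    """Flip a fair coin n_flips times and return the proportion of heads."""
    heads = sum(random.random() < 0.5 for _ in range(n_flips))
    return heads / n_flips

short_run = relative_frequency_of_heads(10)      # can easily stray far from 0.5
long_run = relative_frequency_of_heads(100_000)  # settles very close to p(H) = 0.5
```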
Parameters vs. Statistics
• A parameter is a number that describes the
population.
• Usually its value is unknown.
• A statistic is a number that can be computed
from the sample data without making use of
any unknown parameters.
• In practice, statistic is used to estimate an
unknown parameter.
Normal distribution
• The normal distribution is the most important
and most widely used distribution in statistics.
• It is sometimes called the "bell curve"
• It is also called the "Gaussian curve" after the
mathematician Carl Friedrich Gauss.
• Seven features of normal distributions:
– Normal distributions are symmetric around their mean.
– The mean, median, and mode of a normal distribution are
equal.
– The area under the normal curve is equal to 1.0.
– Normal distributions are denser in the center and less
dense in the tails.
– Normal distributions are defined by two parameters, the
mean (μ) and the standard deviation (σ).
– 68% of the area of a normal distribution is within one
standard deviation of the mean.
– Approximately 95% of the area of a normal distribution is
within two standard deviations of the mean.
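The 68% and 95% features can be verified with the standard-library `statistics.NormalDist` (a standard normal with μ = 0 and σ = 1 is used here purely for illustration):

```python
from statistics import NormalDist

z = NormalDist(mu=0, sigma=1)  # standard normal distribution

# Area under the curve within one and two standard deviations of the mean
within_1sd = z.cdf(1) - z.cdf(-1)   # about 0.68
within_2sd = z.cdf(2) - z.cdf(-2)   # about 0.95
```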
Sampling methods
Sampling
• Sampling is a statistical procedure concerned with the
selection of individual observations; it helps to
make statistical inferences about the population.
• The Main Characteristics of Sampling
• In sampling, it is assumed that samples drawn from the
population are representative, so that sample means can
be used to estimate population means.
• A population can be defined as a whole that includes all
items and characteristics of the research taken into study.
• However, gathering all this information is time consuming
and costly. Therefore inferences are made about the
population with the help of samples.
• Probability sampling is the sampling technique
in which every individual unit of the
population has a known, greater-than-zero
probability of being selected into the sample.
• Non-probability sampling is the sampling
technique in which some elements of the
population have no probability of getting
selected into a sample.
• Random sampling:
• In data collection, every individual observation
has equal probability to be selected into a
sample.
• In random sampling, there should be no
pattern when drawing a sample.
• Types of random sampling:
• Simple random sampling: the researcher draws
a sample from the population using a random
number generator, so that every unit has the
same chance of selection.
• Simple random sampling is of two types:
one in which samples are drawn with
replacement, and one in which samples are
drawn without replacement.
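Both variants can be sketched with the standard `random` module (the sampling frame of 100 numbered units is hypothetical):

```python
import random

population = list(range(1, 101))  # hypothetical sampling frame of 100 units
random.seed(1)                    # fixed seed for a reproducible illustration

# Without replacement: each unit can be selected at most once
without_replacement = random.sample(population, k=10)

# With replacement: the same unit may be drawn more than once
with_replacement = random.choices(population, k=10)
```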
• Equal probability systematic sampling: In this
type of sampling method, a researcher starts
from a random point and selects every nth
subject in the sampling frame. In this method,
there is a danger of order bias.
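Equal probability systematic sampling can be sketched as: pick a random start within the first interval, then take every nth unit (the frame and the interval k are hypothetical):

```python
import random

frame = list(range(1, 101))  # hypothetical sampling frame of 100 units
k = 10                       # sampling interval for a sample of 100 / 10 = 10 units

random.seed(0)
start = random.randrange(k)  # random starting point within the first interval
sample = frame[start::k]     # then every k-th unit in the frame
```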
• Stratified simple random sampling: In stratified
simple random sampling, a proportion from
strata of the population is selected using simple
random sampling. For example, a fixed
proportion is taken from every class from a
school.
• The population is first split into groups. The
overall sample consists of some members from
every group. The members from each group are
chosen randomly.
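The fixed-proportion idea can be sketched as follows (the class names, class sizes and the 10% fraction are all hypothetical):

```python
import random

# Hypothetical school: class name -> list of pupil IDs (the strata)
strata = {
    "class_A": list(range(0, 100)),
    "class_B": list(range(100, 160)),
    "class_C": list(range(160, 200)),
}

random.seed(3)
fraction = 0.1  # fixed proportion taken from every class

# Simple random sample within each stratum
sample = {
    name: random.sample(members, k=int(len(members) * fraction))
    for name, members in strata.items()
}
```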
• Multistage stratified random sampling: In
multistage stratified random sampling, selection
proceeds in stages: strata are chosen from a
homogeneous group, and then units are drawn
from the selected strata using simple random
sampling. For example, drawing a sample from
the nth class and nth stream of a school is
multistage stratified random sampling.
• Cluster sampling: Cluster sampling occurs when a
random sample is drawn from certain
aggregational geographical groups.
• The population is first split into groups. The
overall sample consists of every member from
some of the groups. The groups are selected at
random.
• Multistage cluster sampling: Multistage cluster
sampling occurs when a researcher draws a
random sample from the smaller unit of an
aggregational group.
• Types of non-random sampling: Non-random
sampling is widely used in qualitative
research.
• Random sampling is too costly in qualitative
research.
• The following are non-random sampling
methods:
• Convenience sampling: The researcher chooses a sample
that is readily available in some non-random way.
• Availability sampling occurs when the researcher selects
the sample based on the availability of a sample.
• This method is also called haphazard sampling. E-mail
surveys are an example of availability sampling.
• Quota sampling: This method is similar to the availability
sampling method, but with the constraint that the sample
is drawn proportionally by strata.
• Expert sampling: This method is also known as judgment
sampling. In this method, a researcher collects the samples
by taking interviews from a panel of individuals known to
be experts in a field.
• Data collection is the process of gathering and
measuring information on variables of interest, in
an established systematic fashion that enables
one to answer stated research questions, test
hypotheses, and evaluate outcomes.
• The data collection component of research is
common to all fields of study including physical
and social sciences, humanities, business, etc.
While methods vary by discipline, the emphasis
on ensuring accurate and honest collection
remains the same.
• The importance of ensuring accurate and
appropriate data collection
• Regardless of the field of study or preference for
defining data (quantitative, qualitative), accurate
data collection is essential to maintaining the
integrity of research.
• Both the selection of appropriate data collection
instruments (existing, modified, or newly
developed) and clearly delineated instructions for
their correct use reduce the likelihood of errors
occurring.
• Consequences from improperly collected data
include
– inability to answer research questions accurately
– inability to repeat and validate the study
– distorted findings resulting in wasted resources
– misleading other researchers to pursue fruitless
avenues of investigation
– compromising decisions for public policy
– causing harm to human participants and animal
subjects
• Issues related to maintaining integrity of data collection:
• The primary rationale for preserving data integrity is to
support the detection of errors in the data collection
process, whether they are made intentionally (deliberate
falsifications) or not (systematic or random errors).
• ‘Quality assurance’ and ‘Quality control’ are two
approaches that can preserve data integrity and ensure the
scientific validity of study results.
• Quality assurance - activities that take place before data
collection begins
• Quality control - activities that take place during and after
data collection
• Quality Assurance
• Since quality assurance precedes data
collection, its main focus is 'prevention' (i.e.,
forestalling problems with data collection).
• This proactive measure is best demonstrated
by the standardization of protocol developed
in a comprehensive and detailed procedures
manual for data collection.
• Quality Control
• While quality control activities
(detection/monitoring and action) occur
during and after data collection, the details
should be carefully documented in the
procedures manual.
• Examples of data collection problems that
require prompt action include:
– errors in individual data items
– violation of protocol
– problems with individual staff or site performance
– fraud or scientific misconduct
Tests of hypothesis (significant
difference, correlation,
regression)
• A research hypothesis is a specific, clear, and
testable proposition or predictive statement
about the possible outcome of a scientific
research study based on a particular property of
a population, such as presumed differences
between groups on a particular variable or
relationships between variables.
• Specifying the research hypotheses is one of the
most important steps in planning a scientific
quantitative research study.
• A quantitative researcher usually states an a
priori expectation about the results of the study
in one or more research hypotheses before
conducting the study, because the design of the
research study and the planned analyses often
are determined by the stated hypotheses.
• Thus, one of the advantages of stating a research
hypothesis is that it requires the researcher to
fully think ahead
• Before writing research hypotheses it is crucial to first
consider the general research question posed in a study.
• Hypotheses in Quantitative Studies
Research hypotheses in quantitative studies take a familiar
form: one independent variable, one dependent variable,
and a statement about the expected relationship between
them.
• Most researchers prefer to present research hypotheses in
a directional format, meaning that some statement is made
about the expected relationship based on examination of
existing theory, past research, general observation, or even
an educated guess.
• It is also appropriate to use the null
hypothesis instead, which states simply that
no relationship exists between the variables;
• The null hypothesis forms the basis of all
statistical tests of significance.
• Hypotheses in Qualitative Studies
Hypotheses in qualitative studies serve a very
different purpose than in quantitative studies.
• Due to the inductive nature of qualitative studies,
the generation of hypotheses does not take place
at the outset of the study.
• Instead, hypotheses are only tentatively proposed
during an iterative process of data collection and
interpretation, and help guide the researcher in
asking additional questions and searching for
disconfirming evidence.
Test of hypothesis

• Significant difference
• Significant correlations
• Regression analysis
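As a minimal, assumption-light illustration of a significance test (a permutation test, which is not a method named in this handout; the group scores are hypothetical), the standard library alone is enough:

```python
import random
import statistics as st

# Hypothetical outcome scores for two independent groups
group_a = [12, 15, 14, 10, 13, 16, 14]
group_b = [9, 11, 8, 12, 10, 9, 11]

observed = st.mean(group_a) - st.mean(group_b)  # observed difference in means

# Null hypothesis: no relationship between group membership and score,
# so randomly reshuffling the labels should often produce differences
# at least as large as the observed one.
random.seed(7)
pooled = group_a + group_b
n_a = len(group_a)
n_perm = 5000
extreme = 0
for _ in range(n_perm):
    random.shuffle(pooled)
    diff = st.mean(pooled[:n_a]) - st.mean(pooled[n_a:])
    if abs(diff) >= abs(observed):
        extreme += 1

p_value = extreme / n_perm  # small p-value: difference unlikely under the null
```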
Parametric and non-
parametric tests
Parametric Test
• The parametric test is the hypothesis test
which provides generalisations for making
statements about the mean of the parent
population
• The statistic rests on the underlying
assumption that there is the normal
distribution of variable and the mean is known
or assumed to be known
• It is assumed that the variables of interest, in
the population are measured on an interval
scale.
• Based on the parameters of the normal curve.
• Data must meet certain assumptions, or
parametric statistics cannot be calculated.
Nonparametric Test
• The nonparametric test is defined as the hypothesis
test which is not based on underlying assumptions,
i.e. it does not require the population’s distribution
to be described by specific parameters.
• The test is mainly based on differences in medians.
• Hence, it is alternately known as the distribution-
free test.
• The test assumes that the variables are measured
on a nominal or ordinal level.
• It is used when the independent variables are non-
metric.
• Nonparametric statistics are not based on the parameters
of the normal curve.
• If data violate the assumptions of a parametric test, the
nonparametric equivalent of that test is used.
• Also consider using nonparametric equivalent tests when
sample sizes are limited (e.g., n < 30) or when there are
outliers that cannot be removed.
• Though nonparametric statistical tests have more flexibility
than parametric statistical tests, they are generally less
powerful; therefore, most statisticians recommend that, when
their assumptions are met, parametric tests are preferred.
Selecting statistical tests
• There is a wide range of statistical tests.
• The decision of which statistical test to use
depends on the research design, the
distribution of the data, and the type of
variable.
• In general, if the data are normally distributed,
parametric tests are preferred.
• If the data are non-normal, non-parametric
tests are selected.
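The decision described above can be summarised in a small helper (the function name and category labels are illustrative, not a standard API):

```python
def suggest_test_family(data_type, normally_distributed):
    """Rough guide to choosing a test family, following the criteria above.

    data_type: 'nominal', 'ordinal' or 'metric'
    normally_distributed: whether metric data meet the normality assumption
    """
    if data_type == "metric" and normally_distributed:
        return "parametric"
    return "non-parametric"
```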
Common statistical tests and their uses
