Advanced Statistics Concepts

Statistics

Uploaded by

afs.seu.23
Copyright
© © All Rights Reserved

ADVANCED

STATISTICAL CONCEPTS
Dr. Amel Fayed
MD, MPH, PhD
Associate Professor Biostatistics
College of Medicine
PNU
Outline

■ Variables, statistics and parameters

■ Population and sampling

■ z-test and quartiles

■ Hypothesis testing

■ Confidence intervals
How to Talk to a
Statistician?
■ “It’s all Greek to me . . .” Καλημέρα
Why Do I Need
STATISTICS?

■ Planning a study

■ Proposal writing

■ Data analysis and interpretation

■ Presentation and manuscript development


When Should I Seek a
Statistician’s Help?

■ Literature interpretation

■ Defining the research questions

■ Deciding on data collection instruments

■ Determining appropriate study size


Vocabulary
Definition of Statistics

Statistics is a collection of procedures and principles for
gathering data and analyzing information to help people
make decisions when faced with uncertainty.


Sources of data

Records    Surveys    Experiments

Comprehensive    Sample
Types of data

Constants    Variables
Data: Example

1) Age, Height, Weight


2) Grade in exams
3) Temperature
4) Gender
5) Pulse, blood pressure
6) Blood glucose, Hb.
Data
Quantitative
■ Continuous (heights, weights)
■ Discrete (number of patients, pulse)
Qualitative
■ Ordinal (severity of pain, social level)
■ Nominal (gender, colours)

■ Q: Can you classify smoking as a quantitative, an ordinal, and a nominal variable?
Questionnaire
Data collection tool
Types of questions:
Closed-ended questions
■ These questions provide specific answer choices.
■ Easier for analysis
■ But can miss valuable information
■ How old are you?
1) less than 20 years
2) 20-40 years
3) 40 years or more
■ Are you diabetic?
1) yes
2) No
3) I don’t know
■ Are you Saudi?
1) yes
2) no
Open-ended questions
■ provide no answer choices.
■ They are easy to ask
■ allow for a wide variety of responses
■ How did you know about free clinics appointments?
■ What is your main complaint?
■ How did you find the services in pediatrics clinic?
■ Describe your pain?
Partially closed questions
■ How did you find your nurse?
1) cooperative
2) Competent
3) Nice
4) Other………
SAMPLING
Knowing a whole from its part
population → sample → statistic → parameter

1. The research takes us from the population to the sample.
2. The sample is used to compute the statistic.
3. The sample statistics (mean, SD) are used to estimate the population
parameters.
Why not just collect data
from the whole population?

Sometimes impractical, often impossible!


■ If we cannot measure everyone in the population,
does that mean we cannot study populations or
make any conclusions about them?

NO!
■ Data from a sample can tell us something about a
population
Population
■ The entire collection of events of interest
■ E.g., collection of people you want to understand
■ Doesn’t necessarily mean persons, it may be records, animals.
Sample

■ Subset of events selected from a population
■ Intended to represent the population
Generalizations will depend on how
well the sample represents the
population.

Representative sample = sample whose characteristics are
similar to those of the population

Random sampling = each event in the population has an equal
chance of being selected for the sample
Parameters and Statistic
The Sampling Design Process
1. Define the Population
2. Determine the Sampling Frame
3. Select Sampling Technique(s)
4. Determine the Sample Size
5. Execute the Sampling Process


Sampling Methods
Probability Sampling
■ Simple random sampling
■ Stratified random sampling
■ Systematic random sampling
■ Cluster (area) random sampling
■ Multistage random sampling
Non-Probability Sampling
■ Deliberate (quota) sampling
■ Convenience sampling
■ Purposive sampling
■ Snowball sampling
■ Consecutive sampling
Probability Samples
Probability samples offer each
respondent a known, non-zero chance
(an equal chance in simple random sampling)
of being included in the sample.
They are considered to be:
• Objective
• Representative
Non Probability Samples
A non probability sample relies on the
researcher selecting the respondents.
They are considered to be:
• Subjective
• Unrepresentative
Probability Sampling Methods

• Random Sampling
• Systematic Random Sampling
• Stratified Random Sampling
• Cluster Random Sampling
• Multi-Stage Sampling
Simple Random Sample
• In order to be random, a full list of
everyone within a sample frame is required.
• Random number tables or a computer is
then used to select respondents at random
from the list.
Advantage
Most representative group
Disadvantage
Difficult to identify every member of a population
(sample frame)
How to select a simple
random sample
1. Define the population
2. Determine the desired sample size
3. List all members of the population or the potential
subjects
■ For example:
– 4th grade boys who have demonstrated problem
behaviors
– Let's select 10 boys from the list
Simple random sampling
Table of random numbers
684257954125632140
582032154785962024
362333254789120325
985263017424503686
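The selection steps above can be sketched in Python, with a computer's random number generator standing in for the random number table. This is only a minimal illustration: the list of 40 boys, the function name `simple_random_sample` and the seed are assumptions made for the example, not part of the original slides.

```python
import random


def simple_random_sample(frame, n, seed=None):
    """Draw n distinct members from the sampling frame, each equally likely."""
    rng = random.Random(seed)  # a seed only makes the draw reproducible
    return rng.sample(frame, n)


# Hypothetical sampling frame: a full list of 40 boys (names are made up).
boys = [f"boy_{i:02d}" for i in range(1, 41)]

# Select 10 boys from the list, as in the example above.
selected = simple_random_sample(boys, 10, seed=42)
print(selected)
```

Sampling is without replacement, so no boy can be chosen twice.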
Systematic Random Sampling
• This selection is like random
sampling, but rather than using
random tables or a computer to
select your respondents, you select
them in a systematic way.

• E.g. every tenth person on the
college list is selected.
Systematic random
Sampling
■ Technique
– Use “system” to select sample (e.g., every 5th item in
alphabetized list, every 10th name in phone book)
■ Advantage
– Quick, efficient, saves time and energy
■ Disadvantage
– System for selecting subjects may introduce systematic
error
Systematic Sampling
Select some starting point and then
select every kth element in the population.
Example: patients number 1, 11, 21, 31, 41,
51, 61, 71, 81, 91 will be in one group and patients
number 5, 15, 25, 35, 45, 55, 65, 75, 85, 95 will be in
the other group.
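The "every kth element" rule can be written as a slice over the sampling frame. The sketch below assumes a hypothetical frame of 100 patients numbered 1 to 100; the function name `systematic_sample` is illustrative only.

```python
import random


def systematic_sample(frame, k, start=None, seed=None):
    """Take every k-th element of the frame, beginning at index `start`.

    If no start is given, a random start between 0 and k-1 is chosen.
    """
    if k < 1:
        raise ValueError("k must be a positive integer")
    if start is None:
        start = random.Random(seed).randrange(k)
    return frame[start::k]


# Hypothetical frame: patients numbered 1 to 100.
patients = list(range(1, 101))

group_a = systematic_sample(patients, k=10, start=0)  # starts at patient 1
group_b = systematic_sample(patients, k=10, start=4)  # starts at patient 5
print(group_a)
print(group_b)
```

If the list has a periodic pattern that matches k, the sample inherits that pattern, which is the systematic error mentioned under the disadvantage.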
Stratified Random
Sampling
■ Technique
– Divide population into various strata
– Randomly sample within each stratum
– Sample from each stratum should be proportional to the size of that stratum
■ Advantage
– Better in achieving representativeness on
control variable
■ Disadvantage
– Difficult to pick appropriate strata
– Difficult to identify every member in population
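A minimal sketch of proportional stratified sampling follows. The population of 60 female and 40 male patients, the 10% sampling fraction and the function name `stratified_sample` are assumptions chosen for illustration.

```python
import random
from collections import defaultdict


def stratified_sample(units, stratum_of, fraction, seed=None):
    """Randomly sample the same fraction from every stratum."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for unit in units:
        strata[stratum_of(unit)].append(unit)  # divide population into strata

    sample = []
    for name in sorted(strata):
        members = strata[name]
        n_h = round(len(members) * fraction)  # proportional allocation
        sample.extend(rng.sample(members, n_h))  # random sample within stratum
    return sample


# Hypothetical population: 60 female and 40 male patients.
patients = [("F", i) for i in range(60)] + [("M", i) for i in range(40)]
sample = stratified_sample(patients, stratum_of=lambda p: p[0], fraction=0.1, seed=1)

n_female = sum(1 for sex, _ in sample if sex == "F")
n_male = sum(1 for sex, _ in sample if sex == "M")
print(n_female, n_male)
```

The sample keeps the 60:40 split of the population, which is what "representativeness on the control variable" means here.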
Cluster Random Sampling

• Similar to stratified
sampling, but the groups
are selected for their
geographical location
• e.g. school children within
a particular school.
• The school is the cluster,
with the children being
selected randomly from
within the cluster
Cluster sampling
[Figure: population divided into five clusters, Section 1 to Section 5]
Multistage random
sampling
■ Stage 1
– randomly sample clusters (schools)
■ Stage 2
– randomly sample individuals from the schools
selected
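The two stages can be sketched as two successive random draws. The five schools of 30 children each, the sample sizes and the function name `multistage_sample` are hypothetical values used only to make the example runnable.

```python
import random


def multistage_sample(clusters, n_clusters, n_per_cluster, seed=None):
    """Stage 1: randomly pick clusters. Stage 2: randomly pick individuals in each."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)  # stage 1: schools
    return {
        name: rng.sample(clusters[name], min(n_per_cluster, len(clusters[name])))
        for name in chosen  # stage 2: children within each selected school
    }


# Hypothetical population: 5 schools with 30 children each.
schools = {
    f"school_{s}": [f"s{s}_child_{c}" for c in range(30)] for s in range(1, 6)
}

sample = multistage_sample(schools, n_clusters=2, n_per_cluster=5, seed=7)
for school, children in sample.items():
    print(school, children)
```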
Convenience Sampling
• This involves selecting the nearest and
most convenient people to participate in
the research.
• This method of selection is not
representative and is considered a very
unsatisfactory way to conduct research.
Quota Sampling
■ For example, interviewers might be tempted to interview those who
look most helpful. The problem is that these samples may be biased
because not everyone gets a chance of selection. This non-random element
is its greatest weakness, and quota versus probability sampling has been
a matter of controversy for many years.
Snowball Sampling
• This type of sampling is used when the research is focused
on participants with very specific characteristics, such
as being members of a certain group.
• Having identified and contacted one gang member, the
researcher asks to be put in touch with any friends or
associates who are also group members.
• This type of sampling is not representative; however, it is
useful, especially where the groups in the research are not
socially organised, i.e. they do not have clubs or membership
lists.
Purposive Sampling
■ Purposive sampling (criterion-based sampling)
– Establish criteria necessary for being included in
study and find sample to meet criteria
■ Solution: Screening
– Use random sampling to obtain a representative
sample of larger population and then those
subjects that are not members of the desired
population are screened or filtered out
– EX: want to study smokers but can’t identify all
smokers
Consecutive sampling
– Outcome of 1000 consecutive patients presenting to
the emergency room with chest pain
NORMAL
DISTRIBUTION
Properties of the Normal
Distribution
■ Many continuous variables have distributions that are
bell-shaped and are called normally distributed variables.
■ The theoretical curve, called the normal distribution curve,
can be used to study many variables that are not exactly
normally distributed but are approximately normal.
Areas Under the Normal Curve
[Figure: areas under the normal curve]
STANDARD
NORMAL
DISTRIBUTION
Definition

Standard Normal Distribution:
a normal probability distribution that has a
mean of 0 and a standard deviation of 1.
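As an illustration of areas under this curve, the sketch below computes the proportion of a standard normal distribution lying within 1, 2 and 3 standard deviations of the mean. It uses `statistics.NormalDist` from the Python standard library; the helper name `area_within` is an assumption for the example.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1.
std_normal = NormalDist(mu=0, sigma=1)


def area_within(k):
    """Area under the standard normal curve between -k and +k."""
    return std_normal.cdf(k) - std_normal.cdf(-k)


for k in (1, 2, 3):
    print(f"within {k} SD of the mean: {area_within(k):.4f}")
```

The three areas are roughly 68%, 95% and 99.7%, which is why z scores beyond 2 are rare.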
Z score
$z = \dfrac{x - \bar{x}}{s}$

■ $x$ is the raw score
■ $\bar{x}$ is the average score
■ $s$ is the standard deviation
If I want to test which is
smarter?

■ Raw score is 10 (the mean was 5 and the SD was 1)
– Z score is +5 (above the mean by 5 SD)
■ Raw score is 15 (the mean was 8 and the SD was 2)
– Z score is +3.5 (above the mean by 3.5 SD)
■ Raw score is 15 (the mean was 20 and the SD was 2)
– Z score is -2.5 (below the mean by 2.5 SD)
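The three comparisons can be checked with a short function that applies the z score definition (raw score minus mean, divided by the standard deviation). The function name `z_score` is illustrative.

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    if sd <= 0:
        raise ValueError("standard deviation must be positive")
    return (x - mean) / sd


# (raw score, mean, SD) for the three students in the example.
students = [(10, 5, 1), (15, 8, 2), (15, 20, 2)]
zs = [z_score(x, mean, sd) for x, mean, sd in students]
print(zs)  # [5.0, 3.5, -2.5]
```

The first student has the lowest raw score but the highest z score, so relative to their own class they performed best.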
Here’s how we pack a tonne
of info into one little number!
1. The sign of the z-score tells you whether it is above
or below the mean
– + is above
– - is below
2. The number tells you how far it is from the mean
– Big number = far from mean
– Little number = close to mean

Interpreting Z Scores

Whenever a value is less than the mean, its
corresponding z score is negative
Ordinary values: z score between -2 and +2
Unusual values: z score < -2 or z score > +2
Percentiles and quartiles
An illustration of Quartile and Percentile
Higher
25% of cases
3rd Quartile or 75th Percentile
25% of cases
2nd Quartile or Median
25% of cases
1st Quartile or 25th Percentile
25% of cases
Zero Quartile or Minimum
Lower
■ Notice that these three quartiles cut the data set into
four parts, hence the name quartiles:
1) the part between the minimum and Q1 (25%),
2) the part between Q1 and Q2 (25%),
3) the part between Q2 and Q3 (25%),
4) the part between Q3 and the maximum (25%)


The Five-Number
Summary
■ The five-number summary, which reports the
largest and smallest values of the data, the
quartiles and median, provides a compact
description of the data. In symbols, the five-
number summary is

■ Minimum, Q1, Median, Q3, Maximum.
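A minimal sketch of the five-number summary, using `statistics.quantiles` from the Python standard library. The pulse readings are made-up data, and the "inclusive" interpolation method is one of several quartile conventions, so other software may report slightly different Q1 and Q3 values.

```python
from statistics import quantiles


def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) for a list of numbers."""
    data = sorted(values)
    q1, median, q3 = quantiles(data, n=4, method="inclusive")
    return data[0], q1, median, q3, data[-1]


# Hypothetical pulse readings (beats per minute).
pulse = [60, 64, 64, 68, 72, 76, 80, 84, 92]
summary = five_number_summary(pulse)
print(summary)  # (60, 64.0, 72.0, 80.0, 92)
```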


Boxplot

■ A boxplot is a graph of the five-number summary.

■ A central box spans the quartiles Q1 and Q3.

■ A line in the box marks the median M.

■ Lines extend from the box out to the smallest and
largest observations.
Interquartile range

■ A measure of spread based on these quartiles is
the Interquartile range IQR = Q3 - Q1, the distance
between the quartiles. The IQR gives the spread in
data values covered by the middle half of the data.
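The IQR formula can be sketched as follows. The systolic blood pressure readings are hypothetical, and the quartiles are computed with the "inclusive" method of `statistics.quantiles`, which is an assumption about the quartile convention.

```python
from statistics import quantiles


def interquartile_range(values):
    """IQR = Q3 - Q1, the spread covered by the middle half of the data."""
    q1, _median, q3 = quantiles(sorted(values), n=4, method="inclusive")
    return q3 - q1


# Hypothetical systolic blood pressure readings (mmHg).
sbp = [110, 114, 118, 120, 124, 128, 130, 136, 150]
iqr = interquartile_range(sbp)
print(iqr)  # 12.0
```

Here Q1 is 118 and Q3 is 130, so the extreme reading of 150 does not affect the IQR.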
HYPOTHESIS
TESTING

Vocabulary
■ Hypotheses: a statement of the research
question that sets forth the appropriate
statistical evaluation
– Null hypothesis “H0”: statement of
no difference or association between
variables
The Scientific Method
Observation
Hypothesis
Experiment
Results
■ Evidence supports H
■ Evidence inconsistent with H: Revise H
Hypothesis Testing

■ Null hypothesis:
H0: there is no difference between the studied
groups

■ Alternative hypothesis:
H1: there is a difference between the studied
groups
Level of confidence

■ How sure am I about my conclusions?
■ Two important levels:
1) 95%: I am 95% sure that there is a
difference between the studied groups
2) 99%: I am 99% sure that there is a
difference between the studied groups
P value
■The lower the p-value, the more
"significant" the result.
■One often rejects a null
hypothesis if the p-value is less
than 0.05 or 0.01, corresponding
to a level of significance 5% or
1% respectively.
■If the p-value is more than 0.05 in
any statistical test, this means
the result is not significant at
the 5% level.
■The smaller the p-value (the closer
to 0), the stronger the evidence
against the null hypothesis and the
more likely the test is significant.
■The larger the p-value (0.05 or
more), the more likely the test is
not significant.
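The decision rule can be sketched for a z-test, one of the tests listed in the outline. The example assumes a z statistic has already been computed; the function names and the z values 2.5 and 1.0 are illustrative only.

```python
from statistics import NormalDist


def two_sided_p_value(z):
    """Probability of a z statistic at least this extreme if H0 is true."""
    return 2 * (1 - NormalDist().cdf(abs(z)))


def is_significant(p_value, alpha=0.05):
    """Reject H0 when the p-value is below the chosen significance level."""
    return p_value < alpha


for z in (2.5, 1.0):  # hypothetical test statistics
    p = two_sided_p_value(z)
    print(f"z = {z}: p = {p:.4f}, significant at 5%: {is_significant(p)}")
```

A z of about 1.96 gives a two-sided p-value of about 0.05, which is the boundary of significance at the 5% level.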

Type I error
■ Type I error: rejecting the null hypothesis when it
is actually true (false positive result).
■ More serious than type II error.
■ Inversely related to type II error.
■ Probability of type I error is called α.
■ α is usually set at 0.01 or 0.05 and is the same as the
level of significance.

Type II error
■ It occurs when we fail to reject the null hypothesis
although it is false (false negative result).
■ Less serious than type I error.
■ Probability of type II error is called β.
■ β = 1 - Power
■ Power of the study is its ability to detect a true
difference, that is, to avoid false negative results. We
usually use 0.80 (80%) as an accepted level of power.
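To make the relation β = 1 - Power concrete, here is a sketch of the power of a two-sided one-sample z-test. The slides do not specify a test, so the choice of test, the effect size of 5 units, the SD of 10 and the sample size of 32 are all assumptions for illustration.

```python
from math import sqrt
from statistics import NormalDist


def z_test_power(effect, sd, n, alpha=0.05):
    """Power of a two-sided one-sample z-test to detect a true mean shift `effect`."""
    nd = NormalDist()
    z_crit = nd.inv_cdf(1 - alpha / 2)  # critical value for the chosen alpha
    shift = effect * sqrt(n) / sd       # true shift measured in standard errors
    return (1 - nd.cdf(z_crit - shift)) + nd.cdf(-z_crit - shift)


power = z_test_power(effect=5, sd=10, n=32)  # hypothetical study
beta = 1 - power                             # probability of a type II error
print(f"power = {power:.3f}, beta = {beta:.3f}")
```

With no true effect the function returns α, which shows the link between the significance level and the type I error rate.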

TYPE I AND TYPE II ERRORS
One sided Vs two sided
(one tail Vs two tails)
If we want to test whether the SBP of group A is different from
the SBP of group B:
■ The null hypothesis is “mean SBP in group A = mean
SBP in group B”
■ The two-sided alternative hypothesis is “mean SBP in
group A ≠ mean SBP in group B”
■ A one-sided alternative hypothesis is “mean SBP in group
A > mean SBP in group B”
■ Another one-sided alternative hypothesis is “mean SBP in
group A < mean SBP in group B”
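The three alternative hypotheses lead to different p-values for the same test statistic. The sketch below assumes a z statistic for the difference (group A minus group B) has already been computed; the value 1.8 and the function name are hypothetical.

```python
from statistics import NormalDist


def p_values_for_alternatives(z):
    """p-values of a z statistic (A minus B) under each alternative hypothesis."""
    nd = NormalDist()
    return {
        "A != B (two-sided)": 2 * (1 - nd.cdf(abs(z))),
        "A > B (one-sided)": 1 - nd.cdf(z),
        "A < B (one-sided)": nd.cdf(z),
    }


p = p_values_for_alternatives(1.8)  # hypothetical z statistic
for alternative, value in p.items():
    print(f"{alternative}: p = {value:.4f}")
```

For this z, the one-sided test "A > B" is significant at the 5% level while the two-sided test is not, so the alternative must be chosen before looking at the data.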
CONFIDENCE
INTERVAL
CONFIDENCE
INTERVAL FOR
PROPORTIONS
Notation for Proportions
$p$ = population proportion

$\hat{p} = \dfrac{x}{n}$ = sample proportion of $x$ successes in a sample of size $n$
(pronounced ‘p-hat’)

$\hat{q} = 1 - \hat{p}$ = sample proportion of failures in a sample of size $n$
Definition
Point Estimate

A point estimate is a single value (or
point) used to approximate a
population parameter.

■ The sample proportion $\hat{p}$ is the best
point estimate of the population
proportion $p$.
Example: About 829 inpatients were surveyed, and
422 of them were diabetic. Using these survey
results, find the best point estimate of the proportion
of all inpatients who are diabetics.

Because the sample proportion is the best point
estimate of the population proportion, we
conclude that the best point estimate of p is
422/829 = 0.51.
Definition
Confidence Interval

■ A confidence interval (or interval
estimate) is a range (or an interval)
of values used to estimate the true
value of a population parameter. A
confidence interval is sometimes
abbreviated as CI.
Definition
Confidence Interval
■ A confidence level is the probability $1 - \alpha$
(often expressed as the equivalent
percentage value) that is the proportion of
times that the confidence interval actually
does contain the population parameter,
assuming that the estimation process is
repeated a large number of times.

This is usually 90%, 95%, or 99%.
($\alpha$ = 10%), ($\alpha$ = 5%), ($\alpha$ = 1%)
Confidence Interval
[Figure: point estimate of 25% with a margin of error E = 5% on each side, giving a lower limit of 20% and an upper limit of 30%]
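As a worked illustration, the sketch below computes a confidence interval for the inpatient survey (422 diabetics out of 829). It uses the normal-approximation (Wald) formula, point estimate ± E with E = z·sqrt(p̂q̂/n). The slides do not state which formula is intended, so this choice is an assumption.

```python
from math import sqrt
from statistics import NormalDist


def proportion_ci(x, n, confidence=0.95):
    """Normal-approximation (Wald) confidence interval for a proportion."""
    p_hat = x / n                      # point estimate
    q_hat = 1 - p_hat
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)
    e = z * sqrt(p_hat * q_hat / n)    # margin of error E
    return p_hat, p_hat - e, p_hat + e


p_hat, lower, upper = proportion_ci(422, 829, confidence=0.95)
print(f"point estimate = {p_hat:.3f}, 95% CI = ({lower:.3f}, {upper:.3f})")
```

Raising the confidence level to 99% widens the interval, because a larger z value is needed.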
Overlapping of
Confidence Intervals
CONFIDENCE
INTERVAL FOR
RISK RATIO
RISK RATIO
Interpretation of the
significance of OR and RR
■ Confidence interval
– If the confidence interval includes 1, the result is not significant
– If both limits of the confidence interval exceed 1,
exposure increases the risk
– If both limits of the confidence interval are less than 1,
exposure decreases the risk


Confidence interval of RR
95% C.I = 1.98-3.45
95% C.I = 0.45-0.77
95% C.I = 0.80-5.6
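The interpretation rules can be applied to the three intervals above with a small helper. The function name `interpret_rr_ci` is illustrative.

```python
def interpret_rr_ci(lower, upper):
    """Interpret a confidence interval for a risk ratio (or odds ratio)."""
    if lower > 1:
        return "exposure increases the risk"
    if upper < 1:
        return "exposure decreases the risk"
    return "not significant"  # the interval includes 1


intervals = [(1.98, 3.45), (0.45, 0.77), (0.80, 5.6)]
results = [interpret_rr_ci(lo, hi) for lo, hi in intervals]
for (lo, hi), verdict in zip(intervals, results):
    print(f"95% CI {lo}-{hi}: {verdict}")
```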
CORRELATION
Defining Correlation

■ Correlation analysis is used to measure strength of the
association (linear relationship) between two variables
– Only concerned with strength of the relationship
– No causal effect is implied
■ These variables change together (direct or inverse)
■ Usually scale (interval or ratio) variables
Linear Correlation
[Figure: scatter plots of Y against X showing linear relationships and curvilinear relationships]
[Figure: scatter plots of Y against X showing strong relationships and weak relationships]
[Figure: scatter plot showing no relationship]
Slides from: Statistics for Managers Using Microsoft® Excel 4th Edition, 2004 Prentice-Hall
Correlation Coefficient (r)
■ A statistic that quantifies a relation between two variables
■ Can be either positive or negative
■ Falls between -1.00 and 1.00
■ The value of the number (not the sign) indicates the
strength of the relation
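A minimal sketch of the Pearson correlation coefficient, written out from its definition so the sign and size of r can be seen directly. The slides do not name a specific coefficient, so Pearson's r is an assumption, and the age and blood pressure values are made up.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equally long lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of the same length (at least 2 values)")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Hypothetical data: age (years) and systolic blood pressure (mmHg).
age = [25, 32, 41, 48, 56, 63]
sbp = [112, 118, 121, 130, 134, 142]
r = pearson_r(age, sbp)
print(f"r = {r:.3f}")
```

A value near +1 or -1 indicates a strong linear relation; the sign only gives the direction.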
Positive Correlation

■ Association between variables such that high scores on one
variable tend to have high scores on the other variable
■ A direct relation between the variables
Association between infection control measures and reproduction
number. (a) The proportion of subjects vaccinated showed a
significant negative correlation with reproduction number (ρ =
−0.413, p = 0.029). However, (b) mask wearing, (c) hand washing
and (d) gargling with water were not significantly associated with
THANK YOU
See you soon in the second part
