Review and Non Parametric Using SPSS 2023

The document provides a comprehensive overview of basic and advanced statistical tools using SPSS, emphasizing the importance of statistics in scientific research and the potential for misuse. It covers various statistical techniques, including t-tests, chi-squared tests, and confidence intervals, along with their applications in analyzing data. The presentation also highlights the significance of understanding assumptions in statistical tests and offers guidance on interpreting results.

Reviewing Basic and Advanced Statistical Tools Using SPSS
Prepared by:
Armele J. Mangaran
Professor College of Science
Bulacan State University
February 11, 2023
It was either the famous writer Mark Twain or the 19th-century British
Prime Minister Benjamin Disraeli who said: “There are three kinds
of lies: lies, damned lies, and statistics.”

This saying means that statistics can be misused, and that the persuasive
power of numbers can help to strengthen weak arguments.

As long ago as 1889, Carroll D. Wright (a prominent statistician) said,


“The old saying is that ‘figures will not lie,’ but a new saying
is ‘liars will figure’.”

This collection of statistics quotes tries to show the opposite and
make the point that statistics are connected to science. Statistics are
not there to pervert the truth in the interests of a theory somebody
wishes to establish.

“By a small sample, we may judge of the whole piece.”
Miguel de Cervantes, from his novel “Don Quixote”
(1547 – 1616; Spanish novelist, poet and playwright)

“Errors using inadequate data are much less than those using no data at all.”
Charles Babbage
(1791 – 1871; English polymath, inventor and mechanical engineer)
Outline of Presentation

1. Review of Some Basic Concepts
2. Basic Statistical Techniques
3. Advanced Statistical Techniques
Variables

Scale
• Continuous: measurements; takes any value
• Discrete: counts/integers

Categorical
• Ordinal: obvious order
• Nominal: no meaningful order
The research problem

Problem → Methodology
1. Does on-line teaching provide better results than face-to-face teaching? → Traditional experimental research
2. What goes on in a classroom near exam time? → Ethnographic research
3. How do parents feel about school counseling? → Survey research
4. Do teachers treat the genders differently? → Causal-comparative research
Types of Analysis by Types of Data

Describing a group or several groups
• Numerical: array, stem-and-leaf plot, frequency distribution, relative frequency distribution, cumulative percentage distribution, histogram, polygon, cumulative frequency polygon; mean, median, mode, quartiles, range, interquartile range, standard deviation, variance, coefficient of variation, coefficient of quartile deviation
• Categorical: summary table, bar chart, pie chart

Inference about one group
• Numerical: confidence interval estimate for the mean; z-test for the mean; t-test for the mean; chi-square test for the variance or standard deviation
• Categorical: confidence interval for the proportion; z-test of hypothesis for the proportion

Comparing two groups
• Numerical: test for the difference in the means of two independent populations; paired t-test (related samples); Wilcoxon rank sum test; Wilcoxon signed rank test
• Categorical: z-test for the difference between two proportions; chi-square test for the difference between two proportions; McNemar test for two related samples
Types of Analysis by Types of Data (continued)

Comparing more than two groups
• Numerical: one-way analysis of variance; Kruskal-Wallis test; randomized block design; two-way analysis of variance; Friedman test
• Categorical: chi-square test for differences among more than two proportions

Analyzing the relation between two variables
• Numerical: scatter diagram, time series plot; covariance, coefficient of correlation; simple linear regression; t-test of correlation
• Categorical: contingency table, clustered bar chart; chi-square test for independence

Analyzing the relationship between two or more variables
• Numerical: multiple regression; time series forecasting
• Categorical: logistic regression
T-tests
Paired or Independent (Unpaired) Data?
T-tests are used to compare two population means
₋ Paired data: same individuals studied at two different times or under two conditions: PAIRED T-TEST
₋ Independent: data collected from two separate groups: INDEPENDENT SAMPLES T-TEST
Comparison of hours worked in 2010 to 2020
Paired or unpaired?
If the same people have reported their hours for 2010 and 2020, we have PAIRED measurements of the same variable (hours).
Paired null hypothesis: the mean of the paired differences = 0

If different people are used in 2010 and 2020, we have INDEPENDENT measurements.
Independent null hypothesis: the mean hours worked in 2010 is equal to the mean for 2020
SPSS data entry

Paired Data
              2010   2020
Employee 1    35     38
Employee 2    37     35
Employee 3    20     30
Employee 4    25     28

Independent Groups
     Name          Hours   Year
1    Employee A    35      2010
2    Employee B    37      2010
3    Employee C    20      2010
4    Employee D    38      2020
5    Employee E    35      2020
6    Employee F    35      2020
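As a rough illustration of the paired layout, the paired t statistic can be worked out by hand from the four example employees. This is a sketch in Python rather than SPSS; `paired_t` is a hypothetical helper name, and four pairs are far too few for a real analysis.

```python
import math
import statistics

def paired_t(before, after):
    """Return (mean difference, t statistic, degrees of freedom) for paired data."""
    diffs = [a - b for b, a in zip(before, after)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    se = statistics.stdev(diffs) / math.sqrt(n)  # standard error of the mean difference
    return mean_diff, mean_diff / se, n - 1

# Hours from the paired data-entry example (Employees 1 to 4)
hours_2010 = [35, 37, 20, 25]
hours_2020 = [38, 35, 30, 28]

mean_diff, t_stat, df = paired_t(hours_2010, hours_2020)
print(f"mean difference = {mean_diff}, t = {t_stat:.3f}, df = {df}")
```

The test works on one column of differences, which is why paired data are entered with one row per person and one column per occasion.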
What is the t-distribution?
• The t-distribution is similar to the standard normal distribution but has an additional parameter called degrees of freedom (df or v)
• For a paired t-test, v = number of pairs – 1
• For an independent t-test, v = n1 + n2 – 2 (the two group sizes added together, minus 2)
• Used for small samples and when the population standard deviation is not known
• With small sample sizes, the t-distribution has heavier tails than the normal distribution
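The heavier tails can be seen by comparing densities far from the centre. The sketch below uses the standard formula for the t density; the function names and the chosen point (3) are illustrative assumptions, not part of the slides.

```python
import math

def t_pdf(x, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)

def normal_pdf(x):
    """Density of the standard normal distribution."""
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

# Density in the tail (at 3) for increasing degrees of freedom
for df in (3, 10, 30, 100):
    print(f"df = {df:>3}: t density at 3 = {t_pdf(3, df):.5f}")
print(f"normal density at 3   = {normal_pdf(3):.5f}")
```

As the degrees of freedom grow, the tail density shrinks towards the normal value, which is why the two distributions are nearly interchangeable for large samples.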
Example 1: t-Test Results
Pair: Triglyceride level at week 8 (mg/dl) - Triglyceride level at baseline (mg/dl)

Mean | Std. Deviation | Std. Error Mean | 95% Confidence Interval of the Difference (Lower to Upper) | t | df | Sig. (2-tailed)
-11.371 | 80.360 | 13.583 | -38.976 to 16.233 | -.837 | 34 | .408
Null Hypothesis is:

P-value =
Decision (circle correct answer): Reject Null/ Do
not reject Null

Conclusion:
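The columns of the Example 1 output are linked by simple arithmetic, which is a useful check when reading SPSS tables. The sketch below (Python, not SPSS) rebuilds the standard error and t statistic from the reported mean, standard deviation and df; the variable names are illustrative.

```python
import math

mean_diff = -11.371   # mean of the paired differences (from the output)
sd_diff = 80.360      # standard deviation of the differences (from the output)
df = 34               # degrees of freedom (from the output)

n_pairs = df + 1                      # paired t-test: df = number of pairs - 1
se = sd_diff / math.sqrt(n_pairs)     # standard error of the mean difference
t_stat = mean_diff / se               # test statistic

print(f"pairs = {n_pairs}, SE = {se:.3f}, t = {t_stat:.3f}")
```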
Example 2: t-Test Results
Ignore the shaded part of the output for now!
Levene's Test for Equality of Variances: F = 2.328, Sig. = .136

T-test results | t | df | Sig. (2-tailed) | Mean Difference | Std. Error Difference | 95% CI of the Difference (Lower to Upper)
Equal variances assumed | 4.539 | 35 | .000 | 3.648 | .804 | 2.016 to 5.280
Equal variances not assumed | 4.510 | 32.342 | .000 | 3.648 | .809 | 2.001 to 5.295

Null Hypothesis is:

P-value =
Decision (circle correct answer): Reject Null/ Do
not reject Null
Conclusion:
Example 2: Solution
Ignore the shaded part of the output for now!

Levene's Test for Equality of Variances: F = 2.328, Sig. = .136

T-test results | t | df | Sig. (2-tailed) | Mean Difference | Std. Error Difference | 95% CI of the Difference (Lower to Upper)
Equal variances assumed | 4.539 | 35 | .000 | 3.648 | .804 | 2.016 to 5.280
Equal variances not assumed | 4.510 | 32.342 | .000 | 3.648 | .809 | 2.001 to 5.295

H0: μnew = μplacebo
As p < 0.05, DO reject the null
There IS evidence of a difference in weight loss between treatment and placebo

[Figure: t-distribution with both tails shaded beyond -4.539 and 4.539; P(t < -4.539) is < 0.001 and P(t > 4.539) is < 0.001]
Assumptions in t-Tests
• Normality: plot histograms
  ₋ One plot of the paired differences for any paired data
  ₋ Two (one for each group) for independent samples
  ₋ They don’t have to be perfect, just roughly symmetric
• Equal population variances: compare sample standard deviations
  ₋ As a rough estimate, one should be no more than twice the other
  ₋ Do an F-test (Levene’s in SPSS) to formally test for differences
• However, the t-test is very robust to violations of the assumptions of Normality and equal variances, particularly for moderate (i.e. >30) and larger sample sizes
Levene’s Test for Equal Variances from Example 2

Levene's Test for Equality of Variances: F = 2.328, Sig. = .136

T-test results | t | df | Sig. (2-tailed) | Mean Difference | Std. Error Difference | 95% CI of the Difference (Lower to Upper)
Equal variances assumed | 4.539 | 35 | .000 | 3.648 | .804 | 2.016 to 5.280
Equal variances not assumed | 4.510 | 32.342 | .000 | 3.648 | .809 | 2.001 to 5.295

Null hypothesis is that the population variances are equal, i.e. H0: σ²new = σ²placebo
Since p = 0.136, which is > 0.05, we do not reject the null,
i.e. we can assume equal variances and read the "Equal variances assumed" row
What if the assumptions are not met?

• There are alternative tests which do not have these assumptions

Test | Check | Equivalent non-parametric test
Independent t-test | Histograms of data by group | Mann-Whitney
Paired t-test | Histogram of paired differences | Wilcoxon signed rank
Sampling Variation

Every sample taken from a population will contain different numbers, so the mean varies.

Population: Mean = ?, SD = ?
Sample A: n = 20, Mean = 277
Sample B: n = 50, Mean = 274
Sample C: n = 300, Mean = 275

Which estimate is most reliable?
How certain or uncertain are we?
Confidence Intervals
A range of values within which we are confident (in terms of probability) that the true value of a population parameter lies.
A 95% CI is interpreted as follows: if the study were repeated many times, 95% of the time the CI would contain the true value of the population parameter,
i.e. 5% of the time the CI would fail to contain the true value of the population parameter.
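To see why a larger sample gives a more reliable estimate, the sketch below builds approximate 95% intervals for samples A, B and C from the Sampling Variation slide. The population SD is not given on the slide, so `ASSUMED_SD = 30` is a purely hypothetical value, and the normal (z) multiplier is used for simplicity even though a t multiplier would be slightly wider for n = 20.

```python
import math
from statistics import NormalDist

ASSUMED_SD = 30.0                    # hypothetical SD; the slide leaves it unknown
z = NormalDist().inv_cdf(0.975)      # about 1.96 for a 95% interval

samples = {"A": (20, 277), "B": (50, 274), "C": (300, 275)}  # name: (n, sample mean)
intervals = {}
for name, (n, mean) in samples.items():
    margin = z * ASSUMED_SD / math.sqrt(n)   # margin of error shrinks with sqrt(n)
    intervals[name] = (mean - margin, mean + margin)
    print(f"Sample {name}: n = {n:>3}, 95% CI = {mean - margin:.1f} to {mean + margin:.1f}")
```

Whatever SD is assumed, the interval for Sample C is the narrowest, because the margin of error is divided by the square root of the sample size.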
Exercise
• Discuss what the interpretation is for the confidence interval from Example 2 (weight loss was measured after taking either a new weight loss treatment or placebo for 8 weeks) highlighted below:

Levene's Test for Equality of Variances: F = 2.328, Sig. = .136

T-test results | t | df | Sig. (2-tailed) | Mean Difference | Std. Error Difference | 95% CI of the Difference (Lower to Upper)
Equal variances assumed | 4.539 | 35 | .000 | 3.648 | .804 | 2.016 to 5.280
Exercise: Solution
• Discuss what the interpretation is for the confidence interval from Example 2 highlighted below:

Levene's Test for Equality of Variances: F = 2.328, Sig. = .136

T-test results | t | df | Sig. (2-tailed) | Mean Difference | Std. Error Difference | 95% CI of the Difference (Lower to Upper)
Equal variances assumed | 4.539 | 35 | .000 | 3.648 | .804 | 2.016 to 5.280

The true mean difference in weight loss between the new treatment and
placebo is likely to be between about 2 and 5 kg.
The whole interval is positive (it does not contain zero), hence the
hypothesis test rejected the null that the difference is zero
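The interval in the output can be rebuilt from the mean difference and its standard error. In this sketch the critical value 2.030 (two-sided 95%, 35 degrees of freedom) is taken from standard t tables rather than from the slides, and the variable names are illustrative.

```python
mean_diff = 3.648     # mean difference in weight loss (from the output)
se_diff = 0.804       # standard error of the difference (from the output)
t_crit = 2.030        # assumed: two-sided 95% critical value of t with 35 df, from tables

lower = mean_diff - t_crit * se_diff
upper = mean_diff + t_crit * se_diff
print(f"95% CI: {lower:.3f} to {upper:.3f}")
print("interval excludes zero:", lower > 0 or upper < 0)
```

The result agrees with the SPSS limits (2.016 to 5.280) to rounding, and an interval that excludes zero matches a two-sided test rejecting the null at the 5% level.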
Two categorical variables
Are boys more likely to prefer maths and science than girls?

Variables:
• Favourite subject (Nominal)
• Gender (Binary/Nominal)

Summarise using %’s / stacked or multiple bar charts
Test: Chi-squared
Tests for a relationship between two categorical variables
Chi-squared test statistic

• The chi-squared test is used when we want to see if two categorical variables are related
• The test statistic for the Chi-squared test uses the sum of the squared differences between each pair of observed (O) and expected (E) values, each divided by the expected value:

$$\chi^2 = \sum_{i=1}^{n} \frac{(O_i - E_i)^2}{E_i}$$
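The calculation behind the statistic can be sketched in a few lines. This is an illustrative Python version, not the SPSS procedure: `chi_squared` is a hypothetical helper, expected counts are taken as row total × column total / grand total, and the 2x2 table is invented purely for demonstration.

```python
def chi_squared(observed):
    """Return (statistic, expected counts, degrees of freedom) for a two-way table."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    # Expected count under no association: row total x column total / grand total
    expected = [[r * c / grand_total for c in col_totals] for r in row_totals]
    statistic = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, expected, df

# Hypothetical 2x2 table of counts, invented for illustration only
table = [[10, 20],
         [20, 10]]
statistic, expected, df = chi_squared(table)
print(f"chi-squared = {statistic:.3f}, df = {df}")
```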
Chi-squared Test?
• Null: There is NO association between class and survival
• Alternative: There IS an association between class and survival

[3x2 contingency table of class by survival]
Using SPSS

Analyse → Descriptive Statistics → Crosstabs
Click on the ‘Statistics’ button and select Chi-squared

Test Statistic = 127.859
p-value: p < 0.001

Note: Double clicking on the output will display the p-value to more decimal places
Hypothesis Testing: Decision Rule
• We can use statistical software to undertake a hypothesis test, e.g. SPSS
• One part of the output is the p-value (P)
• If P < 0.05, reject H0 => evidence of HA being true (i.e. there IS an association)
• If P > 0.05, do not reject H0 (i.e. there is NO evidence of an association)


Interpretation

Since p < 0.05 we reject the null.

There is evidence (χ² = 127.86, df = 2, p < 0.001) to suggest that there is an association between class and survival.

But… what is the nature of this association/relationship?
Low EXPECTED Cell Counts with the Chi-squared test

We have no cells with expected counts below 5.

SPSS Output (expected counts)
             Died   Survived   Total
1st Class    200    123        323
2nd Class    171    106        277
3rd Class    438    271        709
Total        809    500        1,309
Low Cell Counts with the Chi-squared test

• Check the number of cells with EXPECTED counts less than 5
• SPSS reports the % of cells with an expected count < 5
• If more than 20%, then the test statistic does not approximate a chi-squared distribution very well
• If any expected cell counts are < 1, then the chi-squared distribution cannot be used
• In either case, if you have a 2x2 table, use Fisher’s Exact test (SPSS reports this for 2x2 tables)
• In larger tables (3x2 etc.), combine categories to make cell counts larger (providing it’s meaningful)
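The expected-count check can be reproduced from the row and column totals alone. The sketch below uses the totals from the class-by-survival table and applies the 5-count rule described above; it is an illustration in Python, not SPSS output, and the variable names are assumptions.

```python
row_totals = {"1st Class": 323, "2nd Class": 277, "3rd Class": 709}
col_totals = {"Died": 809, "Survived": 500}
grand_total = sum(row_totals.values())  # 1,309 passengers

# Expected count for each cell = row total x column total / grand total
expected = {
    (row, col): r * c / grand_total
    for row, r in row_totals.items()
    for col, c in col_totals.items()
}

n_small = sum(1 for e in expected.values() if e < 5)
pct_small = 100 * n_small / len(expected)
print(f"smallest expected count = {min(expected.values()):.1f}")
print(f"{pct_small:.0f}% of cells have an expected count below 5")
```

Rounded to whole numbers, these values match the counts shown in the table, and none is anywhere near 5.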
Scatterplot
Relationship between two scale variables:
• Explores the way the two co-vary (correlate):
  ₋ Positive / negative
  ₋ Linear / non-linear
  ₋ Strong / weak
• Presence of outliers
• Statistic used: r = correlation coefficient
Correlation Coefficient r
• Measures strength of a relationship between two continuous variables

[Figure: three scatterplots with r = 0.9, r = 0.01 and r = -0.9]
Correlation Interpretation
An interpretation of the size of the coefficient has been described by Cohen (1992) as:

Correlation coefficient value | Relationship
-0.3 to +0.3 | Weak
-0.5 to -0.3 or 0.3 to 0.5 | Moderate
-0.9 to -0.5 or 0.5 to 0.9 | Strong
-1.0 to -0.9 or 0.9 to 1.0 | Very strong

Cohen, J. (1992). A power primer. Psychological Bulletin, 112(1), 155-159
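The table can be turned into a small lookup. In this sketch `describe_correlation` is a hypothetical helper; the table's ranges share their end points, so treating 0.3, 0.5 and 0.9 as the start of the stronger category is an assumption, and the example values are invented.

```python
def describe_correlation(r):
    """Label the strength of r using the cut-offs in the table above."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("a correlation coefficient must lie between -1 and 1")
    size = abs(r)  # strength ignores the sign; the sign gives the direction
    if size < 0.3:
        return "Weak"
    if size < 0.5:
        return "Moderate"
    if size < 0.9:
        return "Strong"
    return "Very strong"

# Hypothetical coefficients, for illustration only
for r in (0.1, 0.4, -0.7, 0.95):
    direction = "positive" if r > 0 else "negative"
    print(f"r = {r:+.2f}: {describe_correlation(r)} {direction} relationship")
```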
r-value | Verbal Description
0.00 – 0.29 | Little or weak positive (negative) correlation
0.30 – 0.49 | Low positive (negative) correlation
0.50 – 0.69 | Moderate positive (negative) correlation
0.70 – 0.89 | High positive (negative) correlation
0.90 – 1.00 | Very high positive (negative) correlation
Hypothesis tests for r
Tests the null hypothesis that the population correlation is 0, NOT that there is a strong relationship!

It is highly influenced by the number of observations, e.g. a sample size of about 150 is enough for a correlation of only 0.16 to be classed as significant!

Better to use Cohen’s interpretation


Exercise
• Interpret the following correlation coefficients using Cohen’s interpretation and explain what they mean

Relationship | Correlation
Average IQ and chocolate consumption | 0.27
Road fatalities and Nobel winners | 0.55
Gross Domestic Product and Nobel winners | 0.7
Mean temperature and Nobel winners | -0.6


Exercise - solution

Relationship | Correlation | Interpretation
Average IQ and chocolate consumption | 0.27 | Weak positive relationship. More chocolate per capita = higher average IQ
Road fatalities and Nobel winners | 0.55 | Strong positive. More accidents = more prizes!
Gross Domestic Product and Nobel winners | 0.7 | Strong positive. Wealthy countries = more prizes
Mean temperature and Nobel winners | -0.6 | Strong negative. Colder countries = more prizes.
Regression: Association between two variables
• Regression is useful when we want to
  a) look for significant relationships between two variables
  b) predict a value of one variable for a given value of the other

It involves estimating the line of best fit through the data which minimises the sum of the squared residuals

What are the residuals?


Regression
Simple linear regression looks at the relationship between two Scale variables by producing an equation for a straight line of the form

$$y = a + \beta x$$

which uses the independent variable (x) to predict the dependent variable (y)
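The line of best fit can be computed directly from the least-squares formulas. This is a minimal sketch under stated assumptions: `fit_line` is a hypothetical helper, the data are invented, and SPSS reports the same two coefficients in its regression table.

```python
def fit_line(x, y):
    """Least-squares intercept and slope for a straight line through (x, y)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Hypothetical data, invented for illustration only
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

a, b = fit_line(x, y)
# A residual is an observed value minus the value predicted by the line
residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
print(f"y = {a:.2f} + {b:.2f}x")
print("sum of squared residuals =", round(sum(e * e for e in residuals), 4))
```

No other straight line through these points gives a smaller sum of squared residuals, which is what "best fit" means here.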
Output from SPSS
• Key regression table: Y = -6.66 + 0.36x, p-value < 0.001
• As p < 0.05, gestational age is a significant predictor of birth weight. Weight increases by 0.36 lbs for each week of gestation.
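Once the equation is known, a prediction is a matter of substitution. The sketch below plugs gestational ages into the fitted line; the function name and the chosen weeks are illustrative assumptions, and predictions are only sensible within the range of gestational ages actually observed.

```python
def predict_birth_weight(gestation_weeks):
    """Predicted birth weight in lbs from the fitted line Y = -6.66 + 0.36x."""
    return -6.66 + 0.36 * gestation_weeks

# Illustrative gestational ages (weeks)
for weeks in (36, 38, 40):
    print(f"{weeks} weeks: {predict_birth_weight(weeks):.2f} lbs")
```

Each extra week adds the slope, 0.36 lbs, to the predicted weight.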
How reliable are predictions? – R²
How much of the variation in birth weight is explained by the model including gestational age?

Proportion of the variation in birth weight explained by the model: R² = 0.499 ≈ 50%
Predictions using the model are fairly reliable.

Which variables may help improve the fit of the model?
Compare models using Adjusted R²
Assumptions for regression

Assumption | Plot to check
The relationship between the independent and dependent variables is linear. | Original scatter plot of the independent and dependent variables
Homoscedasticity: the variance of the residuals about predicted responses should be the same for all predicted responses. | Scatterplot of standardised predicted values and residuals
The residuals are independent and normally distributed. | Plot the residuals in a histogram
Checking normality

Histogram of the residuals looks approximately normally distributed

When writing up, just say ‘normality checks were carried out on the residuals and the assumption of normality was met’
Outliers are outside
Predicted values against residuals
Are there any patterns as the predicted values increase?

There is a problem with homoscedasticity if the scatter is not random. A “funnelling” shape such as this suggests problems.
What if assumptions are not met?
• If the residuals are heavily skewed or the residuals show different variances as predicted values increase, the data needs to be transformed
• Try taking the natural log (ln) of the dependent variable. Then repeat the analysis and check the assumptions
Regression question

• Pre-pregnancy weight p-value:
• Regression equation:
• Interpretation:

R² = 0.152
Does the model result in reliable predictions?
Check the assumptions
Correlation

• Pearson’s correlation = 0.39
• Describe the relationship using the scatterplot and correlation coefficient
• There is a moderate positive linear relationship between mothers’ pre-pregnancy weight and birth weight (r = 0.39). Generally, birth weight increases as the mother’s weight increases
Regression
• Pre-pregnancy weight p-value: p = 0.011
• Regression equation: y = 3.16 + 0.03x
• Interpretation:
  ₋ There is a significant relationship between a mother’s pre-pregnancy weight and the weight of her baby (p = 0.011). Pre-pregnancy weight has a positive effect on a baby’s weight, with an increase of 0.03 lbs for each extra pound a mother weighs.
• Does the model result in reliable predictions?
  ₋ Not really. Only 15.2% of the variation in birth weight is accounted for using this model.
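With a single predictor, R² is simply the square of Pearson's r, which ties the correlation slide to the R² reported for this model. A quick check, with illustrative variable names:

```python
r = 0.39             # Pearson's correlation between pre-pregnancy weight and birth weight
r_squared = r ** 2   # in simple linear regression, R squared equals r squared

print(f"R squared = {r_squared:.3f}")
print(f"variation explained = {100 * r_squared:.1f}%")
```

This reproduces the 15.2% above, and shows why a "moderate" correlation can still leave most of the variation unexplained.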
Multiple regression

• In addition to the standard linear regression checks, relationships BETWEEN independent variables should be assessed
• Multicollinearity is a problem where continuous independent variables are too highly correlated (r > 0.8)
• Relationships can be assessed using scatterplots and correlation for scale variables
• SPSS can also report collinearity statistics on request. The VIF should be close to 1; under 5 is fine, whereas 10 or more needs checking
• Which variables are most strongly related?
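For the special case of exactly two predictors, the VIF follows directly from their correlation: VIF = 1 / (1 − r²). The sketch below tabulates this to show how the correlation and VIF rules of thumb relate; `vif_two_predictors` is a hypothetical helper, and with more than two predictors the r² is replaced by the R² from regressing one predictor on all the others.

```python
def vif_two_predictors(r):
    """VIF for either predictor when a model has exactly two predictors correlated at r."""
    return 1.0 / (1.0 - r ** 2)

# Illustrative correlations between the two predictors
for r in (0.0, 0.5, 0.8, 0.9, 0.95):
    print(f"r = {r:.2f}: VIF = {vif_two_predictors(r):.2f}")
```

Uncorrelated predictors give a VIF of exactly 1, and the VIF climbs steeply as r approaches 1.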
Research question

• Clear questions with measurable quantities
• Which variables will help answer these questions
• Think about what test is needed before carrying out a study so that the right type of variables are collected
Dependent variables

INDEPENDENT (explanatory/predictor) variable → affects → DEPENDENT (outcome) variable

Does attendance have an association with exam score?
Do women do more housework than men?
What variable type is the dependent?

Dependent
• Scale
• Categorical
  ₋ Ordinal
  ₋ Nominal
How many variables are involved?
• Two – interested in the relationship
• One dependent and one independent
• One dependent and several independent variables: some may be controls
• Relationships between more than two: multivariate techniques (not covered here)
Data types

Research question | Dependent/outcome variable | Independent/explanatory variable
Does attendance have an association with exam score? | Exam score (Scale) | Attendance (Scale)
Do women do more housework than men? | Hours of housework per week (Scale) | Gender (Binary)
Exercise:
How would you investigate the following topics?
State the dependent and independent variables and their variable types.

Research question | Dependent/outcome variable | Independent/explanatory variable
Does weekly hours of work influence the amount of time spent on housework? | |
Which of 3 diets is best for losing weight? | |
Exercise: Solution
How would you investigate the following topics?
State the dependent and independent variables and their variable types.

Research question | Dependent/outcome variable | Independent/explanatory variable
Does weekly hours of work influence the amount of time spent on housework? | Hours of housework (Scale) | Hours of work (Scale)
Which of 3 diets is best for losing weight? | Weight lost on diet (Scale) | Diet (Nominal)
Comparing means
• Comparing BETWEEN groups
  ₋ 2 groups: Independent t-test
• Comparing measurements WITHIN the same subject
  ₋ Paired t-test
Comparing means
• Comparing BETWEEN groups
  ₋ 2 groups: Independent t-test
  ₋ One way ANOVA
• Comparing measurements WITHIN the same subject
  ₋ Paired t-test

ANOVA = Analysis of variance


Exercise

Research question | Dependent variable | Independent variable | Test
Do women do more housework than men? | | |
Does Margarine X reduce cholesterol? Everyone has cholesterol measured on 3 occasions | | |
Which of 3 diets is best for losing weight? | | |
Exercise: Solution

Research question | Dependent variable | Independent variable | Test
Do women do more housework than men? | Housework (hrs per week) (Scale) | Gender (Nominal) | Independent t-test
Does Margarine X reduce cholesterol? Everyone has cholesterol measured on 3 occasions | Cholesterol (Scale) | Occasion (Nominal) | Repeated measures ANOVA
Which of 3 diets is best for losing weight? | Weight lost on diet (Scale) | Diet (Nominal) | One-way ANOVA
Tests investigating relationships

Investigating relationships between | Dependent variable | Independent variable | Test
2 categorical variables | Categorical | Categorical | Chi-squared test
2 Scale variables | Scale | Scale | Pearson’s correlation
Predicting the value of a dependent variable from the value of an independent variable | Scale | Scale/binary | Simple Linear Regression
Predicting the value of a dependent variable from the value of an independent variable | Binary | Scale/binary | Logistic regression

Note: Multiple linear regression is when there are several independent variables
Exercise: Solution

Research question | Dependent variable | Independent variables | Test
Does attendance affect exam score? | Exam score (Scale) | Attendance (Scale) | Correlation/regression
Do women do more housework than men? | Housework (hrs per week) (Scale) | Gender (Binary), Hours worked (Scale) | Regression
Were Americans more likely to survive on board the Titanic? | Survival (Binary) | Nationality (Nominal) | Chi-squared
Were Americans more likely to survive on board the Titanic? | Survival (Binary) | Nationality, Gender, class | Logistic regression

Note: There may be 2 appropriate tests for some questions
Parametric or non-parametric?
Statistical tests fall into two types:
• Parametric tests: assume data follows a particular distribution, e.g. normal
• Non-parametric: nonparametric techniques are usually based on ranks/signs rather than actual data
Non-parametric tests

Parametric test | What to check for normality | Non-parametric test
Independent t-test | Dependent variable by group | Mann-Whitney test
Paired t-test | Paired differences | Wilcoxon signed rank test
One-way ANOVA | Residuals/Dependent | Kruskal-Wallis test
Repeated measures ANOVA | Residuals | Friedman test
Pearson’s Correlation Co-efficient | At least one of the variables should be normal | Spearman’s Correlation Co-efficient
Linear Regression | Residuals | None – transform the data

Notes: The residuals are the differences between the observed and expected values.
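When planning an analysis, the table can be kept as a simple lookup. The sketch below is only a convenience wrapper around the pairings listed above; the dictionary and function names are hypothetical.

```python
# Pairings copied from the table above; None marks "no rank-based equivalent listed"
NON_PARAMETRIC_ALTERNATIVE = {
    "Independent t-test": "Mann-Whitney test",
    "Paired t-test": "Wilcoxon signed rank test",
    "One-way ANOVA": "Kruskal-Wallis test",
    "Repeated measures ANOVA": "Friedman test",
    "Pearson's correlation": "Spearman's correlation",
    "Linear regression": None,
}

def alternative_for(test_name):
    """Return the non-parametric alternative from the table, or advice to transform."""
    alternative = NON_PARAMETRIC_ALTERNATIVE[test_name]
    return alternative if alternative is not None else "None: transform the data"

print(alternative_for("Paired t-test"))
print(alternative_for("Linear regression"))
```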
Summary

Dependent
• Scale
  ₋ Normally distributed: Parametric test
  ₋ Skewed data: Non-parametric
• Categorical
  ₋ Ordinal: Non-parametric
  ₋ Nominal: Chi-squared
Q&A
• "We must be careful not to confuse data with
the abstractions we use to analyze them."
William James.
• "Maturity is the capacity to endure uncertainty."
John Finley.
• "Natural selection is a mechanism for
generating an exceedingly high degree of
improbability." R. A. Fisher.
