Training session 2
Objectives
• To describe opening and closing SPSS
• To introduce the look and structure of SPSS
• To introduce the data entry windows: Data View and Variable View
• To outline the components necessary to define a variable
• To introduce the SPSS online tutorial
Uses for SPSS
• Data management
• Data analysis
Data management
• Defining variables
• Coding values
• Entering and editing data
• Creating new variables
• Recoding variables
• Selecting cases
Data analysis
• Univariate statistics
• Bivariate statistics
• Multivariate statistics
Opening SPSS
• Double click the SPSS icon on the desktop
[Screenshot: the Data Editor window, annotated to show cell information, the View tabs and the status bar]
Data View
• Rows represent cases or observations, that is, the objects on which
data have been collected:
• For example, rows represent the contents of a single treatment data
collection form, the information on an individual
• Columns represent variables or characteristics of the object of
interest:
• For example, each column contains the answers to the questions on the
treatment data collection form: age, gender, primary drug of use, etc.
Data Editor
• Data Editor comprises two screens:
• Data View: the previous screen
• Variable View: used to define the variables
• To move between the two:
• Use the View tab at the bottom of the screen
OR
• Ctrl + T
OR
• View/Variables from the Data View window
• View/Data from the Variable View window
Variable View
The data entry process
• Define your variables in Variable View
• Enter the data, the values of the variables, in Data
View
Definition of variables
10 characteristics are used to define a variable:
• Name
• Type
• Width
• Decimals
• Label
• Values
• Missing
• Columns
• Align
• Measure
Name
• Each variable must have a unique name of not more than 8 characters
and starting with a letter
• Try to give meaningful variable names:
• Describing the characteristic: for example, age
• Linking to the questionnaire: for example, A1Q3
• Keep the names consistent across files
Type
• Internal formats:
• Numeric
• String (alphanumeric)
• Date
• Output formats:
• Comma
• Dot
• Scientific notation
• Dollar
• Custom currency
Numeric
• Numeric variables:
• Numeric measurements
• Codes
• Definition of the size of the variable
String (alphanumeric)
• String variables contain words or characters; strings can include
digits, but these are treated as characters, so mathematical operations
cannot be applied to them
• The maximum size of a string variable is 255 characters
Date
• The input format for date variables must be defined, such as
DD/MM/YYYY, MM/DD/YYYY or MM/DD/YY
• Computers store dates as numbers from a base date; in SPSS, dates
are stored as the number of seconds from 14 October 1582
Example
• Create two variables:
• ID: the unique identifier, which will be alphanumeric
with a maximum of 8 characters
• Age: the age of the respondent measured in years, a
discrete variable ranging between 10 and 100
Click in the Type column for ID to open the Variable Type dialogue box. Click the String radio button and change the number of characters to the size of the variable, 8 in this case. Click OK.
Click in the Type column in the second row and define a numeric variable with a maximum width of 3 and no decimal places.
Click OK to continue.
Note that a number of default values have been entered into the remaining columns.
Labels
• Descriptors for the variables
• Maximum 255 characters
• Used in the output
Variable labels added
Values
• Value labels are descriptors of the categories of a variable
• Coding
Missing
• Defines missing values
• The values are excluded from some analysis
• Options:
• Up to 3 discrete missing values
• A range of missing values plus one discrete missing value
Click in the Missing Values column to obtain the dialogue box below. Enter the value 999 for Age.
Missing values added
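The effect of declaring 999 a missing value can be sketched with plain Python (illustrative only; the ages are hypothetical): the code stays in the data but is excluded from analysis.

```python
# 999 is the declared missing-value code for Age, as on the slide above.
MISSING = {999}

ages = [34, 27, 999, 41, 999, 19]  # hypothetical data

# Analyses use only the valid values; 999s are set aside.
valid = [a for a in ages if a not in MISSING]
n_valid = len(valid)
n_missing = len(ages) - n_valid
mean_age = sum(valid) / n_valid
```

This mirrors the Valid/Missing split SPSS reports in its output tables.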
Columns and Align
• Columns sets the amount of space reserved to display the contents of
the variable in Data View; generally the default value is adequate
• Align sets whether the contents of the variable appear on the left,
centre or right of the cell in Data View
• Numeric variables are right-hand justified by default and string
variables left-hand justified by default; the defaults are generally
adequate
Measure
• Levels of measurement:
• Nominal
• Ordinal
• Interval
• Ratio
• In SPSS, interval and ratio are designated together
as Scale
• The default for string variables is Nominal
• The default for numeric variables is Scale
Returning to Data View, the first two column headings will reflect the two variables created: ID and Age. Here the first six
observations have been entered.
Exercise: define the necessary variables and enter the following data
Saving the file
• The file must be saved regularly in order to preserve the work that has
been done to date:
• File/Save
• Move to the target directory
• Enter a file name
• Save
Summary
• Data Editor:
• Data View
• Variable View
• File/Save
• Variable definition:
• Name
• Type
• Width
• Decimals
• Label
• Values
• Missing
• Columns
• Align
• Measure
Training session 4
• Name of treatment centre
• Referral source
• Gender
• Age
• Home language
• Region of permanent residence
• Highest level of education completed
• Employment status
• Current marital status
• How old was the patient when they first began using drugs regularly?
Level of measurement in SPSS
• Nominal
• Ordinal
• Scale
Exercise: measure
• Return to Ex1.sav and set the level of measurement for the variables
ID, DRUG, AGE and COND
• Save the file
Summary
• Variable types:
• Discrete (categorical)/continuous
• Quantitative/qualitative
• Question types:
• Closed/Open
• Factual/Attitudinal
• Levels of measurement
Training session 4
Percentage = (f1 / n) × 100%, where f1 is the frequency of a category and n is the total number of cases
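The percentage formula above can be sketched directly in Python (the example frequencies are hypothetical):

```python
# Relative frequency as a percentage: (f / n) * 100.
def percent(f: int, n: int) -> float:
    """Percentage of n cases falling in a category with frequency f."""
    return f / n * 100

# e.g. a category with 30 cases out of 120 in total
p = percent(30, 120)  # 25.0
```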
Exercise: frequency of referral
• Construct a frequency table for referral source in the file main.sav
[Frequency table for Referral: Frequency, Percent, Valid Percent, Cumulative Percent]
[Bar chart: percent of cases by referral source — categories Self/Fam/Friends, Welfare, Employer, Health Pro, Courts/Corrections, Religious Grp, School, Hosp/Clinic, Unknown]
Frequencies: Statistics button
Referral Statistics
N Valid    1541
  Missing    30
Mode          1
Frequencies: syntax
FREQUENCIES
  VARIABLES=refsourc
  /FORMAT=DFREQ
  /STATISTICS=MODE
  /BARCHART PERCENT
  /ORDER=ANALYSIS.
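What FREQUENCIES reports — counts, Percent, Valid Percent and the mode — can be sketched with Python's standard library (illustrative; the codes below are hypothetical, and None stands in for a missing value):

```python
from collections import Counter

# Hypothetical referral-source codes; None marks a missing value.
refsourc = [1, 1, 2, 3, 1, None, 2, 1]

valid = [v for v in refsourc if v is not None]
counts = Counter(valid)
n_total, n_valid = len(refsourc), len(valid)

# Percent uses all cases; Valid Percent excludes the missing ones.
percent = {k: c / n_total * 100 for k, c in counts.items()}
valid_percent = {k: c / n_valid * 100 for k, c in counts.items()}
mode = counts.most_common(1)[0][0]  # most frequent code
```

This is why Percent and Valid Percent differ whenever there are missing values.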
Exercise: frequencies
• Generate a frequency table and bar chart for each of the following
variables and comment:
• Race
• Education
• Employment
• Save the output and the syntax file
Frequency: Race
Race
Value    Frequency  Percent  Valid Percent  Cumulative Percent
23.00    10         .6       .6             98.4
24.00    11         .7       .7             99.1
25.00    5          .3       .3             99.4
34.00    4          .3       .3             99.7
234.00   5          .3       .3             100.0
These are out-of-range values (note that none of the digits is greater than 5).
Percentages
• The difference in sample size for men and women makes comparison
of raw numbers difficult
• Percentages facilitate comparison by standardizing the scale
• There are three options for the denominator of the percentage:
• Grand total
• Row total
• Column total
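The three denominators can be sketched in Python, using counts taken from the Mode of ingestion * Gender table shown on the following slides:

```python
# One cell of the cross-tabulation: Snort & Male.
cell = 44
row_total = 61       # all Snort cases
col_total = 1271     # all males
grand_total = 1515   # all valid cases

pct_of_total = cell / grand_total * 100   # grand total as denominator
pct_within_row = cell / row_total * 100   # row total as denominator
pct_within_col = cell / col_total * 100   # column total as denominator
```

The same count gives three different percentages depending on the denominator, which is why the choice must match the question being asked.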
Mode of ingestion Drug1 * Gender cross-tabulation
                      Gender
               Male    Female   Total
Snort  Count   44      17       61
  % of Total   2.9%    1.1%     4.0%
Inject Count   20      10       30
The cell counts give the joint distribution of Mode1 and Gender; the Total row and Total column give the marginal distributions of each variable.
Mode of ingestion Drug1 * Gender cross-tabulation
                                        Male    Female   Total
Snort   Count                           44      17       61
        % within Mode of ingestion      72.1%   27.9%    100.0%
Inject  Count                           20      10       30
        % within Mode of ingestion      66.7%   33.3%    100.0%
Snort   % within Gender                 3.6%    5.7%     4.0%
Inject  % within Gender                 1.6%    3.4%     2.0%
Total   Count                           1271    298      1515
[Crosstabs dialogue box: the definitions of the vertical (row) and horizontal (column) variables set the dimensions of the table]
Two-by-two tables
• Tables with two rows and two columns
• A range of simple descriptive statistics can be applied to two-by-two
tables
• It is possible to collapse larger tables to these dimensions
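Two of the simple descriptive statistics for two-by-two tables, the relative risk and the odds ratio, can be sketched in Python (the counts a, b, c, d below are hypothetical):

```python
# Hypothetical two-by-two table:
#                "failure"   no "failure"
# exposed            a=20         b=80
# unexposed          c=10         d=90
a, b, c, d = 20, 80, 10, 90

risk_exposed = a / (a + b)      # 0.20
risk_unexposed = c / (c + d)    # 0.10
relative_risk = risk_exposed / risk_unexposed

odds_ratio = (a * d) / (b * c)  # cross-product ratio
```

A relative risk of 2.0 says the outcome is twice as likely in the exposed group; the odds ratio (here 2.25) approximates it when the outcome is rare.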
Gender * White pipe cross-tabulation
White pipe
Yes No Total
Yes No
Relative risk of
“failure”
Exercise 1: cross-tabulations
• Create and comment on the following cross-tabulations:
• Age vs Gender (Row total)
• Region vs Gender (Column total)
• School vs Gender (Grand total)
• Year of study vs School (All three)
• Suggest other cross-tabulations that would be useful
Exercise 2: cross-tabulation
• Construct a dichotomous variable for age: Up to 24 years and Above
24 years
• Construct a dichotomous variable for the primary drug of use:
Alcohol and Not Alcohol
• Create a cross-tabulation of the two new variables and interpret
• Generate Relative Risks and Odds Ratios and interpret
Summary
• Cross-tabulations
• Joint frequencies
• Marginal frequencies
• Row/Column/Total percentages
• Relative risk
• Odds
• Odds ratios
• Working with relationships between two variables
[Scatterplot: Size of Teaching Tip ($0 to $80) on the x-axis against Stats Test Score (0 to 100) on the y-axis]
Correlation & Regression
• Univariate & Bivariate Statistics
• U: frequency distribution, mean, mode, range, standard deviation
• B: correlation – two variables
• Correlation
• linear pattern of relationship between one variable (x) and another variable (y) – an
association between two variables
• relative position of one variable correlates with relative distribution of another variable
• graphical representation of the relationship between two variables
• Warning:
• No proof of causality
• Cannot assume x causes y
Scatterplot!
• No Correlation
• Random or circular assortment
of dots
• Positive Correlation
• ellipse leaning to right
• GPA and SAT
• Smoking and Lung Damage
• Negative Correlation
• ellipse leaning to left
• Depression & Self-esteem
• Studying & test errors
Pearson’s Correlation Coefficient
• “r” indicates…
• strength of relationship (strong, weak, or none)
• direction of relationship
• positive (direct) – variables move in same direction
• negative (inverse) – variables move in opposite directions
• r ranges in value from –1.0 to +1.0
-1.0 (Strong Negative) ... 0.0 (No Relationship) ... +1.0 (Strong Positive)
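The definition of r can be sketched from first principles in Python (illustrative data; perfectly linear decreasing pairs should give r = -1.0):

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson's r: covariance of x and y over the product of their spreads."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

r = pearson_r([1, 2, 3, 4], [8, 6, 4, 2])  # perfectly linear, decreasing
```

By construction r can never fall outside -1.0 to +1.0, which is why the number line above has those endpoints.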
• Go to the website to play with scatterplots
Practice with Scatterplots
[Four scatterplots; estimate r = .__ __ for each]
Correlations
                         Miles walked/day   Weight    Depression   Anxiety
Miles walked per day  r  1                  -.797**   -.800**      -.774**
         Sig. (2-tailed)                    .002      .002         .003
Weight                r  -.797**            1         .648*        .780**
         Sig. (2-tailed) .002                         .023         .003
Depression            r  -.800**            .648*     1            .753**
         Sig. (2-tailed) .002               .023                   .005
Anxiety               r  -.774**            .780**    .753**       1
         Sig. (2-tailed) .003               .003      .005
N = 12 for all cells.
**. Correlation is significant at the 0.01 level (2-tailed).
*. Correlation is significant at the 0.05 level (2-tailed).
Samples vs. Populations
• Sample statistics estimate Population parameters
• M tries to estimate μ
• r tries to estimate ρ (“rho”, the Greek letter, not “p”)
• r: the correlation for a sample
• based on the limited observations we have
•ρ actual correlation in population
• the true correlation
• Beware Sampling Error!!
• even if ρ=0 (there’s no actual correlation), you might get r =.08 or r = -.26 just by
chance.
• We look at r, but we want to know about ρ
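Sampling error can be demonstrated with a small simulation (an illustrative sketch: x and y are drawn independently, so the true ρ is 0, yet sample r values scatter around zero):

```python
import random
from math import sqrt

random.seed(1)  # fixed seed so the sketch is reproducible

def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy / sqrt(sum((a - mx) ** 2 for a in x) *
                      sum((b - my) ** 2 for b in y))

rs = []
for _ in range(200):
    x = [random.gauss(0, 1) for _ in range(10)]
    y = [random.gauss(0, 1) for _ in range(10)]  # independent of x: rho = 0
    rs.append(pearson_r(x, y))

# Some samples show sizeable "correlations" purely by chance.
largest = max(abs(r) for r in rs)
```

Even with ρ = 0, individual samples of n = 10 routinely produce r values well away from zero, which is exactly the warning on this slide.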
Hypothesis testing with Correlations
• Two possibilities
• Ho: ρ = 0 (no actual correlation; The Null Hypothesis)
• Ha: ρ ≠ 0 (there is some correlation; The Alternative Hyp.)
• Case #1 (see correlation worksheet)
• Correlation between distance and points r = -.904
• Sample small (n=6), but r is very large
• We guess ρ < 0 (we guess there is some correlation in the pop.)
• Case #2
• Correlation between aiming and points, r = .628
• Sample small (n=6), and r is only moderate in size
• We guess ρ = 0 (we guess there is NO correlation in pop.)
• Bottom-line
• We can only guess about ρ
• We can be wrong in two ways
Reading a Correlation Matrix

Correlations(a)

Pearson correlations (N = 6 for every cell):
                          Toss    Dist.   Spun    Aiming  Dext.   GPA     Conf.
Total ball toss points    1       -.904*  -.582   .628    .821*   -.037   -.502
Distance from target      -.904*  1       .279    -.653   -.883*  .228    .522
Time spun before throwing -.582   .279    1       -.390   -.248   -.087   .267
Aiming accuracy           .628    -.653   -.390   1       .758    -.546   -.250
Manual dexterity          .821*   -.883*  -.248   .758    1       -.553   -.101
College grade point avg   -.037   .228    -.087   -.546   -.553   1       -.524
Confidence for task       -.502   .522    .267    -.250   -.101   -.524   1

Sig. (2-tailed):
                          Toss    Dist.   Spun    Aiming  Dext.   GPA     Conf.
Total ball toss points    .       .013    .226    .181    .045    .945    .310
Distance from target      .013    .       .592    .159    .020    .664    .288
Time spun before throwing .226    .592    .       .445    .635    .869    .609
Aiming accuracy           .181    .159    .445    .       .081    .262    .633
Manual dexterity          .045    .020    .635    .081    .       .255    .848
College grade point avg   .945    .664    .869    .262    .255    .       .286
Confidence for task       .310    .288    .609    .633    .848    .286    .

*. Correlation is significant at the 0.05 level (2-tailed).
a. Day sample collected = Tuesday

Example: r = -.904 for toss points and distance from target; p = .013 is the probability of getting a correlation this size by sheer chance. Reject Ho if p ≤ .05.
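The p-value attached to a correlation comes from a t statistic. A sketch using the r = -.904, n = 6 example above (the critical value 2.776 is the standard two-tailed t for df = 4 at alpha = .05):

```python
from math import sqrt

# t = r * sqrt(n - 2) / sqrt(1 - r^2), with df = n - 2.
r, n = -0.904, 6
t = r * sqrt(n - 2) / sqrt(1 - r ** 2)  # about -4.23

T_CRIT = 2.776  # two-tailed critical t, df = 4, alpha = .05
significant = abs(t) > T_CRIT  # consistent with the reported p = .013 < .05
```

Because |t| exceeds the critical value, we reject Ho: ρ = 0 for this pair, matching the asterisk in the matrix.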
regression worksheet
[Scatterplot: Distance from target (8 to 26) on the x-axis against Total ball toss points (0 to 100) on the y-axis, with the fitted regression line; predicted values y’ = 47 and y’ = 20 are marked on the line; Rsq = 0.6031]
• “Predictor”: the x-axis variable, what you’re basing the prediction on
y’ = b(x) + a
y’ = -4.263(20) + 125.401 = 40.141
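The prediction equation y’ = b(x) + a can be sketched in Python with the slope and intercept from this example (the function name is illustrative):

```python
# Fitted line from the worksheet: slope b = -4.263, intercept a = 125.401.
def predict(x, b=-4.263, a=125.401):
    """Predicted y' for a given x under the fitted regression line."""
    return b * x + a

y_hat = predict(20)  # -4.263 * 20 + 125.401 = 40.141
```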
Coefficients(a)
                        Unstandardized Coefficients   Standardized Coefficients
Model                   B         Std. Error          Beta        t        Sig.
1  (Constant)           125.401   14.265                          8.791    .000
   Distance from target -4.263    .815                -.777       -5.230   .000
a. Dependent Variable: Total ball toss points
Predictive Ability
• Mantra!!
• As variability decreases, prediction accuracy increases
• if we can account for variance, we can make better predictions
• As r increases:
• r² increases
• “variance accounted for” increases
• the prediction accuracy increases
• prediction error decreases (distance between y’ and y)
• Sy’ decreases
• the standard error of the estimate (residual)
• measures overall amount of prediction error
• We like big r’s!!!
Drawing a Regression Line by Hand
Three steps
1. Plug zero in for x to get a y’ value, and then plot this point
• Note: it will be the y-intercept
2. Plug in a large value for x (so that the point falls at the right end of the
graph) and plot the resulting point
3. Connect the two points with a straight line!
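The three steps above, sketched in Python using the line fitted earlier in this session (y’ = -4.263x + 125.401; the choice of 26 for the right-hand x is illustrative):

```python
# Slope and intercept from the fitted line.
b, a = -4.263, 125.401

# Step 1: x = 0 gives the y-intercept.
p1 = (0, b * 0 + a)              # (0, 125.401)

# Step 2: a large x near the right edge of the graph.
x_right = 26
p2 = (x_right, b * x_right + a)  # about (26, 14.563)

# Step 3: the regression line is the straight line through p1 and p2.
```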
Chi-square Test of
Independence
Training session 6
Chi-square Test of Independence
• The chi-square test of independence is probably the most frequently used
hypothesis test in the social sciences.
• The chi-square test of independence can be used for any variable; the group
(independent) and the test variable (dependent) can be nominal, dichotomous,
ordinal, or grouped interval.
• If there is no relationship between gender and attending college and 40% of our
total sample attend college, we would expect 40% of the males in our sample to
attend college and 40% of the females to attend college.
[Bar charts: proportion attending college by gender. Left, independent relationship between gender and college: Males 40%, Females 40%, Total 40%. Right, dependent relationship: the male and female percentages (for example 60% and 20%) differ from the 40% total.]
• Since the proportion of subjects in each category of the group variable can differ,
we take group category into account in computing expected frequencies as well.
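The expected-frequency rule the bullets describe is: expected count = row total × column total / grand total. A sketch with hypothetical counts chosen so that 40% attend college overall, as in the slide's example:

```python
# Hypothetical sample: 200 males (80 attend college), 300 females (120 attend).
row_totals = {"yes": 200, "no": 300}        # attend college / not
col_totals = {"male": 200, "female": 300}   # group variable
grand = 500

# Expected count for each cell under independence.
expected = {(g, c): col_totals[g] * row_totals[c] / grand
            for g in col_totals for c in row_totals}

# With 40% attending overall, the expected "yes" count is 40% of each
# gender group: 80 of 200 males, 120 of 300 females.
```

The chi-square statistic then compares these expected counts with the observed counts cell by cell.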
• Probabilities for the test statistic can be obtained from the chi-square probability
distribution so that we can test hypotheses.
• The test variable is also known as the dependent variable because its value is believed to be
dependent on the value of the group variable.
• The null hypothesis is that the two variables are independent. This will be true if
the observed counts in the sample are similar to the expected counts.
• We identify the value and probability for this test statistic from the SPSS statistical
output.
• If the probability of the test statistic is greater than the alpha error
rate, we fail to reject the null hypothesis. We conclude that there is no
relationship between the variables, i.e. they are independent.
• The residual, or the difference, between the observed frequency and the
expected frequency is a more reliable indicator, especially if the residual is
converted to a z-score and compared to a critical value equivalent to the alpha
for the problem.
• Standardized residuals that have a negative value mean that the cell was under-
represented in the actual sample, compared to the expected frequency, i.e. there
were fewer subjects in this category than we expected.
Researchers often try to identify which cell or cells are the major contributors to
the significant chi-square test by examining the pattern of column percentages.
Based on the column percentages, we would identify cells in the married row and the
widowed row as the ones producing the significant result because they show the largest
differences: 8.2% in the married row (50.9% - 42.7%) and 9.0% in the widowed row
(13.1% - 4.1%).
SW318 Social Work Statistics Slide 159
Interpreting Cell Differences in
a Chi-square Test - 3
Using a level of significance of 0.05, the critical value for a standardized residual
would be -1.96 and +1.96. Using standardized residuals, we would find that only the
cells on the widowed row are the significant contributors to the chi-square
relationship between sex and marital status.
This question asks you to use a chi-square test of independence and, if significant, to do a post
hoc test using ±1.96 as the critical value.
First of all, the level of measurement for the independent and the dependent variable can be
any level that defines groups (dichotomous, nominal, ordinal, or grouped interval). “degree of
religious fundamentalism" [fund] is ordinal and "sex" [sex] is dichotomous, so the level of
measurement requirements are satisfied.
Chi-Square Test of Independence: post hoc
test in SPSS (1)
This question asks you to use a chi-square test of independence and, if significant, to do a post
hoc test using ±1.96 as the critical value.
First of all, the level of measurement for the independent and the dependent variable can be
any level that defines groups (dichotomous, nominal, ordinal, or grouped interval). [empathy3]
is ordinal and [sex] is dichotomous, so the level of measurement requirements are satisfied.
Now, you can examine the post hoc test using the
given critical value.
Chi-Square Test of Independence: post hoc
test in SPSS (15)
The residual is the difference between the
actual frequency and the expected frequency
(58-79.2=-21.2).
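Converting the residual above to a standardized residual (one common definition: residual divided by the square root of the expected frequency) shows how the ±1.96 critical value is applied:

```python
from math import sqrt

# From the example above: observed 58, expected 79.2.
observed, expected = 58, 79.2

residual = observed - expected            # -21.2
std_residual = residual / sqrt(expected)  # about -2.38

# |std_residual| > 1.96, so this cell is a significant contributor
# at alpha = .05; the negative sign means it is under-represented.
contributes = abs(std_residual) > 1.96
```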
[Decision flowchart for the chi-square test of independence: check whether any expected cell counts are less than 5 (if so, the statistic is incorrectly applied); identify the cell in the crosstabs table that contains the specific relationship in the problem; then decide whether the relationship is correctly described (true/false).]