STATISTICAL COMPUTING I
REVIEWER:
Shewayiref Geremew (M.Sc., Biostatistics), Staff Member of the Statistics Department
May 2018
Preface
SPSS, standing for Statistical Package for the Social Sciences, is a powerful, user-
friendly software package for the manipulation and statistical analysis of data. The
package is particularly useful for students and researchers in statistics, psychology,
sociology, psychiatry, and other behavioral sciences, containing as it does an extensive
range of both univariate and multivariate procedures much used in these disciplines.
Each chapter ends with a number of exercises, some relating to the data sets introduced
in the chapter and others introducing further data sets. Working through these exercises
will develop your SPSS, Minitab, and statistical skills.
Table of Contents
1. Introduction to SPSS
1.1. What is a statistical package?
1.2. Starting SPSS
1.3. Overview of SPSS for Windows
1.4. The menus and their use
1.5. Entering and saving data in SPSS
1.6. Data Importing From Microsoft Excel and ASCII files
2. Modifying and Organizing Data
2.1. Retrieving data
2.2. Inserting cases and variables
2.3. Deleting cases or variables
2.4. Transforming Variables with the Compute Command
2.5. Transforming Variables with the Recode Command
2.5.1. Banding Values
2.6. Keeping and dropping of cases
2.7. Collapsing and transposing Data
2.8. Listing Cases
3. Descriptive Statistics Using SPSS
3.1. Summarizing Data
3.1.1. Producing Frequency distributions
3.1.2. Descriptive Statistics
3.1.3. Cross Tabulation
3.1.4. Diagrams and graphs
4. Customizing SPSS Outputs and Reporting
4.1. Customizing SPSS outputs
4.1.1. Modifying Tables
4.1.2. Exporting Tables in SPSS
4.1.3. Modifying scatter plots
4.1.4. Modifying and Exporting Graphs
5. Introduction to Minitab
5.1. How to start and exit Minitab
5.2. Minitab windows: worksheet, session and project
5.2.1. Worksheet Window
5.2.2. Session Window
5.2.3. Minitab Project
5.2.4. Moving between windows
5.2.5. Understanding the interface
5.3. The menus and their use
5.4. Type data
5.5. Entering and saving data
5.5.1. Entering the Data
5.5.2. Saving Minitab data
5.6. Importing and Exporting data
5.6.1. Importing Data from Excel
5.6.2. Opening a text file
5.6.3. Export data
6. Descriptive Statistics Using Minitab
7. Statistical Analysis Using Minitab and SPSS
7.1. Inferential statistics Using Minitab
7.2. Inferential Statistics using SPSS
7.3. Regression and Correlation
7.3.1. Correlation Analysis in SPSS
7.3.2. Linear Regression
7.3.3. Regression Diagnostics using SPSS
7.3.3.1. Unusual and Influential data
7.3.3.3. Collinearity
7.3.3.4. Tests on Nonlinearity
7.3.3.5. Model Specification
7.3.3.6. Issues of Independence
7.3.3.7. Summary
REFERENCES
1. Introduction to SPSS
1.2. Starting SPSS
To start SPSS, from the Windows Start menu choose:
All Programs
SPSS for Windows
A small window will appear. This window has several choices, with the following question
and options.
What would you like to do?
• Run tutorial
• Type in Data
• Run an existing query
• Create new query using an existing data base
• Open an existing data source
If you choose Type in Data, the Data Editor window will open.
Data View window
Click on Data View tab at the bottom of the screen to open the “Data view” window. The
window is simply a grid with rows and columns which display the content of a data file.
Missing: Use this to specify user-defined missing values, for example to distinguish data
missing because a respondent refused to answer from data missing because the question did
not apply to the respondent.
Columns: Use this to adjust the width of the Data Editor columns. Note that if the
actual width of a value is wider than the column, asterisks are displayed in the Data
View.
Align: To change the alignment of the values in the column (left, right or centre).
Measure: You can specify the level of measurement of the variable as scale, ordinal
or nominal.
The SPSS variable naming convention requires the following:
Variable names should be eight characters or fewer (the latest versions of SPSS accept
variable names longer than eight characters).
Variable names should not begin with a numeral or any special character such as a
comma or an inequality symbol.
Each variable name in a given file must be unique; duplication is not allowed.
Variable names should not end with an underscore or a period.
Variable names are not case sensitive.
Variable names cannot contain spaces.
Note: To find the rules for naming variables, press Help, Topics, Index; then enter the
phrase variable names: rules and press the Display button.
The Output Viewer
The Output Viewer opens automatically the first time you run a procedure that generates
output, whether you execute an analysis or create a graph using a dialog box or command
syntax. All statistical results, tables, and charts are displayed in the Viewer. You can edit the
output and save it for later use.
The Output Viewer is divided into two panes. The right-hand pane contains statistical tables,
charts, and text output. The left-hand pane contains a tree structure similar to those used in
Windows Explorer, which provides an outline view of the contents.
Output that is displayed in pivot tables can be modified in many ways with the Pivot
Table Editor. You can edit text, swap data in rows and columns, add color, create
multidimensional tables, and selectively hide and show results.
Syntax Editor
A text editor where you compose SPSS commands and submit them to the SPSS
processor. All output from these commands appears in the Output Viewer.
Chart Editor
You can modify high-resolution charts and plots in chart windows. You can change the
colors, select different type fonts or sizes, switch the horizontal and vertical axes, rotate
3-D scatter plots, and even change the chart type.
1.4. The menus and their use
Each window in SPSS has its own menus. The common menus are:
File (new, open, save, save as, etc.)
Edit (undo, redo, cut, copy, insert cases/variables, etc.)
View (value labels, etc.)
Analyze (descriptive statistics, tables, compare means, correlate, regression, etc.)
Graphs (bar, pie, scatter plot, histogram, etc.)
Window (split, minimize the window, etc.)
Help (topics, tutorial, etc.)
Data Editor Menus
The menu bar provides easy access to most SPSS features. It consists of ten drop-down
menus: File, Edit, View, Data, Transform, Analyze, Graphs, Utilities, Window, and Help.
1.5. Entering and saving data in SPSS
Entering the Data
You may also use the Up, Down, Left, and Right arrow keys to enter values and move
to another cell for data input
To edit existing data points (i.e., change a specific data value), click in the cell, type in
the new value, and press the Tab, Enter, Up, Down, Right, or Left arrow keys
In Data View, you enter your data just as you would in a spreadsheet program. You can move
from cell to cell with the arrow keys on your keyboard or by clicking on the cell with the
mouse.
Once one case (row) is complete, begin entering another case at the beginning of the
next row.
You can delete a row of data by clicking on the row number at the far left and pushing
the delete key on your keyboard.
In a similar fashion, you delete a variable (column) by clicking on the variable name
so that the entire column is highlighted and pushing the delete key.
In the steps that follow, we will see how to type in data by defining different variable types.
Click the Variable View tab at the bottom of the Data Editor window. Define the variables
that are going to be used. In our case, let us consider three variables: namely age, marital
status, and income.
In the first row of the first column, type age.
In the second row, type marital.
In the third row, type income.
New variables are automatically given a numeric data type. If you don't enter variable names,
unique names are automatically created. However, these names are not descriptive and are not
recommended for large data files.
Adding a Variable Label: Click the Variable View tab at the bottom of the Data Editor
window. In the Label column of the age row, type Respondent's Age. In the Label column of
the marital row, type Marital Status. In the Label column of the income row, type Household
Income. In the Label column of the sex row, type Gender.
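The same labels can also be assigned with command syntax in a Syntax Editor window. A minimal sketch, assuming the variable names defined above:
* Assign descriptive labels to the variables.
VARIABLE LABELS age "Respondent's Age"
  /marital 'Marital Status'
  /income 'Household Income'
  /sex 'Gender'.
Running these lines is equivalent to typing the labels into the Label column of Variable View.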
The Type column displays the current data type for each variable. The most common are
numeric and string, but many other formats are supported.
In the current data file, the income variable is defined as a numeric type.
Click the Type cell for the income row, and then click the button to open the Variable Type
dialog box.
Select Dollar in the Variable Type dialog box. The formatting options for the currently
selected data type are displayed. Select the format of this currency. For this example, select
$###,###,###.
Click OK to save your changes.
Value labels provide a method for mapping your variable values to a string label. In the case
of this example, there are two acceptable values for the marital variable. A value of “0”
means that the subject is single and a value of “1” means that he or she is married.
Click the values cell for the marital row, and then click the button to open the Value
Labels dialog box.
The value is the actual numeric value.
The value label is the string label applied to the specified numeric value.
Type “0” in the value field.
Type “Single” in the Value Label field.
Click Add to add this label to the list.
Repeat the process, this time typing 1 in the value field and Married in the Value Label field.
Click Add, and then click OK to save your changes and return to the Data Editor.
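The same mapping can be written as command syntax; a small sketch using the codes above:
* Attach string labels to the numeric codes of marital.
VALUE LABELS marital 0 'Single' 1 'Married'.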
These labels can also be displayed in Data View, which can help to make your data more
readable.
Click the Data View tab at the bottom of the Data Editor window.
From the menus choose:
View
Value Labels
The labels are now displayed in a list when you enter values in the Data Editor. This has the
benefit of suggesting a valid response and providing a more descriptive answer.
Adding Value Labels for String Variables
String variables may require value labels as well. For example, your data may use single
letters, M or F, to identify the sex of the subject.
Value labels can be used to specify that M stands for Male and F stands for Female.
Click the Variable View tab at the bottom of the Data Editor window.
Click the Values cell in the sex row, and then click the button to open the Value Labels dialog
box.
Type F in the value field, and then type Female in the Value Label field.
Click Add to add this label to your data file.
Repeat the process, this time typing M in the Value field and Male in the Value Label field.
Click Add, and then click OK to save your changes and return to the Data Editor.
Because string values are case sensitive, you should make sure that you are consistent. A
lowercase m is not the same as an uppercase M.
In a previous example, we chose to have value labels displayed rather than the actual data by
selecting Value Labels from the View menu. You can use these values for data entry.
Click the Data View tab at the bottom of the Data Editor window. In the first row, select the
cell for sex and select Male from the drop-down list.
In the second row, select the cell for sex and select Female from the drop-down list. Only
defined values are listed, which helps to ensure that the data entered are in a format that you
expect.
Handling Missing Data
Missing or invalid data are generally too common to ignore. Survey respondents may refuse
to answer certain questions, may not know the answer, or may answer in an unexpected
format.
If you don't take steps to filter or identify these data, your analysis may not provide accurate
results.
For numeric data, empty data fields or fields containing invalid entries are handled by
converting the fields to system missing, which is identifiable by a single period.
The reason a value is missing may be important to your analysis. For example, you may find it
useful to distinguish between those who refused to answer a question and those who didn't
answer a question because it was not applicable.
Click the Variable View tab at the bottom of the Data Editor window. Click the Missing cell
in the age row, and then click the button to open the Missing Values dialog box. In this
dialog box, you can specify up to three distinct missing values, or a range of values plus one
additional discrete value.
Select Discrete missing values. Type 999 in the first text box and leave the other two
empty.
Click OK to save your changes and return to the Data Editor. Now that the missing data value
has been added, a label can be applied to that value. Click the Values cell in the age row, and
then click the button to open the Value Labels dialog box.
Type 999 in the Value field. Type No Response in the Value Label field. Click Add to add
this label to your data file. Click OK to save your changes and return to the Data Editor.
Missing values for string variables are handled similarly to those for numeric values.
Unlike numeric values, empty fields in string variables are not designated as system missing.
Rather, they are interpreted as an empty string. Click the Variable View tab at the bottom of
the Data Editor window.
Click the Missing cell in the sex row, and then click the button to open the Missing
Values dialog box. Select Discrete missing values. Type NR in the first text box.
Missing values for string variables are case sensitive. So, a value of "nr" is not treated as a
missing value.
Click OK to save your changes and return to the Data Editor. Now you can add a label for the
missing value. Click the Values cell in the sex row, and then click the button to open the
Value Labels dialog box. Type NR in the Value field. Type “No Response” in the Value
Label field. Click Add to add this label to your project. Click OK to save your changes and
return to the Data Editor.
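Both missing-value definitions can also be written as command syntax; a minimal sketch, assuming the variables above:
* Declare user-missing values for a numeric and a string variable.
MISSING VALUES age (999).
MISSING VALUES sex ('NR').
* Label the numeric missing code.
ADD VALUE LABELS age 999 'No Response'.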
Once you've defined variable attributes for a variable, you can copy these attributes and
apply them to other variables.
In Variable View, type agewed in the first cell of the first empty row. In the Label column,
type Age Married. Click the Values cell in the age row.
From the menus choose:
Edit
Copy
Click the Values cell in the agewed row
From the menus choose:
Edit
Paste
The defined values from the age variable are now applied to the agewed variable. To apply the
attribute to multiple variables, simply select multiple target cells (click and drag down the
column).
When you paste the attribute, it is applied to all of the selected cells. New variables are
automatically created if you paste the values into empty rows.
You can also copy all of the attributes from one variable to another. Click the row number in
the marital row.
From the menus choose:
Edit
Copy
Click the row number of the first empty row.
From the menus choose:
Edit
Paste
All of the attributes of the marital variable are applied to the new variable.
For categorical (nominal, ordinal) data, Define Variable Properties can help you define
value labels and other variable properties. Define Variable Properties:
Scans the actual data values and lists all unique data values for each selected variable.
Identifies unlabeled values and provides an "auto-label" feature.
Provides the ability to copy defined value labels from another variable to the selected
variable or from the selected variable to multiple additional variables.
This example uses the data file demo.sav. This data file already has defined value labels; so
before we start, let's enter a value for which there is no defined value label:
In Data View of the Data Editor, click the first data cell for the variable ownpc (you may have
to scroll to the right) and enter the value 99.
From the menus choose:
Data
Define Variable Properties...
In the initial Define Variable Properties dialog box, you select the nominal or ordinal
variables for which you want to define value labels and/or other properties.
Since Define Variable Properties relies on actual values in the data file to help you make good
choices, it needs to read the data file first. This can take some time if your data file contains a
very large number of cases, so this dialog box also allows you to limit the number of cases to
read, or scan.
Limiting the number of cases is not necessary for our sample data file. Even though it contains
over 6,000 cases, it doesn't take very long to scan that many cases.
Drag and drop Owns computer [ownpc] through Owns VCR [ownvcr] into the Variables to
Scan list.
You might notice that the measurement level icons for all of the selected variables indicate
that they are scale variables, not categorical variables. By default, all numeric variables are
assigned the scale measurement level, even if the numeric values are actually just codes that
represent categories.
All of the selected variables in this example are really categorical variables that use the
numeric values 0 and 1 to stand for No and Yes, respectively--and one of the variable
properties that we'll change with Define Variable Properties is the measurement level.
Click Continue
In the Scanned Variable List, select ownpc. The current level of measurement for the selected
variable is scale. You can change the measurement level by selecting one from the drop-down
list or you can let Define Variable Properties suggest a measurement level.
Click Suggest
Since the variable doesn't have very many different values and all of the scanned cases contain
integer values, the proper measurement level is probably ordinal or nominal.
Select Ordinal and then click Continue.
The measurement level for the selected variable is now ordinal.
The Value Labels Grid displays all of the unique data values for the selected variable, any
defined value labels for these values, and the number of times (count) each value occurs in the
scanned cases.
The value that we entered, 99, is displayed in the grid. The count is only 1 because we
changed the value for only one case, and the Label column is empty because we haven't
defined a value label for 99 yet.
An X in the first column of the Scanned Variable List also indicates that the selected variable
has at least one observed value without a defined value label.
In the Label column for the value of 99, enter No answer.
Then click (check) the box in the Missing column. This identifies the value 99 as user
missing. Data values specified as user missing are flagged for special treatment and are
excluded from most calculations.
Before we complete the job of modifying the variable properties for ownpc, let's apply the
same measurement level, value labels, and missing values definitions to the other variables in
the list. In the Copy Properties group, click To Other Variables.
In the Apply Labels and Level to dialog box, select all of the variables in the list, and then
click Copy. If you select any other variable in the list in the Define Variable Properties main
dialog box now, you'll see that they are all now ordinal variables, with a value of 99 defined
as user missing and a value label of No answer. Click OK to save all of the variable properties
that you have defined. By doing so, we copied the property of the ownpc variable to the other
five selected variables.
Exercise-1: The following small data set consists of four variables, namely Agecat, Gender,
Accid and Pop.
Where: Agecat is a categorical variable created for age.
1 = 'Under 21', 2 = '21-25', and 3 = '26-30'
Gender: 0 = 'Male' and 1 = 'Female'
Accid and Pop are numeric.
After defining these variables in a data editor window, enter the following data for the
variables Agecat, Gender, Accid and Pop respectively. Your data should appear as given
below. Save the data set as trial1.sav.
1 1 57997 198522
2 1 57113 203200
3 1 54123 200744
1 0 63936 187791
2 0 64835 195714
3 0 66804 208239
Exercise-2: Create a data set called Trial2.sav from the following data. The data set has the
following variables:
I. Subject: numeric, width = 2, right aligned and columns = 8
II. Anxiety: numeric, width = 2, right aligned and columns = 8
III. Tension: numeric, width = 2, right aligned and columns = 8
IV. Score: numeric, width = 2, right aligned and columns = 8
V. Trial: numeric, width = 2, right aligned and columns = 8
In addition, there are no value labels for any of the above variables. After completing the
definition of the above variables, type the following data into your Data Editor window so
that your data appear as given below.
1 1 1 18 1
1 1 1 14 2
1 1 1 12 3
1 1 1 6 4
2 1 1 19 1
2 1 1 12 2
2 1 1 8 3
2 1 1 4 4
3 1 1 14 1
3 1 1 10 2
3 1 1 6 3
3 1 1 2 4
4 1 2 16 1
4 1 2 12 2
4 1 2 10 3
4 1 2 4 4
5 1 2 12 1
5 1 2 8 2
5 1 2 6 3
5 1 2 2 4
6 1 2 18 1
6 1 2 10 2
6 1 2 5 3
6 1 2 1 4
7 2 1 16 1
7 2 1 10 2
7 2 1 8 3
7 2 1 4 4
8 2 1 18 1
8 2 1 8 2
8 2 1 4 3
8 2 1 1 4
9 2 1 16 1
9 2 1 12 2
9 2 1 6 3
9 2 1 2 4
10 2 2 19 1
10 2 2 16 2
10 2 2 10 3
10 2 2 8 4
11 2 2 16 1
11 2 2 14 2
11 2 2 10 3
11 2 2 9 4
12 2 2 16 1
12 2 2 12 2
12 2 2 8 3
Exercise-3: Given below is an example of a questionnaire. Suppose you have information from
several such questionnaires. Prepare a data entry format that will help you enter your
data into SPSS.
Examples of questionnaire Design
Name ____________________________________________________________
Age ______________ Sex ________________________________
City __________________________________________________________________
Marital Status □ Married □ Single
Family Type □ Joint □ Nuclear
Family Members □ Adults □ Children
Family Income □ less than 10,000 □ 10, 000 to 15,000
□ 15,000-20,000 □ More than 20, 000
Date:______________
Place:______________
1. What kind of food do you normally eat at home?
□ North Indian □ South Indian □ Chinese □ Continental
2. How frequently do you eat out?
In a week □ once □ Twice □ Thrice □ More than thrice
3. You usually go out with:
□ Family □ Friends □ Colleagues □ Others _______________
4. Is there any specific day when you go out?
□ Weekdays □ Weekends □ Holidays □Special occasions
□ No specific days
5. You generally go out for
□ Lunch □ Snacks □ Dinner □ Party/Picnics
6. Where do you usually go?
□ Restaurant □ Chinese Joint □Fast food joint □others __________
7. Who decides on the place to go?
□ Husband □ Wife □ Children □ Others ______________
8. How much do you spend on eating out (one time)?
□ Below 200 □ 200-500 □ 500-800 □ More than 800
9. What do you normally order?
□ Pizza □ Burgers □ Curries and Breads □ Pasta
10. The price paid by you for the above is
10.1 Pizza: □ Very high □ A little bit high □ Just right
10.2 Burgers: □ Very high □ A little bit high □ Just right
10.3 Curries and Breads: □ Very high □ A little bit high □ Just right
10.4 Soups: □ Very high □ A little bit high □ Just right
10.5 Pasta: □ Very high □ A little bit high □ Just right
1.6. Data Importing From Microsoft Excel and ASCII files
Data can be directly entered in SPSS (as seen above), or a file containing data can be opened
in the Data Editor. From the menu in the Data Editor window, choose the following menu
options.
File
Open...
If the file you want to open is not an SPSS data file, you can often use the Open menu
item to import that file directly into the Data Editor.
If a data file is not in a format that SPSS recognizes, then try using the software
package in which the file was originally created to translate it into a format that can
be imported into SPSS.
Importing Data from Excel Files
Data can be imported into SPSS from Microsoft Excel with relative ease. If you are working
with a spreadsheet in another software package, you may want to save your data as an Excel
file, then import it into SPSS.
To open an Excel file, select the following menu options from the menu in the Data Editor
window in SPSS.
File
Open...
First, select the desired location on disk using the Look in option. Next, select Excel from the
Files of type drop-down menu. The file you saved should now appear in the main box in the
Open File dialog box. You can open it by double-clicking on it. You will see one more dialog
box which appears as follows.
This dialog box allows you to select a spreadsheet from within the Excel Workbook.
The drop-down menu in the example shown above offers two sheets from which to choose.
As SPSS only operates on one spreadsheet at a time, you can only select one sheet from this
menu.
This box also gives you the option of reading variable names from the Excel Workbook
directly into SPSS.
Click on the Read variable names box to read in the first row of your spreadsheet as the
variable names.
If the first row of your spreadsheet does indeed contain the names of your variables and you
want to import them into SPSS, these variable names should conform to SPSS variable
naming conventions (eight characters or fewer, not beginning with any special characters).
You should now see data in the Data Editor window. Check to make sure that all variables
and cases were read correctly. Next, save your dataset in SPSS format by choosing the Save
option in the File menu.
Example: Import an Excel data set called book1.xls into the SPSS Data Editor window from the
desktop.
The procedure is as follows:
File
Open... Data
After you select Data, you will see a window with the header "Open File". In the same
window, select the desktop using the Look in option.
Then select Excel (*.xls) from the file type drop-down menu. Another small window will then
appear. In this window you may see that there is only one worksheet. Now, if the first row of
the book1.xls data set contains variable names, select the option "Read variable names
from the first row of the data"; SPSS will then consider the elements of the first row
as variable names. If the first row of book1.xls does not contain variable names, leave the option
unselected, and SPSS will treat the elements of the first row as data values.
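The same import can also be performed with the GET DATA command; a sketch with a placeholder file path and sheet name (adjust both to your own machine):
* Read an Excel file, taking variable names from the first row.
GET DATA /TYPE=XLS
  /FILE='C:\Documents and Settings\user\Desktop\book1.xls'
  /SHEET=NAME 'Sheet1'
  /READNAMES=ON.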
Importing data from ASCII files
Data are often stored in an ASCII file format, alternatively known as a text or flat file format.
Typically, columns of data in an ASCII file are separated by a space, tab, comma, or some
other character. To import text files to SPSS we have two wizards to consider:
Read Text Data: If you know that your data file is an ASCII file, then you can open
the data file by opening the Read Text Data Wizard from the File menu. The Text
Import Wizard will first prompt you to select a file to import. After you have selected a file,
you will go through a series (about six steps) of dialog boxes that will provide you with
several options for importing data.
Once we are through with importing the data, we need to check it for accuracy. It is also
necessary to save a copy of the dataset in SPSS format by selecting the Save or Save As
options from the File menu.
Open Data: The second option for reading an ASCII file into SPSS is the File, Open, Data
option.
File
Open... Data
After you select Data, you will see a dialogue box with the header "Open File". In the same
window, select the desktop using the Look in option.
Then select Text (*.txt) from the file type drop-down menu. Select the file and click on the Open
button. A series of dialog boxes will follow.
Exercise: Suppose there is a text file named mychap1 on the desktop under the subdirectory
training. Import this file to SPSS. Also name the first variable as X and the second as Y.
2. Modifying and organizing data
To insert a case, select the row in which the case is to be added by clicking on the row's
number. Clicking on either the row's number or the column's name will result in that row or
column being highlighted. Next, use the insert options available in the Data menu in the
Data Editor:
Data
Insert Variable
Insert case
If a row has been selected, choose Insert Case from the Data menu; if a column has been
selected, choose Insert Variable. This will produce an empty row or column in the
highlighted area of the Data Editor. The existing cases and variables will be shifted down and
to the right, respectively.
In the Compute Variable dialog box, the new variable created is area; this is specified under
Target Variable. This target variable is the product of the two existing variables height and width.
Another example may be a dataset that contains employees' salaries in terms of their
beginning and current salaries. Our interest is in the difference between starting salary and
present salary. A new variable could be computed by subtracting the starting salary from the
present salary. See the dialogue box below:
Transform
Compute...
In other situations, you may also want to transform an existing variable. For example, if data
were entered as months of experience and you wanted to analyze data in terms of years on the
job, then you could re-compute that variable to represent experience on the job in numbers of
years by dividing number of months on the job by 12.
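In syntax, each of these computations is a single COMPUTE statement; a minimal sketch, where salary, salbegin and months are assumed variable names for the examples above:
* Create new variables from arithmetic on existing ones.
COMPUTE area = height * width.
COMPUTE saldiff = salary - salbegin.
COMPUTE years = months / 12.
EXECUTE.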
NOTE: In dialog boxes that are used for mathematical or statistical operations, only those
variables that you defined as numeric will be displayed. String variables will not be displayed
in the variable lists.
Now the variable height_b is the new variable that will be obtained after recoding. The value
label for the new variable is “Height variable recoded”.
Select OLD AND NEW VALUES. This box presents several recoding options. You
identify one value or a range of values from the old variable and indicate how these
values will be coded in the new variable.
After identifying one value category or range, enter the value for the new variable in
the New Value box. In our example, the old values might be 0 through 10, and the
new value might be 1 (the value label for 1 would be "short", for 2 "medium", for 3
"tall").
Click ADD and repeat the process until each value of the new variable is properly
defined.
(See the figure below: Recode: Old and New Values.)
Caution: You also have the option of recoding a variable into the same name. If you did this
in the height example, the working data file would change all height data to the three
categories (a value of 1 for "short", 2 for "medium", or 3 for "tall"). If you save this file with
the same name, you will lose all of the original height data. The best way to avoid this is to
always use the recode option that creates a different variable. Saving the data file keeps
the original height data intact while adding the new categorized variable to the data set for
future use.
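The equivalent command syntax makes the "different variable" option explicit; a sketch with assumed cut-points (0-10 short, 11-20 medium, above 20 tall):
* Recode height into a new variable, leaving the original intact.
RECODE height (0 THRU 10=1) (11 THRU 20=2) (21 THRU HI=3) INTO height_b.
VARIABLE LABELS height_b 'Height variable recoded'.
VALUE LABELS height_b 1 'short' 2 'medium' 3 'tall'.
EXECUTE.
Because the INTO keyword writes to height_b, the original height values are preserved.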
Using the IF statement in the Data Editor
The IF statement is an option to use within the Compute or Recode commands. You can
choose to only recode values if one of your variables satisfies a condition of your choice.
This condition, which is captured by means of the "IF" command, can be simple (such
as "if area=15"). To create more sophisticated conditions, you can employ logical
transformations using AND, OR, NOT. The procedure is as given below.
In the Compute and Recode dialog boxes, click on the IF button.
The Include If Case Satisfies Condition dialog pops up (see the figure below).
Select the variable of interest and click the arrow button.
Use the key pad provided in the dialog box or type in the appropriate completion of
the IF statement.
When the IF statement is complete, click CONTINUE.
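In syntax, the same conditional logic is an IF statement; a small sketch using the area example, where flag is a hypothetical target variable:
* Set flag only for cases that meet the condition.
COMPUTE flag = 0.
IF (area = 15) flag = 1.
* Conditions can be combined with AND, OR and NOT.
IF (height > 10 AND width > 5) flag = 2.
EXECUTE.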
2.5.1. Banding Values
Banding is taking two or more contiguous values and grouping them into the same category.
The data you start with may not always be organized in the most useful manner for your
analysis or reporting needs. For example, you may want to:
Create a categorical variable from a scale variable.
Combine several response categories into a single category.
Create a new variable that is the computed difference between two existing variables.
Calculate the length of time between two dates.
Once again we use the data file demo.sav.
Several categorical variables in the data file demo.sav are, in fact, derived from scale variables
in that data file. For example, the variable inccat is simply income grouped into four
categories.
This categorical variable uses the integer values 1–4 to represent the following income
categories: less than 25, 25–49, 50–74, and 75 or higher.
To create the categorical variable inccat:
From the menus in the Data Editor window choose:
Transform
Visual Bander...
In the initial Visual Bander dialog box, you select the scale and/or ordinal variables for which
you want to create new, banded variables. Banding is taking two or more contiguous values
and grouping them into the same category.
Since the Visual Bander relies on actual values in the data file to help you make good banding
choices, it needs to read the data file first. Since this can take some time if your data file
contains a large number of cases, this initial dialog box also allows you to limit the number of
cases to read ("scan").
This is not necessary for our sample data file. Even though it contains more than 6,000 cases,
it does not take long to scan that number of cases.
Drag and drop Household income in thousands [income] from the Variables list into the
Variables to Band list, and then click Continue.
In the main Visual Bander dialog box, select Household income [in thousands] in the Scanned
Variable List.
A histogram displays the distribution of the selected variable (which in this case is highly
skewed). Enter inccat2 for the new banded variable name and Income category (in thousands)
for the variable label.
Click Make Cut points.
Select Equal Width Intervals.
Enter 25 for the first cut-point location, 3 for the number of cut-points, and 25 for the width.
The number of banded categories is one greater than the number of cut-points. So, in this
example, the new banded variable will have four categories, with the first three categories
each containing ranges of 25 (thousand) and the last one containing all values above the
highest cut-point value of 75 (thousand).
Click Apply.
The values now displayed in the grid represent the defined cut-points, which are the upper
endpoints of each category. Vertical lines in the histogram also indicate the locations of the
cut-points. By default, these cut-point values are included in the corresponding categories. For
example, the first value of 25 would include all values less than or equal to 25. But in this
example, we want categories that correspond to less than 25, 25–49, 50–74, and 75 or higher.
In the Upper Endpoints group, select Excluded (<).
Then click Make Labels.
This automatically generates descriptive value labels for each category. Since the actual
values assigned to the new banded variable are simply sequential integers starting with 1, the
value labels can be very useful.
You can also manually enter or change cut-points and labels in the grid, change cut-point
locations by dragging and dropping the cut-point lines in the histogram, and delete cut-points
by dragging cut-point lines off of the histogram. Click OK to create the new, banded
variable.
The new variable is displayed in the Data Editor. Since the variable is added to the end of the
file, it is displayed in the far right column in Data View and in the last row in Variable View.
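The banding produced by the Visual Bander can also be expressed as a RECODE; a sketch for the income example. The ranges are listed from highest to lowest because the first matching range wins, which reproduces the "Excluded (<)" upper endpoints chosen above:
* Band income (in thousands) into four categories.
RECODE income (75 THRU HI=4) (50 THRU HI=3) (25 THRU HI=2) (LO THRU HI=1) INTO inccat2.
VARIABLE LABELS inccat2 'Income category (in thousands)'.
VALUE LABELS inccat2 1 '<25' 2 '25-49' 3 '50-74' 4 '75+'.
EXECUTE.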
Sorting Cases
Sorting cases allows you to organize rows of data in ascending or descending order on the
basis of one or more variables. For instance, consider once again the Employee data set.
Suppose we are interested in sorting the data based on the variable "Jobcat", which refers to the
category of employment. The procedure for sorting is as follows:
Data
Sort Cases...
A small dialog box with the header Sort Cases will pop up. This dialogue box has a few options.
If you choose the ascending option in the dialogue box and click OK, your data will be sorted by
Jobcat. All of the cases coded as job category 1 appear first in the dataset, followed by all of
the cases that are labeled 2 and 3, respectively.
The data could also be sorted by more than one variable. For example, within job category,
cases could be listed in order of their salary. Again we can choose
Data
Sort Cases...
In the small dialogue box, select the variable jobcat followed by salary. The dialogue
box comes into view as follows.
To choose whether the data are sorted in ascending or descending order, select the appropriate
button. Let us choose ascending so that the data are sorted in ascending order of magnitude
with respect to the values of the selected variables. The hierarchy of such a sorting is
determined by the order in which variables are entered in the Sort by box. Data are sorted by
the first variable entered, and then sorting will take place by the next variable within that first
variable. In our case, jobcat was the first variable entered, followed by salary; the data would
first be sorted by job category, and then, within each of the job categories, data would be
sorted by salary.
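The equivalent syntax is a single command; a sketch for the two-level sort just described:
* Sort by job category, then by salary within category (A = ascending).
SORT CASES BY jobcat (A) salary (A).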
Merging Files:
We can merge files in two different ways. The first option is "add variables" and the
second is "add cases".
Add variables: The Add Variables adds new variables on the basis of variables that are
common to both files. In this case, we need to have two data files. Each case in the one file
corresponds to one case in the other file. In both files each case has an identifier, and the
identifiers match across cases. We want to match up records by identifiers. First, we must sort
the records in each file by the identifier. This can be done by clicking Data, Sort Cases, and
then selecting the identifier into the “Sort by” box, OK.
Example: Given below, we have a file containing dads and a file containing faminc.
We would like to merge the files together so that we have the dads observation on the same line
with the faminc observation, based on the key variable famid. The procedure to merge the two
files is as follows:
First sort both data sets by famid.
Retrieve the dads data set into data editor window.
Select
Data, Merge Files…, Add Variables, and select the file faminc.
dads
famid name inc
2 Art 22000
1 Bill 30000
3 Paul 25000
faminc
famid faminc96 faminc97 faminc98
3 75000 76000 77000
1 40000 40500 41000
2 45000 45400 45800
After merging the dads and faminc, the data would look like the following.
famid name inc faminc96 faminc97 faminc98
1 Bill 30000 40000 40500 41000
2 Art 22000 45000 45400 45800
3 Paul 25000 75000 76000 77000
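In syntax, a one-to-one merge is a MATCH FILES command; a sketch, assuming both files have been saved as dads.sav and faminc.sav (hypothetical names) and both are already sorted by famid:
* Merge two files side by side on the key variable.
MATCH FILES /FILE='dads.sav'
  /FILE='faminc.sav'
  /BY famid.
EXECUTE.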
Add variables (one to many)
The next example considers a one to many merge where one observation in one file may have
multiple matching records in another file. Imagine that we had a file with dads like we saw in
the previous example, and we had a file with kids where a dad could have more than one kid.
It is clear why this is called a one to many merge since we are matching one dad observation
to one or more (many) kids observations. Remember that the dads file is the file with one
observation, and the kids file is the one with many observations. Below, we create the data
file for the dads and for the kids.
Dads data set
Famid Name Inc
2 Art 22000
1 Bill 30000
3 Paul 25000
Kids data set
Famid Kid’s name birth age wt sex
1 Beth 1 9 60 f
1 Bob 2 6 40 m
1 Barb 3 3 20 f
2 Andy 1 8 80 m
2 Al 2 6 50 m
2 Ann 3 2 20 f
3 Pete 1 6 60 m
3 Pam 2 4 40 f
3 Phil 3 2 20 m
To merge the two data sets, we follow the steps indicated below.
SORT the data set dads by famid and save that file and call it dads2
SORT the data set kids by famid and save that file as kids2
Retrieve the data set kids2 to data editor window.
Select data …merge files… add variables.
From the dialogue box select the file dads2
Another dialogue box will appear. In this dialogue box we select the option “match
cases on key variables in sorted files”.
Select external file is keyed table and choose famid as key variable
Click Ok.
The Data Editor window will appear as given below.
FAMID KIDNAME BIRTH AGE WT SEX NAME INC
1.00 Beth 1.00 9.00 60.00 f Bill 30000.00
1.00 Bob 2.00 6.00 40.00 m Bill 30000.00
1.00 Barb 3.00 3.00 20.00 f Bill 30000.00
2.00 Andy 1.00 8.00 80.00 m Art 22000.00
2.00 Al 2.00 6.00 50.00 m Art 22000.00
2.00 Ann 3.00 2.00 20.00 f Art 22000.00
3.00 Pete 1.00 6.00 60.00 m Paul 25000.00
3.00 Pam 2.00 4.00 40.00 f Paul 25000.00
3.00 Phil 3.00 2.00 20.00 m Paul 25000.00
We can also retrieve the data set dads2 to data editor window and perform steps 4 to 6 for the
file kids2. This time you select working file is keyed table and choose famid as key variable.
The data editor window will appear as given below.
FAMID NAME INC KIDNAME BIRTH AGE WT SEX
1 Bill 30000 Beth 1 9 60 f
1 Bill 30000 Bob 2 6 40 m
1 Bill 30000 Barb 3 3 20 f
2 Art 22000 Andy 1 8 80 m
2 Art 22000 Al 2 6 50 m
2 Art 22000 Ann 3 2 20 f
3 Paul 25000 Pete 1 6 60 m
3 Paul 25000 Pam 2 4 40 f
3 Paul 25000 Phil 3 2 20 m
Here the correct choice of keyed table can give us correct results.
The key difference between a one-to-one merge and a one-to-many merge is that you need to
correctly identify the keyed table. That means we have to identify which file plays the role of
"one" (in one to many). That file should be chosen as the keyed table. In the above example,
the keyed table is dads2, not kids2.
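In syntax, the keyed table is marked with the /TABLE subcommand; a sketch for the merge above, assuming the sorted files were saved as kids2.sav and dads2.sav:
* One-to-many merge: dads2 is the keyed table (the "one" side).
MATCH FILES /FILE='kids2.sav'
  /TABLE='dads2.sav'
  /BY famid.
EXECUTE.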
Merging files (add cases option)
The Add Cases option combines two files with different cases that have the same variables.
To merge files with this option, we follow the procedure below:
Data
Merge Files... Add Cases
All variables should be listed under the small window “new working data file”. Click
Ok to complete merging.
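In syntax, this is the ADD FILES command; a sketch that appends an external file (hypothetical name file2.sav) to the active dataset, where FILE=* denotes the working file:
* Stack the cases of two files that share the same variables.
ADD FILES /FILE=*
  /FILE='file2.sav'.
EXECUTE.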
Keeping and dropping cases: Cases can be kept or dropped with the Data, Select Cases...
menu option. The portion of the dialog box labeled "Unselected Cases Are" gives us the
option of temporarily or permanently removing data from the dataset.
If the "Filtered" option is selected, the unselected cases will be excluded from subsequent
analyses until the "All Cases" option is reset.
If the “Deleted” option is selected, the unselected cases will be removed from the
working dataset. If the dataset is subsequently saved, these cases will be permanently
deleted.
Selecting one of these options will produce a second dialog box that prompts us for the
particular specification in which we are interested. For example, if we choose the "If condition is
satisfied" option and click on the If button, a second dialog box will appear as
shown below.
The above example selects all of the cases in the dataset that meet a specific criterion:
employees that have worked at the company for greater than six years (72 months) will be
selected. After this selection has been made, subsequent analyses will use only this subset of
the data. If you have chosen the Filter option in the previous dialog box, SPSS will indicate
the inactive cases in the Data Editor by placing a slash over the row number. To select the
entire dataset again, return to the Select Cases dialog box and select the All Cases option.
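Both behaviours can be reproduced in syntax; a sketch, assuming the tenure variable is named jobtime and is measured in months:
* Temporary filter: unselected cases are kept but excluded from analyses.
USE ALL.
COMPUTE filter_$ = (jobtime > 72).
FILTER BY filter_$.
EXECUTE.
* Permanent deletion: unselected cases are removed from the working file.
SELECT IF (jobtime > 72).
EXECUTE.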
Collapsing (aggregating) data: For example, if you wanted to summarize employees by the
amount of their education, you could collapse all of the variables you want to analyze into
rows defined by the number of years of education. To access the dialog boxes for aggregating
data, follow these steps:
Select Data and then AGGREGATE
We will observe a dialogue box. This dialogue box has several options. These are as
follows.
Break variable: The top box, labeled Break Variable(s), contains the variable within which
other variables are summarized. This is something like a classification variable.
Aggregated Variables: contains the variables that will be collapsed.
Number of cases: This option allows us to save the number of cases that were collapsed at
each level of the break variable.
Save: This has three different options. I) Add the aggregated variables to working data file II)
Create new data file containing aggregated variables. III) Replace working data with
aggregated variables only. We may choose one of the above three options depending on our
interest.
Options for very large data sets: This has two options
File is already sorted on break variable(s)
Sort file before aggregating.
Example: Suppose we have a file containing information about the kids in three families.
There is one record per kid. Birth is the order of birth (i.e., 1 is first); age, wt and sex are the
child's age, weight and sex, respectively. This data is saved as the file kid3.sav in the directory
desktop:\training r. We will use this file to show how to collapse data across
observations. If we consider the aggregate command under the data menu we can collapse
across all of the observations and make a single record with the average age of the kids. To do
so we need to create a break variable const=1 using the compute command.
To collapse the above data, we follow these steps:
Select Data and then AGGREGATE. In the observed dialogue box, select const as
break variable.
Choose “age” for summaries of variables
Choose add aggregated variables to working data file
The “age_mean” variable will be added to our working data. This is the mean age of all 9
children.
If we follow all of the above steps and change the last option to “Create new data file
containing aggregated variables only”, we will have the following output saved as aggr.sav.
CONST AVGAGE N_Break
1.00 5.11 9
If we use "famid" as the break variable, the aggregate option will give the average age of the
kids in each family. The following output will be obtained.
FAMID AGE1
1.00 6.00
2.00 5.33
3.00 4.00
We can request averages for more than one variable. For instance, if we want to aggregate
both age and weight by famid, we can follow these steps:
Select Data and then AGGREGATE. In the dialogue box that appears, select famid as the
break variable.
Choose “age” and “wt” for summaries of variables
Choose “Create new data file containing aggregated variables only”.
The following output will be produced. The variable N_Break is the count of the number of
kids in each family.
Famid Age_mean Wt_mean N_Break
1 6.00 40.00 3
2 5.33 50.00 3
3 4.00 40.00 3
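These dialog-box choices correspond to a single AGGREGATE command; a sketch that writes a new aggregated file (the output file name is a placeholder):
* One output row per family, with mean age, mean weight and a case count.
AGGREGATE /OUTFILE='aggr2.sav'
  /BREAK=famid
  /age_mean=MEAN(age)
  /wt_mean=MEAN(wt)
  /n_break=N.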
We can create a variable "girls" that counts the number of girls in the family, and "boys" that can
help us count the number of boys in the family. You can also add a label after the new
variable name. If you save the output in SPSS, you can see the labels in SPSS data editor after
clicking on the "variable view" tab in the lower left corner of the editor window.
To have summary information which shows the number of boys and girls per family, we will
follow the procedure below. We create two dummy variables: dumgirl for girls and
dumboy for boys. The sum of dumgirl is the number of girls in the family. The sum of
dumboy is the number of boys in the family.
I) We recode sex into dumgirl=1 if sex=f and dumgirl=0 if sex=m.
II) We recode sex into dumboy=1 if sex=m and dumboy=0 if sex=f.
III) We select Data … Aggregate option. At this step a dialogue box will appear. In this
dialogue box, we select Break-variable = famid,
IV) Select dumgirl and dumboy for aggregated variables.
V) Below the aggregated variables we have two options 1. Function 2. Name and label.
After selecting one of the variables to be aggregated choose the ‘function’ option. A new
dialogue box will pop-up.
VI) From this new dialogue box, we select the function ‘sum’ and click continue for both
variables.
VII) Again, below the aggregated variables, select the option Name and Label. Change the
name "dumgirl_sum" to girl and "dumboy_sum" to boy. You can also label boy as "number of
boys" and girl as "number of girls".
VIII) Now click on the Number of cases box and change the name N_Break to "NumKids".
IX) Finally, we have to choose the save option. If we choose the option "Create new data file
containing aggregated variables only", SPSS will save the file in the directory of our
choice.
For instance, if we save our file in the directory desktop\training r, our file will be saved as an
SPSS file. Our results look like the following output.
FamId Boys Girls NumKids
1 1.00 2.00 3
2 2.00 1.00 3
3 2.00 1.00 3
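Steps I) to IX) can be collapsed into a few syntax commands; a sketch (the output file name is a placeholder):
* Dummy-code sex, then sum the dummies within each family.
RECODE sex ('f'=1) ('m'=0) INTO dumgirl.
RECODE sex ('m'=1) ('f'=0) INTO dumboy.
AGGREGATE /OUTFILE='famsum.sav'
  /BREAK=famid
  /girls 'Number of girls'=SUM(dumgirl)
  /boys 'Number of boys'=SUM(dumboy)
  /numkids=N.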
Restructure Data: We use the Restructure Data Wizard to restructure our data.
In the first dialog box, we select the type of restructuring that we want to do. Suppose we have
data that are arranged in groups of related columns. Our interest is to restructure these data
into groups of rows in the new data file. Then we choose the option Restructure selected
variables into cases.
Example: Consider a small data set consisting of three variables as given below.
V1 V2 V3
8 63 82
9 62 87
10 64 89
12 66 85
15 67 86
The objective is then to restructure the above data into groups of rows in the new data file. In
other words we want to convert the above data into one variable that has all the values of the
three variables and one factor variable that indicates the group. This procedure is known as the
restructuring of variables to cases. The procedure is as follows.
From the Data menu select Restructure; the dialogue box which says "Welcome to the
Restructure Data Wizard" will appear.
Choose the first option "Restructure selected variables into cases" and click Next.
Another dialogue box which says "Variables to Cases: Number of Variable Groups" will
appear. Choose the first option "One" and click Next.
Give the name of the target variable; call it "all_inone".
Select all three variables (V1, V2 and V3) into the Variables to be Transposed box and click Next.
Another dialogue box which says "Variables to Cases: Create Index Variables" will appear.
Choose the first option "One" and click Next.
Another new dialogue box will appear; here, change the variable name "Index" to group.
Click Finish and see your restructured data. The data may appear as shown below.
Id Group All_inone
1 1 8
1 2 63
1 3 82
2 1 9
2 2 62
2 3 87
3 1 10
3 2 64
3 3 89
4 1 12
4 2 66
4 3 85
5 1 15
5 2 67
5 3 86
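The wizard steps above correspond to the VARSTOCASES command; a minimal sketch, where the id variable is computed first so the original row position is retained:
* Stack V1-V3 into one variable, with an index identifying the source column.
COMPUTE id = $CASENUM.
VARSTOCASES /MAKE all_inone FROM V1 V2 V3
  /INDEX=group.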
The variable Id stands for the row position of the data before the data was restructured. We
can also restructure the data from cases to variables.
For instance, consider the following small data set on age of Nurses and Doctors.
For the variable group, 1 stands for nurses and 2 stands for doctors.
Id Age Group
1 23 1
2 25 1
3 26 1
4 35 1
5 42 1
6 22 1
1 60 2
2 36 2
3 29 2
4 56 2
5 32 2
6 54 2
The objective is to restructure the above age data into a data set having two separate variables
for Nurses and Doctors. To do so, we follow the procedure below:
From the Data menu we select Restructure.
From the dialogue box we select "Restructure selected cases into variables".
We select Id for the Identifier variable.
We select group for the Index variable, click Next, and respond to the dialogue box that
will appear.
When you observe the dialogue box which says "Cases to Variables: Options",
select Group by Index and click Next.
Click Finish.
Our data will be restructured as given below.
Id Age.1 Age.2
1 23 60
2 25 36
3 26 29
4 35 56
5 42 32
6 22 54
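The same restructuring is available as the CASESTOVARS command; a minimal sketch (the data must be sorted by the identifier first):
* Spread age across new variables, one per value of group.
SORT CASES BY id group.
CASESTOVARS /ID=id
  /INDEX=group
  /GROUPBY=INDEX.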
Transpose all data. We choose this when we want to transpose our data. All rows will
become columns and all columns will become rows in the new data. The procedure is as
follows:
From the data menu we select restructure
From the dialogue box we select “ Transpose all data ” and click finish
Transpose dialogue box will appear. We have to select all variables to transpose. (Note
un-selected variables will be lost.) Click Ok.
The transposed data, with rows changed to columns and columns changed to rows, will appear.
Example: Consider the following data set.
Id Age group
1 23 1
2 25 1
3 26 1
4 35 1
5 42 1
6 22 1
7 60 2
8 36 2
9 29 2
10 56 2
11 32 2
12 54 2
Applying the above procedure, the transposed form of this data is as given below.
Case_lbl V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12
Id 1 2 3 4 5 6 7 8 9 10 11 12
age 23 25 26 35 42 22 60 36 29 56 32 54
group 1 1 1 1 1 1 2 2 2 2 2 2
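Equivalently, the transposition can be requested in syntax with the FLIP command; a minimal sketch for this data set:
FLIP VARIABLES=Id Age group.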
the Output Viewer. The procedure for doing this cannot be performed using dialog boxes and
is available only through command syntax. The syntax for generating a list of cases is shown
in the Syntax Editor window below. The variable names shown in lower case below instruct
SPSS which variables to list in the output. Or, you can type in the command ALL in place of
variables names, which will produce listing of all of the variables in the file. The sub-
command /CASES FROM 1 TO 10, is an instruction to SPSS to print only the first ten cases.
If this instruction were omitted, all cases would be listed in the output.
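A minimal sketch of the syntax just described, using the example variables gender and minority:
LIST VARIABLES=gender minority
  /CASES FROM 1 TO 10.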
To execute this command, first highlight the selection by pressing on your mouse button while
dragging the arrow across the command or commands that you want to execute. Next, click on
the icon with the black, right-facing arrow on it. Or, you can choose a selection from the Run
menu.
Executing the command will print the list of variables, gender and minority in the above
example, to the Output Viewer. The Output Viewer is the window in which all output will be
printed. The Output Viewer is shown below, containing the text that would be generated from
the above syntax.
3. DESCRIPTIVE STATISTICS USING SPSS
From the previous section, we have seen the following about the Output Viewer:
The results from running a statistical procedure are displayed in the Viewer.
The output produced can be statistical tables, charts or graphs, or text, depending on
the choices you make when you run the procedure.
The viewer window is divided into two panes.
The outline pane (left side): contains an outline of all of the information stored in the
Viewer.
The contents pane (right hand side): contains statistical tables, charts, and text output.
The icons in the outline pane can have two forms, namely:
The open book icon: indicates that the corresponding item is currently visible in the Viewer.
The closed book icon: indicates that the corresponding item is not currently visible in the Viewer.
3.1.2. Descriptive Statistics
The Descriptives procedure produces a Descriptive Statistics table with summary statistics for continuous, numeric variables.
From the menu bar choose:
Analyze
Descriptive Statistics
Descriptives...
The Descriptives option, available from the Analyze and Descriptive Statistics menus, will produce the following dialog box:
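For reference, the same table can be requested through syntax; a minimal sketch, with salary and age standing in for whatever numeric variables your file contains:
DESCRIPTIVES VARIABLES=salary age
  /STATISTICS=MEAN STDDEV MIN MAX.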
3.1.3. Cross Tabulation
The options available by selecting the Statistics and Cells buttons of the Crosstabs dialog box provide you with several additional output features.
Selecting the Cells button will produce a menu that allows you to add additional values to your table.
Three-way tables (adding a layer variable):
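A hedged syntax sketch of such a three-way table (gender, minority and jobcat are placeholder variables; the variable after the second BY acts as the layer):
CROSSTABS
  /TABLES=gender BY minority BY jobcat
  /CELLS=COUNT ROW
  /STATISTICS=CHISQ.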
Displaying Tables:
Much of the output in SPSS is displayed in a pivot table format.
To create tables, from the menu bar select
Analyze
Tables
Custom Tables...
Then simply drag and drop variables where we want them to appear in the table.
Step: Analyze > Tables > Custom Tables > drag categorical variables > OK
Summary Statistics:
Right-click on variable category on the canvas pane and select Summary Statistics
from the pop-up context menu.
In the Summary Statistics dialog box, select Row N % in the Statistics list and click
the arrow button to add it to the Display list.
Both the counts and row percentages will be displayed in the table.
Click Apply to Selection to save these settings and return to the table builder.
To insert totals and subtotals, click Categories and Totals in the Define section.
Then click OK.
For scale variables we can display summary statistics (mean, median, …) in the cells of the table.
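For reference, custom tables can also be requested through the CTABLES command (part of the Custom Tables option); a rough sketch, with jobcat and gender as placeholder categorical variables:
CTABLES
  /TABLE jobcat [COUNT ROWPCT.COUNT] BY gender
  /CATEGORIES VARIABLES=jobcat gender TOTAL=YES.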
Stacking Variables:
Taking separate tables and pasting them together into the same display.
To Stack Variables:
In the variable list, select all of the variables you want to stack, then
drag and drop them together into the rows or columns of the canvas
pane. Or
Drag and drop variables separately, dropping each variable either
above or below existing variables in the rows or to the right or left
of existing variables in the columns.
3.1.4. Diagrams and graphs
A. Bar Chart
Bar charts are a common way to graphically display data that represent the frequency of each level of a variable.
Graphs
Bar...
To get started with the bar graph, click on the icon representing the type of graph that
you want, then click on the Define button to produce the following dialog box
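A minimal syntax sketch of a simple count bar chart (jobcat is a placeholder categorical variable):
GRAPH
  /BAR(SIMPLE)=COUNT BY jobcat.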
B. Pie Chart
Used to present a categorical variable.
From the menu bar choose
Graphs
Pie...
C. Histograms
Making histograms is one of the best ways to check your data for normality.
D. Scatter Plots
Scatter plots give you a tool for visualizing the relationship between two or more
variables
Scatter plots are especially useful when you are examining the relationship between
continuous variables using statistical techniques such as correlation or regression.
Scatter plots are also often used to evaluate the bivariate relationships in regression
analyses.
Useful in the early stage of analysis, when exploring data and determining whether a linear regression analysis is appropriate
May show outliers in your data
Example: Performance and Self-confidence
To obtain a scatter plot in SPSS
Graphs
Scatter...
Simple Scatter Plot
The Simple scatter plot graphs the relationship between two variables
When you select the Simple option from the initial dialog box, you will get the
following dialog box:
We can also have SPSS draw different colored markers for each group by entering a
group variable in the Set Markers by box.
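A minimal syntax sketch of such a plot (x, y and group are placeholder variable names; BY supplies the Set Markers by variable):
GRAPH
  /SCATTERPLOT(BIVAR)=x WITH y BY group.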
The Matrix Scatter Plot:
Every combination is plotted twice so that each variable appears on both the X and Y
axis.
Consider a matrix scatter plot with three variables, salary, salbegin, and jobtime; you would receive the following scatterplot matrix:
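A minimal syntax sketch of that matrix scatter plot:
GRAPH
  /SCATTERPLOT(MATRIX)=salary salbegin jobtime.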
Exercise:
1. Let us consider a small data set given below.
x 400 675 475 350 425 600 550 325 675 450
y 1.8 3.8 2.8 1.7 2.8 3.1 2.6 1.9 3.2 2.3
After entering these data into SPSS, plot the scatter plot. What type of relationship do you observe between x and y? Is an increase in x followed by an increase in y?
2. Produce a scatter plot for the following data and discuss the results.
x 400 675 475 350 425 600 550 325 675 450
y -1.8 -3.8 -2.8 -1.7 -2.8 -3.1 -2.6 -1.9 -3.2 -2.3
4. Customizing SPSS outputs and reporting
Much of the output in SPSS is displayed in a pivot table format. While these pivot tables are
professional quality in their appearance, you may wish to alter their appearance or export
them to another application. There are several ways in which you can modify tables. In this
section, we will discuss how you can alter text, modify a table's appearance, and export
information in tables to other applications.
To edit the text in any SPSS output table, you should first double-click that table. This will outline the table with dashed lines, as shown in the figure below, indicating that it is ready to be edited. Some of the most commonly used editing techniques are changing the width of rows and columns, altering text, and moving text. Each of these topics is discussed below:
Changing column width and altering text. To change column widths, move the
mouse arrow above the lines defining the columns until the arrow changes to a double-
headed arrow facing left and right. When you see this new arrow, press down on your
left mouse button, then drag the line until the column is the width you want, then
release your mouse button.
Editing text. First double-click on the cell you wish to edit, then place your cursor on
that cell and modify or replace the existing text. For example, in the frequency table
shown below, the table was double-clicked to activate it, and then the pivot table's title
was double-clicked to activate the title. The original title, "Employment Category,"
was modified by adding the additional text, "as of August 1999."
Using basic editing commands, such as cut, copy, delete, and paste. When you cut
and copy rows, columns, or a combination of rows and columns by using the Edit
menu's options, the cell structure is preserved and these values can easily be pasted
into a spreadsheet or table in another application.
Aside from changing the text in a table, you may also wish to change the appearance of the
table itself. But first, it is best to have an understanding of the SPSS Table Look concept. A
Table Look is a file that contains all of the information about the formatting and appearance
of a table, including fonts, the width and height of rows and columns, colouring, etc. There are
several predefined Table Looks that can be viewed by first right-clicking on an active table,
then selecting the Table Looks menu item. Doing so will produce the following dialog box:
You can browse the available Table Looks by clicking on the file names in the Table Look
Files box, as shown above. This will show you a preview of the Table Look in the Sample
box.
While the Table Looks dialog box provides an easy way to change the look of your table, you
may wish to have more control of the look or create your own Table Look. To modify an
existing table, right-click on an active pivot table, then select the Table Properties menu item.
This will produce the following dialog box:
The above figure shows the Table Properties dialog box with the Cell Formats tab selected.
You can alternate between tabs (e.g., General, Footnotes, etc.) by clicking on the tab at the
upper left of the dialog box. While a complete description of the options available in the Table
Properties dialog box is beyond the scope of this document, there are a few key concepts that
are worth mentioning. Note the Area box at the upper right of the dialog box. This refers to
the portion of the box that is being modified by the options on the left side of the box. For
example, the color in the Background of the Data portion of the table was changed to black
and the color of the text was changed to white by first choosing Data from the Area box, then
selecting black from the Background drop-down menu and selecting white for the text by
clicking on the colour palette icon in the Text area on the left side of the dialog box.
The Printing tab also has some useful options. For example, the default option for three-
dimensional tables containing several layers is that only the visible layer will be printed. One
of the options under the Printing tab allows you to request that all layers be printed as
individual tables. Another useful Printing option is the Rescale wide/long tables to fit page,
which will shrink a table that is larger than a page so that it will fit on a single page.
Any modifications to a specific table can be saved as a Table Look. By saving a Table Look,
you will be saving all of the layout properties of that table and can thus apply that look to
other tables in the future. To save a Table Look, click on the General tab in the Table
Properties dialog box. There are three buttons on the bottom right of this box. Use the Save
Look button to save a Table Look. That button will produce a standard Save As dialog box
with which you can save the Table Look you created.
In addition to modifying a table's appearance, you may also wish to export that table. There
are three primary ways to export tables in SPSS. To get a menu that contains the available
options for exporting tables, right-click on the table you wish to export. The three options for
exporting tables are: Copy, Copy object, and Export.
The Copy option copies the text and preserves the rows and columns of your table but does
not copy formatting, such as colours and borders. This is a good option if you want to modify
the table in another application. When you select this option, the table will be copied into your
system clipboard. Then, to paste the table, select the Paste command from the Edit menu in
the application to which you are importing the table. The Copy option is useful if you plan to
format your table in the new application; the disadvantage of this method is that only the text
and table formatting remains and you will therefore lose much of the formatting that you
observe in the Output Viewer.
The Copy object method will copy the table exactly as it appears in the SPSS Output Viewer.
When you select this option, the table will be copied into your clipboard and can be imported
into another application by selecting the Paste option from the Edit menu of that application.
When you paste the table using this option, it will appear exactly as it is in the Output Viewer.
The disadvantage of this method is that it can be more difficult to change the appearance of
the table once it has been imported.
The third method, Export, allows you to save the table as an HTML or an ASCII file. The
result is similar to the Copy command: you will have a table that retains the text and cell
layout of the table you exported, but it will retain little formatting. This method for exporting
tables to other applications is different from the above two methods in that it creates a file
containing the table rather than placing a copy in the system clipboard. When you select this
method, you will immediately be presented with a dialog box allowing you to choose the
format of the file you are saving and its location on disk. The primary advantage of this
method is that you can immediately create an HTML file that can be viewed in a Web
browser.
Some of the most useful options that will add information to your scatterplot are the
Fit Line options.
The Fit Line option will allow you to plot a regression line over your scatter plot.
Click on the Fit Options button to get this dialog box:
The primary tool for modifying charts in SPSS is the Chart Editor. The Chart Editor will open
in a new window, displaying a chart from your Output Viewer. The Chart Editor has several
tools for changing the appearance of your charts or even the type of chart that you are using.
To open the Chart Editor, double-click on an existing chart and the Chart Editor window will
open automatically. The Chart Editor shown below contains a bar graph of employment
categories:
While there are many useful features in the Chart Editor, we will concentrate on three of them: changing the type of chart, modifying text in the chart, and modifying the graphs.
You can change the type of chart that you are using to display your data using the Chart
Editor. For example, if you want to compare how your data would look when displayed as a
bar graph and as a pie chart, you can do this from the Gallery menu:
Gallery
Pie...
Selecting this option will change the above bar graph into the following pie chart:
Once you have selected your graphical look, you can start modifying the appearance of your
graph. One aspect of the chart that you may want to alter is the text, including the titles,
footnotes, and value labels. Many of these options are available from the Chart menu. For example, the Title option could be selected from the Chart menu to alter the chart's title:
Chart
Title...
Selecting this menu item will produce the following dialog box:
The title "Employment Categories" was entered in the box above and the default justification
was changed from left to center in the Title Justification box. Clicking OK here would cause
this title to appear at the top center of the above pie chart. Other text in the chart, such as
footnotes, legends, and annotations, can be altered similarly. The labels for the individual
slices of the pies can also be modified, although it may not be obvious from the menu items.
To alter the labels for areas of the pie, choose the Options item from the Chart menu.
Chart
Options...
In addition to providing some general options for displaying the slices, the Labels section
enables you to alter the text labelling slices of the pie chart as well as format that text. You
can click the Edit Text button to change the text for the labels. Doing so will produce the
following dialog box:
To edit individual labels, first click on the current label, which will be displayed below the
Label box, then alter the text in the Label box. When you finish, click the Continue button to
return to the Pie Options dialog box. You can make changes to the format of your labels by
clicking the Format button here. If you do not want to change formatting, click on OK to
return to the Chart Editor.
In addition to altering the text in your chart, you may also want to change the appearance of
the graph with which you are working. Options for changing the appearance of graphs can be
accessed from the Format menu. Many options available from this menu are specific to a
particular type of graph. There are some general options that are worth discussing here. One such option is the Fill Pattern option, which changes the pattern of the graph. It can be obtained by selecting the Fill Pattern option from the Format menu:
Format
Fill Pattern...
This will produce the following dialog box:
First, click on the portion of the graph where you want to change the pattern, then select the
pattern you want by clicking on the pattern sample on the left side of the dialog box. Then,
click the Apply button to change the appearance of your graph.
One other formatting option that is generally useful is the ability to change the colors of your
graphs. To do that, select the Color option from the Format menu:
Format
Color...
This will allow you to change the color of a portion of a graph and its border. First, select the portion of the graph whose color you would like to change; then select the Fill option if you want to change the color of the portion itself, or the Border option if you want to change the color of its border. Next, click on the color that you want and click Apply. Repeat this process for each area or border in the graph that you want to change.
Interactive Charts
Many of the standard graphs available through SPSS are also available as interactive charts.
Interactive charts offer more flexibility than standard SPSS graphics: you can add variables to
an existing chart, add features to the charts, and change the summary statistics used in the
chart. To obtain a list of the available interactive charts, select the Interactive option from the
Graphs menu:
Graphs
Interactive
Selecting one of the available options will produce a dialog for designing an interactive graph.
For example, if you selected the Boxplot option from the menu, you would get this dialog
box:
Dialog boxes for interactive charts have many of the same features as other SPSS dialog
boxes. For example, in the above dialog box, the variable type is represented by icons: scale
variables, such as the variable bdate, are represented by the icon that resembles a ruler, while
categorical variables, such as the variable educ, are represented by the icon that resembles a
set of blocks. Variables in the variable list on the left of the dialog box can be moved into the
boxes on the right side of the screen by dragging them with your mouse, in contrast to using
the arrow button used in other SPSS dialog boxes. Options in interactive graphs can be
accessed by clicking on the tabs. For example, clicking on the Boxes tab produces the
following dialog box:
Here, you have several choices about the look of your boxplot. The choice to display the
median line is selected here, but the options to indicate outliers and extremes are not selected.
The Titles and Options tabs offer several other choices for altering the look of your graph as well, although a thorough discussion of these is beyond the scope of this document. When you
have finished the specifications for a graph, click the OK button to produce the graph you
have specified in the Output Viewer.
Interactive graphs offer several choices for altering the look of the chart after you have a draft
in the Output Viewer. To get the menus for doing that, double-click on the interactive graph
that you want to alter. For example, double-clicking on the boxplot obtained through the
above dialog box will produce the following menus:
The icons immediately surrounding the graph provide you with several possibilities for
altering the look of your graph. The three leftmost items in the horizontal menu are worthy of
mention. The leftmost icon produces a dialog box that resembles the original Interactive
Graphs dialog box and contains many of the same options. For example, you could change the
variables that you are graphing using this dialog box. The next icon, the small bar graph, lets
you add additional graphical information. For example, you could overlay a line that graphs the means of the three groups in the above graph by choosing the Dot-Line option from the menu, or you could add circles representing individuals' salaries within each group by
choosing the Cloud option. The third icon provides several options for changing the look of
your chart. Selecting that icon will produce the following dialog box:
Each icon in this dialog box can be double-clicked to produce a dialog box that contains the
properties of the component of the chart represented by that icon. For example, you could
obtain the properties of the boxes in the interactive graph above by double-clicking on the
icon labelled Box. Doing so would produce this dialog box:
Changing the properties in this or any other dialog box that controls the properties of any portion of the chart will change the look of the graph in the Output Viewer. For example, you could change the colors of the boxes and their outlines by selecting a different color in the corresponding dialog box.
5. INTRODUCTION TO MINITAB
Minitab is statistical analysis software that allows you to conduct analyses of data easily. It is one of the suggested software packages for the class. It is commonly used to enter, organize, present and analyze data on a given variable. It can be used for learning about statistics as well as for undertaking statistical research. Its applications have the advantage of being accurate, reliable and generally faster than computing statistics and drawing graphs by hand. This guide is intended to walk you through the basics of Minitab and help you get started with it.
C. LOADING DATA IN MINITAB
Minitab files are organized as “projects”. Each project will contain all the data you use and the
commands and analysis you perform on the data.
You can open a new, empty worksheet at any time. In this empty worksheet you can copy,
paste and type the data you need by simply working on the worksheet as you would on any
spreadsheet.
D. Opening an existing Worksheet (Minitab type file)
Within a project you can open one or more files that contain data. When you open a file, you
copy the contents of the file into the current Minitab project. Any changes you make to the
worksheet while in the project will not affect the original file.
To open a Minitab type file
1. Choose FILE -> OPEN WORKSHEET.
2. Look for the file you want to open. It should be a .MTW or .MPJ type file. Select the file and click Open.
3. If you get a message box indicating that the content of the file will be added to the current project, check "Do not display this message again", and then click OK.
Minitab has a large number of built-in routines that allow you to do most basic data analysis. Commands can also be typed into the Session window, either to replicate the built-in routines or to create a more tailored data analysis.
The "MTB >" prompt should be visible in the Session window.
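As a brief, hedged illustration of session commands (the column name 'scores' and the data values are invented for this sketch):
MTB > SET C1
DATA> 2 4 6 8 10
DATA> END
MTB > NAME C1 'scores'
MTB > DESCRIBE C1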
Window
Then choose your desired window from the given list and click on it. The Report Pad is accessible through the Project Manager.
Alternatively, each window is represented by an icon on the top bar. Clicking on the icon will take you to the window right away. In particular, note the icons for the worksheet, the session window, and the Report Pad.
[Screenshot: the main Minitab window, with its title bar, menu bar, standard toolbar, Session window, worksheet (with column names and row names), Project Manager and status bar labelled.]
5.3. The menus and their use
There are 4 areas in the screen: the Menu bar, the Toolbar, the Session window and the Worksheet window.
You can open menus and choose commands. Here you can find the built-in routines.
File -use this menu to open and save worksheets and to import data.
Edit -use this menu to cut and paste text and data across windows.
Manip - use this menu to sort and recode your data.
Calc - use this menu to create new columns.
Stat - use this menu to analyze your data. This key menu performs many useful statistical functions.
Graph -use this menu to graphically represent your data.
Editor -use this menu to edit and format your data.
Window -use this menu to change windows.
Help - this opens a standard Microsoft Help window containing information on
how to use the many features of Minitab.
This section discusses the types of data you can work with in MINITAB and the various forms those data types can take. In MINITAB you can work with 3 types of data in three forms: columns, constants, or matrices. These are:
1. Numeric: It includes the digits 0, 1, …, 9 and *, but the symbol * is reserved for a missing value. A number can have a − or + sign, and it can be written in exponential notation if it is a very large or very small number, e.g. 3.2E12, which is equal to 3.2×10^12. Numbers can be stored in columns, constants or matrices. MINITAB stores and computes numbers in double precision, which means that numbers can have up to 15 or 16 digits (depending on the number) without round-off error.
2. Text: It can be of two types, either character or string. Characters are a single letter, digit (from 0 to 9), space or punctuation mark such as >, ?, <, !. Strings are a series of characters; some examples of strings are country, name, occupation, etc. The maximum number of characters that can be entered at a time is 80. Text can be stored in columns or constants but not in matrices.
3. Date/Time: You can write a date (such as Jan-1-1997 or 03/01/2011), a time (such as 24:23), or both (such as 24/11/2002; 10:30AM).
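A brief, hedged session-window illustration of the column and constant forms (the values are invented; K1 is a stored constant):
MTB > SET C1
DATA> 5 12 7 30
DATA> END
MTB > LET K1 = 3.2E12
MTB > PRINT C1 K1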
There are two main ways to enter data into the Minitab worksheet:
1. Typing in the values (of a given variable) one by one and clicking <enter> after each entry.
2. Opening an existing Minitab worksheet or Minitab project.
FILE + OPEN WORKSHEET or FILE + OPEN PROJECT
Then select the file from appropriate drive/folder
Minitab files have a yellow icon with MTB written on them.
Note: (i) Data sets from the textbook are available on the CD-ROM attached to the book. They are organized by chapter. The files may need to be unzipped.
(ii) A new spreadsheet is created each time you open a data file. If you want to merge the spreadsheet into the current project instead, you need to use FILE + MERGE WORKSHEET.
(iii) The "open file" icon defaults to Minitab project files only. To open a spreadsheet only, you need to use FILE + OPEN WORKSHEET.
Entering Data into a Worksheet
There are various methods for entering data into a worksheet. The simplest approach is
to use the Data window to enter data directly into the worksheet by clicking your mouse
in a cell and then typing the corresponding data entry and hitting Enter. Remember that
you can make a Data window active by clicking anywhere in the window or by using
Windows in the menu bar.
If you type any character that is not a number, Minitab automatically identifies the
column containing that cell as a text variable and indicates that by appending T to the
column name, e.g., C5-T in Display I.4. You do not need to append the T when
referring to the column. Also, there is a data direction arrow in the upper left corner of
the data window that indicates the direction the cursor moves after you hit Enter.
Clicking on it alternates between row-wise and column wise data entry. Certainly, this
is an easy way to enter data when it is suitable.
Remember, columns are variables and rows are observations! Also, you can have
multiple data windows open and move data between them. Use the command to open a
new worksheet.
Save Current Worksheet to save the worksheet with its current name, or the default
name if it doesn’t have one.
The Save in box at the top contains the name of the folder in which the worksheet will
be saved once you click on the Save button. Here the folder is called data, and you can
navigate to a new folder using the Up One Level button immediately to the right of this
box. The next button takes you to the Desktop and the third button allows you to create
a subfolder within the current folder. The box immediately below contains a list of all
files of type .mtw in the current folder.
You can select the type of file to display by clicking on the arrow in the Save as type
box, which we have done here, and click on the type of file you want to display that
appears in the drop-down list.
There are several possibilities, including saving the worksheet in other formats, such as Excel. Currently, there is only one .mtw file in the folder data and it is called marks.mtw. If you want to save the worksheet with a different name, type this name in the File name box and click on the Save button. To retrieve a worksheet, use File > Open Worksheet and fill in the dialog box as depicted in Display I.20 appropriately. The various windows and buttons in this dialog box work as described for the File > Save Current Worksheet As command, with the exception that we now type the name of the file we want to open in the File name box and click on the Open button.
To set up a connection between Minitab and Excel, we need to tell Minitab the file path
(directories, folders, etc.) to where that Excel file lives. The simplest import of an Excel
file is by using the File > Open Worksheet command in Minitab.
In the Open Worksheet dialog box, the first step is to click the “Files of Type” drop-
down list and choose “All.” This lets us see all file types in the folder. Navigate to your
Excel file and select it.
But before you click “Open,” take a look at the buttons that appear at the bottom of the
dialog box after you select the Excel file. Click “Preview” to view how Minitab is
recognizing the data in the worksheet. Then you can click “Options” to specify which
data in the worksheet you want to import.
Since Excel is a general, cell-based spreadsheet, your document may have data in any
row or column with formulas scattered in between. Minitab, as a statistical software
package, requires the data to be in column-wise format (which is why it's easy to
manipulate data with the Data menu in Minitab). Because of this difference, you want
to avoid bringing over any header or footer information from Excel. Just focus on
bringing over the raw dataset into Minitab. Use the Open Worksheet > Options box to
specify exactly which rows to import.
4. Go to the "SINGLE CHARACTER SEPARATOR" option. The data in the text file is usually separated by spaces or tabs; choose the appropriate option. If you are unsure how the data is separated, another option is to use the number of data rows: just enter the number of data rows in the "NUMBER OF DATA ROWS" box.
5. Click OK.
6. The results will appear in the worksheet window.
Note: This can sometimes be a little tricky, as you can get a file that does not have the data in the format that you want. If this happens, close the worksheet where the data is placed and try importing it again, changing some of the options in step 4. This is a trial and error procedure, so don't panic if you don't get it right on the first attempt.
Copying data to Minitab
Copying data to Minitab works like copying data to any other type of spreadsheet (e.g. Excel).
1. Copy the data you wish to use in Minitab.
2. Go to the position where you want to copy the data in the desired Minitab worksheet. If you wish to paste a cell with a header or name, make sure that you stand in the variable name cell (the cell below the column number C1, C2, etc.).
3. Go to EDIT -> PASTE CELLS to paste the data.
4. Sometimes when you copy data, Minitab reads it in the wrong format, e.g. as text when it is numeric. To solve this problem, select the problematic column(s) and go to DATA -> CHANGE DATA TYPE -> choose the desired format. The most useful format is numeric.
The following dialog box appears. Choose the variables you want to modify and where
you want to store them. The storage variables can be the same variables as the ones you
are modifying. Then hit OK.
5.6.3. Export data
To export data, you can save the Minitab worksheet as a different file type. Choose File
> Save Current Worksheet As to save the following types of files in Minitab:
To import data into Access, first, save the Minitab worksheet as an Excel file. Then,
import the Excel file into Access.
Click Save. Then import the Excel file into Access; consult the Access Help for details.
When you save your worksheet as a text file, Minitab saves date/time data in the same
format in which it is displayed in the worksheet. Thus, if dates are displayed in the
format mm/dd/yyyy, then only the date is saved and not the hidden components, such
as the time.
When you save your worksheet as a file type other than text, Minitab saves all the date/time information. For example, if dates in a column are displayed in the format mm/dd/yyyy and you save the worksheet as an Excel file, when you open that file in Excel, your spreadsheet will include both the date and time information: mm/dd/yyyy h:mm:ss.
If you use Save Current Worksheet As to save the worksheet as a text file, you cannot
specify the columns to save. You also cannot save your data in a custom format, for
example, with line breaks after certain columns. If you want to have more control over
how text files are saved, use File > Other Files > Export Special Text.
6. Descriptive statistics using Minitab
Descriptive Statistics
Displays N, N*, Mean, SE Mean, StDev, Min, Q1, Median, Q3, and Max
a) Descriptive statistics for one variable
Stat > Basic Statistics > Display Descriptive Statistics > double-click on the appropriate variable (for the Dell Data, double-click on Rates of Return so that it is displayed under Variables).
As you can see from the screen above, you are given the option to alter the output by
clicking on the buttons. If you click on the Statistics button, this screen will appear:
The checked items will be displayed in the output. To check or uncheck an item, click
in the box to the left of the word.
If you click on the Graphs button, this screen will appear:
To display any of these graphs (in addition the descriptive statistics displayed in the
session window), click in the box. (For purposes of this example, I have not clicked on
any graphs since graphs will be explained in the next section.)
To display the data, click on OK. For the Dell example, this information is displayed in
the session window:
Descriptive Statistics: Rates of Return
Variable          N  N*    Mean  SE Mean   StDev  Minimum       Q1  Median
Rates of Return  60   0  0.0907   0.0195  0.1511  -0.2175  -0.0304  0.0784

Variable             Q3  Maximum
Rates of Return  0.1931   0.4561
b) Descriptive statistics for one variable, grouped by a second variable
Stat > Basic Statistics > Display Descriptive Statistics > double-click on the appropriate variable > click in the By variables (optional) box and then double-click on the appropriate variable > OK. (For the Auction Data, double-click on Auction Price so that it is displayed under Variables. Then move the cursor into the By variables (optional) box and double-click on No. of bidders so that it is displayed under By variables (optional).)
For the Auction Data example, this information is displayed in the session window:
Note: If you see a * in the output, that indicates that the value could not be calculated.
In this example, the numerous * appear because N is not large enough in each group to
calculate all the descriptive statistics. (e.g. There is only one instance where the number
of bidders equals 5, and thus SE Mean, StDev, Q1, and Q3 could not be calculated with
only one data point)
c) Store Descriptive Statistics
This feature adds the descriptive statistics to the data worksheet instead of displaying the output in the session window:
Stat > Basic Statistics > Store Descriptive Statistics > double-click on the appropriate variable (for the Dell Data, double-click on Rates of Return so that it is displayed under Variables).
As you can see from the screen above, you are again given the option to alter the output
by clicking on the buttons. If you click on the Statistics button, this screen will appear:
d) Column Statistics
You can calculate various statistics on columns. Column statistics are displayed in the
Session window, and are optionally stored in a constant.
Calc > Column Statistics > click by the statistic you want calculated (for the Auction Data, click by Standard Deviation) > double-click on the appropriate column in the Input variable box (double-click on No. of Bidders) > OK.
This output is displayed in the session window:
Standard Deviation of No. of Bidders
Standard deviation of No. of Bidders = 2.83963
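A hedged session-command equivalent of this column statistic (K1 is an arbitrary constant chosen to store the result):
MTB > STDEV C2 K1
MTB > PRINT K1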
e) Row Statistics
You can compute one value for each row in a set of columns. The statistic is calculated
across the rows of the column(s) specified and the answers are stored in the
corresponding rows of a new column.
Calc > Row Statistics > click by the statistic you want calculated > double-click on the appropriate variable(s) in the Input variables box > type the name of the new column that will be created > OK.
Calculating row statistics does not make sense for the example data because it is not meaningful in context; thus, an example is not given here. However, in order to see which row statistics can be calculated, the screen shot is shown below.
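Still, as a hedged session-command sketch, the classic row-statistics commands follow the pattern below (RMEAN averages across C1-C3 within each row and stores the answers in C4):
MTB > RMEAN C1-C3 C4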
Graphs
a) Histogram
Using the Dell Data that is now inserted into Minitab, a histogram can be made by going to Graph > Histogram. Then this screen will appear:
Click on the appropriate graph and then click OK. (For this example, we will display the simple histogram.) Double-click on the appropriate variable (for the Dell Data, double-click on Rates of Return so that it is displayed under Graph Variables) > OK.
Note: You are able to edit the graph at this point. On the graph below, the arrows
represent where you can double-click to make changes to the graph. You can do this
type of editing on most graphs.
Let's say you wanted to edit the scale on the x-axis. By double-clicking on any of the x-axis numbers (for this Dell example, you could double-click on -0.16), this screen will then appear:
This screen shows the Scale tab. Another way to edit the scale is to click on the Binning
tab. By doing so, this screen will appear:
(The default sets the Interval Definition to Automatic. However, for this Dell example,
click by Midpoint/Cut point positions and replace the numbers given with the new
numbers shown above.)
If you click on the Show tab, this screen will appear:
If you click on the Labels tab, this screen will appear:
(The default is set to Tahoma Font, Size 10. For this example, choose Lucida
Handwriting Font, Size 12.)
If you click on the Alignment tab, this screen will appear:
As you can see, the binning, size, and font have been changed in this example. Since
we originally double-clicked on one of the x-axis numbers, we were able to make
changes regarding that aspect of the graph. Likewise, you can make changes to other
parts of the graph by double-clicking on the appropriate spot. The details for all the
other arrows (displayed on page 26) are not going to be explained here. Basically, you
can change the way the text, bars, and background are displayed.
Another way to alter graphs is to use the buttons. If we go back to our original
histogram example, after going to Graph > Histogram > OK > double-clicking on the appropriate variable, we are back to this screen:
Here you are given the option to alter the output by clicking on the buttons. If you click
on the Scale button, this screen will appear:
This screen shows the Axes and Ticks tab. If you click on the Y-Scale Type tab, this
screen will appear:
(The default is set for Percent, but for this Dell example, click by Frequency.)
If you click on the Gridlines tab, this screen will appear:
If you click on the Reference Lines tab, this screen will appear:
(There are no reference lines by default, but for this example, type 6 to show a reference line at y = 6.)
If you click on the Labels button, this screen will appear:
(The default is set for None, but click by Use y-value labels for this example.)
If you click on the Data View button, this screen will appear:
This screen shows the Data Display tab. If you click on the Distribution tab, this screen
will appear:
If you click on the Multiple Graphs button, this screen will appear:
This screen shows the Multiple Variables tab. If you click on the By Variables tab, this
screen will appear:
If you click on the Data Options button, this screen will appear:
This screen shows the Subset tab. If you click on the Group Options tab, this screen will appear:
To display the graph, click on OK. The histogram will display:
b) Dotplot
Graph > Dotplot. Then this screen will appear:
Click on the appropriate graph and then click OK. (For this example, we will display the simple dotplot.) Double-click on the appropriate variable (for the Dell Data, double-click on Rates of Return so that it is displayed under Graph Variables) > OK.
c) Boxplot
Graph > Boxplot. Click on the appropriate graph and then click OK. (For this example, we will display the simple boxplot.) Double-click on the appropriate variable (for the Dell Data, double-click on Rates of Return so that it is displayed under Graph Variables) > OK.
d) Probability Plot
Graph > Probability Plot. Click on the appropriate graph and then click OK. (For this example, we will display the single probability plot.) Double-click on the appropriate variable (for the Dell Data, double-click on Rates of Return so that it is displayed under Graph Variables) > OK.
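For reference, hedged session-command equivalents exist for these graphs, assuming C1 holds the Rates of Return data:
MTB > HISTOGRAM C1
MTB > DOTPLOT C1
MTB > BOXPLOT C1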
This Probability Plot will display:
e) Graphical Summary
Stat > Basic Statistics > Graphical Summary > double-click on the appropriate variable (for the Dell Data, double-click on Rates of Return so that it is displayed under Variables) > OK.
Note: The By variables option is used to create multiple graphical summaries based on a grouping variable, called a by variable. For an example using the Auction Data, if we use Auction Price as the Variable and No. of Bidders as the By variable, the output will display a graphical summary for every group of number of bidders. Here is one of the graphical summaries that is displayed:
Thus, only the auction prices for which the number of bidders = 9 are shown.
f) Bar Chart
i) Bars representing counts of unique values
Choose this graphical format if you have one or more columns of categorical data and
you want to chart the frequency of each category.
Graph > Bar Chart > choose Counts of unique values from the drop-down box and click OK. (For this example, we will use the Student Data and show a simple bar chart.)
As you can see from the screen above, you are given the option to alter the output by
clicking on the buttons. If you click on the Chart Options button, this screen will
appear:
If we had chosen Decreasing Y instead of Default after clicking on the Bar Chart Options button, the bars would be displayed in decreasing order:
ii) Bars representing a function of a variable
Choose this if you have one or more columns of data and you want to chart a function of the data. Quite a few of these functions are summary statistics.
Graph > Bar Chart > choose A function of a variable from the drop-down box (then, for this example, we will click on Cluster under Multiple Y's) and click OK.
Click on the appropriate variable from the drop-down box to choose a function. (Here we'll choose mean.)
Double-click on the appropriate variable in the Graph variables box and then double-click on the appropriate variable in the Categorical variables box. (For the Student Data, age was put under the graph variable and gender was put under the categorical variable.) > OK.
Although it does not provide much use in context to sum the ages of males versus
females, this example was completed to showcase the use of this function.
g) Pie Chart
i) Chart raw data
Choose when each row in a column represents a single observation. Each slice in the
pie is proportional to the number of occurrences of a value in the column.
Graph > Pie Chart > click on Chart raw data > double-click on the appropriate variable in the Categorical variables box (for the Student Data, double-click on Portfolio).
As you can see from the screen above, you are given the option to alter the output by
clicking on the buttons. If you click on the Pie Options button, this screen will appear:
ii) Chart values from a table
Choose when the category names are in one column and summary data are in another
column.
Let’s look at how to use a pie chart if our data was organized differently. (Look at
Student (2) Data)
Graph > Pie Chart > click on Chart values from a table > double-click on the appropriate variable in the Categorical variable box and double-click on the appropriate variable in the Summary variables box. (For the Student (2) Data, double-click Gender for Categorical variable and Count for Summary variables.)
7. STATISTICAL ANALYSIS USING MINITAB AND SPSS
7.1. Inferential statistics Using Minitab
Ways to Analyze Data
Analysis in Minitab can be done in two ways: using the built-in routines or using the command language in the Session window. The two can be used interchangeably.
Built-In routines
Most of the functions needed in basic and more advanced statistical analysis are found
as Minitab Built-In routines. These routines are accessed through the menu bar. To use
the menu commands, click on an item in the menu bar to open a menu, click on a menu
item to execute a command or open a submenu or dialog box.
Command Language
To be able to type commands in the Session window, you must obtain the “MTB>”
prompt. All commands are then entered after the “MTB>” prompt. All command lines
are free format, in other words, all text may be entered in upper or lower case letters
anywhere in the line.
NOTE: This guide focuses mainly on using the Built-In routines. All the explanations
and examples that follow will be done using Minitab’s Built-In routines. A brief
introduction to using Minitab commands is found in section.
INFERENTIAL STATISTICS
a. Confidence Intervals:
i. 1-Sample Z: Stat> Basic Statistics> 1-sample Z >check the alpha level in options.
ii. 1-Sample t: Stat> Basic Statistics> 1-sample t >check the alpha level in options.
b. Hypothesis Testing:
i. 1-Sample Z: Stat> Basic Statistics> 1-sample Z >check the alpha level and alternative
hypothesis in options
ii. 1-Sample t: Stat> Basic Statistics> 1-sample t>check the alpha level and alternative
hypothesis in options
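A hedged sketch of the classic session-command equivalents (here 95 is the confidence level, 4 an assumed known sigma, 5 the hypothesized mean, and C1 the data column):
MTB > ZINTERVAL 95 4 C1
MTB > TINTERVAL 95 C1
MTB > ZTEST 5 4 C1
MTB > TTEST 5 C1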
Point and interval estimation
Confidence Intervals about μ, σ Known
1. If you have raw data, enter them in column C1.
2. Select the Stat menu, highlight Basic Statistics, then click 1-Sample Z . . .
3. If you have raw data, enter C1 in the cell marked "Samples in columns:". If you have summarized data, select the Summarized data radio button and enter the summarized values. In the cell marked Standard deviation, enter the value of σ. Select Options and enter a confidence level. Click OK twice.
Confidence Intervals about μ, σ Unknown
1. If you have raw data, enter them in column C1.
2. Select the Stat menu; highlight Basic Statistics, then highlight 1-Sample t . . .
3. If you have raw data, enter C1 in the cell marked "Samples in columns". If you have summarized data, select the "Summarized data" radio button and enter the summarized data. Select Options . . . and enter a confidence level. Click OK twice.
Confidence Intervals about p
1. If you have raw data, enter the data in column C1.
2. Select the Stat menu, highlight Basic Statistics, and then highlight 1 Proportion . . .
3. Enter C1 in the cell marked "Samples in columns" if you have raw data. If you have summary statistics, click "Summarized data" and enter the number of trials, n, and the number of events (successes), x.
4. Click the Options . . . button. Enter a confidence level. Click “Use test based on a
normal distribution” (provided that the assumptions stated are satisfied). Click OK
twice.
Confidence Intervals about σ²
1. Enter raw data in column C1
2. Select the Stat menu, highlight Basic Statistics, and then highlight Graphical
Summary . . .
3. Enter C1 in the cell marked “Variables.”
4. Enter the confidence level desired. Click OK. The confidence interval for sigma is
reported in the output.
Testing of hypothesis about one population mean and proportion
Hypothesis Tests Regarding a Population Mean
1. If you have raw data, enter them in column C1.
2. Select the Stat menu, highlight Basic Statistics, and then highlight 1-Sample Z . . .
3. Click Options. In the cell marked "Alternative," select the appropriate direction for the alternative hypothesis. Click OK.
Hypothesis Tests Regarding a Population Proportion
1. If you have raw data, enter them in C1, using 0 for failure and 1 for success.
2. Select the Stat menu, highlight Basic Statistics, then highlight 1-Proportion.
3. If you have raw data, select the “Samples in columns” radio button and enter C1. If
you have summarized statistics, select “Summarized data.” Enter the number of trials
and the number of successes.
4. Click Options. Enter the value of the proportion stated in the null hypothesis. Enter the direction of the alternative hypothesis. If $np_0(1-p_0) \ge 10$, check the box marked "Use test and interval based on normal distribution." Click OK twice.
Hypothesis Tests Regarding a Population Standard Deviation
1. Enter the raw data into column C1 if necessary. Select the Stat menu, highlight
Basic Statistics, and then highlight 1 Variance.
2. Make sure the pull-down menu has “Enter standard deviation” in the window. If you
have raw data, enter C1 in the window marked “Samples in columns” and make sure
the radio button is selected. If you have summarized data, select the “Summarized data”
radio button and enter the sample size and sample standard deviation.
3. Click Options and select the direction of the alternative hypothesis. Click OK.
4. Check the “Perform hypothesis test” box and enter the value of the standard
deviation in the null hypothesis. Click OK.
Comparisons of two population means and proportions
MINITAB will calculate the test value (statistics) and p-value for difference between
the means for two populations when the population standard deviation is unknown.
Calculates the value of the Chi-square(4) density curve at each value in C1 and stores these values in C2. This is useful for plotting the density curve. The Calc > Probability Distributions > Chi-Square command, or the session commands cdf and invcdf, can also be used to obtain values of the Chi-square(k) cumulative distribution function and inverse distribution function, respectively. We use the Calc > Random Data > Chi-Square command, or the session command random, to obtain random samples from these distributions.
We will see applications of the chi-square distribution later in the book, but we mention one here. In particular, if $x_1, \ldots, x_n$ is a sample from an $N(\mu, \sigma)$ distribution, then $(n-1)s^2/\sigma^2 = \sum_{i=1}^{n}(x_i - \bar{x})^2/\sigma^2$ is known to follow a Chi-square$(n-1)$ distribution, and this fact is used as a basis for inference about $\sigma$ (confidence intervals and tests of significance). Because of the non-robustness of these inferences to small deviations from normality, these inferences are not recommended.
Correlations
While a scatter plot is a convenient graphical method for assessing whether or not there is any relationship between two variables, we would also like to assess this numerically. The correlation coefficient provides a numerical summary of the degree to which a linear relationship exists between two quantitative variables, and it can be calculated using the Stat > Basic Statistics > Correlation command. The corresponding session command is
Correlate E1 . . . Em
where E1, . . . , Em are columns corresponding to numerical variables, and a correlation coefficient is computed between each pair. This gives m(m − 1)/2 correlation coefficients. The subcommand nopvalues is available if you want to suppress the printing of p-values.
1. With the explanatory variable in C1 and the response variable in C2, select the Stat
menu and highlight Basic Statistics. Highlight Correlation.
2. Select the variables whose correlation you wish to determine and click OK.
Choose CORRELATION and obtain the following dialog box. Choose the pair
of variables to be analyzed.
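A minimal session-command example for two columns:
MTB > CORRELATION C1 C2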
Results are displayed in the Session window as presented below.
The following dialog box appears.
This is basically a calculator that allows you to do many calculations with the variables. Basic functions are found in the number pad and more sophisticated ones are found in the functions box to the right of the number pad.
To make sure that your result is not overwriting a variable, name a new variable in the "STORE RESULT IN VARIABLE" field at the top of the calculator.
a. Adding variables
1. To add variables, name the variable where you want to store the results.
2. Select the first variable, press the "+" sign and select the second variable (and so on for more than two variables). You should obtain something similar to the window below.
3. The result will then be shown in the worksheet window.
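A hedged session-command equivalent of this addition (C1 and C2 are assumed to hold the two variables; the sum is stored in C3):
MTB > LET C3 = C1 + C2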
Taking logarithms
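A hedged session-command sketch of the same operation (LOGTEN takes base-10 logarithms of C1 and stores them in C4):
MTB > LET C4 = LOGTEN(C1)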
Logical functions
Some statistical analyses will need to separate data by groups according to characteristics that are contained in the data. Logical functions are particularly useful in these cases. A simple example of how to use them is described below.
1. Choose the variable you want to apply the logical test to. Here we are looking at the "SEX" variable.
2. Choose the logical test you want to use. Here we want to see which observations have the variable "SEX" equal to 1. That is, which observations are males?
3. Make sure that you have indicated a variable in which to store your results, by typing the name of your result variable in the "STORE RESULT IN VARIABLE" box.
4. The result variable will be a binary variable (a variable of 1s and 0s) where 1 indicates the logical test is true and 0 indicates it is false. The result variable will appear in the Worksheet window.
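A hedged session-command version of this logical test (the comparison yields 1 where 'SEX' equals 1 and 0 otherwise, stored in C5):
MTB > LET C5 = ('SEX' = 1)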
Determining the Least-Squares Regression Line
Regression is another technique for assessing the strength of a linear relationship existing between two variables, and it is closely related to correlation. For this, we use the Stat > Regression command.
As noted in IPS, the regression analysis of two quantitative variables involves computing the least-squares line y = a + bx, where one variable is taken to be the response variable y and the other is taken to be the explanatory variable x.
It is very convenient to have a scatter plot of the points together with the least-squares line. This can be accomplished using the Stat > Regression > Fitted Line Plot command.
1. With the explanatory variable in C1 and the response variable in C2, select the Stat
menu and highlight Regression. Highlight Regression . . . .
2. Select the explanatory (predictor) variable and response variable and click OK.
The Coefficient of Determination
This is provided in the standard regression output.
Residual Plots
Follow the same steps as those used to obtain the regression output (Section 4.2).
Before selecting OK, click GRAPHS. In the cell that says "Residuals versus the variables," enter the name of the explanatory variable. Click OK.
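A hedged session-command sketch of the regression itself (the classic REGRESS command fits the response in C2 on 1 predictor, C1):
MTB > REGRESS C2 1 C1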
Simulation
1. Set the seed by selecting the Calc menu and highlighting Set Base . . . Insert any seed you wish into the cell and click OK.
2. Select the Calc menu, highlight Random Data, and then highlight Integer.
3. Select the Stat menu, highlight Tables, and then highlight Tally . . . Enter C1 into
the variables cell. Make sure that the Counts box is checked and click OK.
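A hedged session-command version of these steps (BASE sets the seed; RANDOM with the INTEGER subcommand draws 100 values between 1 and 6 into C1; TALLY counts them):
MTB > BASE 1234
MTB > RANDOM 100 C1;
SUBC> INTEGER 1 6.
MTB > TALLY C1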
The chi-squared (χ²) test statistic is widely used in the analysis of contingency tables.
The chi-square test examines the hypothesis that the row and column variables in a cross tabulation are independent.
It compares the actual observed frequency in each group with the expected frequency (the latter is based on theory, experience or comparison groups).
The chi-squared test (Pearson's χ²) allows us to test for association between categorical (nominal!) variables.
The null hypothesis for this test is that there is no association between the variables. Consequently, a significant p-value implies association.
After opening the Crosstabs dialog box as described in the preceding section, click the
Statistics button to get the following dialog box:
                 Variable A
                 A1     A2     Total
Variable B   B1  a      b      a+b
             B2  c      d      c+d
Total            a+c    b+d    n

Test statistic: χ²-test for a 2 × 2 contingency table

$$\chi^2 = \frac{n(ad - bc)^2}{(a+c)(b+d)(a+b)(c+d)}$$

Test statistic: χ²-test with d.f. = (r − 1) × (c − 1)

$$\chi^2 = \sum_{i,j}\frac{(O_{ij} - E_{ij})^2}{E_{ij}}$$
Test for a relationship between two categorical variables

$$\chi^2 = \sum_{i=1}^{n}\frac{(O_i - E_i)^2}{E_i}$$
                                       Frat or sorority?
                                       yes        no         Total
Ever-Depression  yes  Count            681        7692       8373
                      Expected Count   715.6      7657.4     8373.0
                 no   Count            3744       39657      43401
                      Expected Count   3709.4     39691.6    43401.0
Total                 Count            4425       47349      51774
                      Expected Count   4425.0     47349.0    51774.0
117
T-tests
The t test is a useful technique for comparing mean values of two sets of
numbers
The comparison will provide you with a statistic for evaluating whether the
difference between two means is statistically significant
T tests can be run either as an independent-samples t test or a paired-samples
t test
There are three types of t tests; the options are all located under the Analyze
menu item
Analyze
Compare Means
One-Sample T test...
Independent-Samples T test...
Paired-Samples T test...
One-Sample T Test:
Hypotheses
Ho : µ = 5
HA: µ ≠ 5
Test: Two-tailed t-test
Result: Reject Ho
One-Sample Statistics
                   N       Mean   Std. Deviation   Std. Error Mean
How many drinks    53374   4.42   4.401            .019
One-Sample Test
                   Test Value = 5
                                                      Mean         95% Confidence Interval of the Difference
                   t         df      Sig. (2-tailed)  Difference   Lower     Upper
How many drinks    -30.352   53373   .000             -.578        -.62      -.54
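The same one-sample test can be run in SPSS syntax. A minimal sketch, where drinks stands in for the “How many drinks” variable (the name is illustrative):

* One-sample t-test of Ho: mu = 5.
T-TEST
/TESTVAL=5
/VARIABLES=drinks
/CRITERIA=CI(.95).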
Example: Men and women report significantly different numbers of sexual partners
over the past 12 months
Hypotheses
Ho: µ1 = µ2
HA: µ1 ≠ µ2
Test: Independent Samples t-test OR One-way ANOVA
Result: Reject null
Group Statistics
                    Sex      N       Mean   Std. Deviation   Std. Error Mean
Partners you had    female   32687   1.34   2.017            .011
                    male     18474   1.82   3.627            .027
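A syntax sketch for this comparison, assuming the grouping variable sex takes the values 1 and 2 and partners holds the reported number of partners (names illustrative):

* Independent-samples t-test comparing the two sex groups.
T-TEST GROUPS=sex(1 2)
/VARIABLES=partners
/CRITERIA=CI(.95).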
Independent Samples Test

Group Statistics
                            gender   N     Mean    Std. Deviation   Std. Error Mean
verbal fluency - animal     female   855   15.24   5.711            .195
naming score                male     580   15.95   5.493            .228
The group statistics table gives the mean animal-naming score for males and for
females.
The t-test tells us whether the difference in mean animal-naming score between
males and females is statistically significant.
The paired-sample t test:
It compares the means of two variables that represent the same group at different
times (e.g. before and after an event) or related groups (e.g., husbands and
wives).
In the Paired-Samples T Test dialog box, move the two variables from the
list on the left into the Paired Variables box.
Note: clicking the Options button lets you adjust settings such as the confidence level.
The Paired-Samples T Test procedure compares the means of two variables for a single
group. It computes the differences between values of the two variables for each case
and tests whether the average differs from 0.
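A minimal syntax sketch for the paired test, with hypothetical variables before and after measured on the same cases:

* Paired-samples t-test on the differences (before - after).
T-TEST PAIRS=before WITH after (PAIRED)
/CRITERIA=CI(.95).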
Analysis of Variance
The One-Way ANOVA compares the means of two or more groups based on one
independent variable (or factor).
It measures differences among group means.
In SPSS it can be performed as:
Analyze
Compare Means
One-Way ANOVA
Move all dependent variables into the box labeled "Dependent List"
Move the independent variable into the box labeled "Factor"
Click on the button labeled "Options"
Check off the boxes for Descriptive and Homogeneity of Variance
Click on the box marked "Post Hoc" and choose the appropriate post hoc
comparison
The groups should have approximately equal variance on the dependent
variable; you can check this by looking at Levene's test.
If Levene's statistic is significant, we have evidence that the homogeneity
assumption has been violated.
If this is a problem, you can re-run the analysis selecting the option for "Equal
Variances Not Assumed" (an equivalent syntax sketch follows this list).
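A syntax sketch equivalent to the dialog steps above, with illustrative names score (dependent variable) and group (factor); Tukey is one common post hoc choice when equal variances are assumed:

* One-way ANOVA with descriptives, Levene's test and Tukey post hoc tests.
ONEWAY score BY group
/STATISTICS DESCRIPTIVES HOMOGENEITY
/POSTHOC=TUKEY ALPHA(0.05).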
Hypotheses:
Null: There are no significant differences between the groups' mean
scores.
Alternate: There is a significant difference between the groups'
mean scores.
Steps for one-way ANOVA
Step 1: Ho: μ1 = μ2 = … = μk
Step 2: HA: μi ≠ μj for at least one pair i ≠ j
Step 3: Computation of the test statistic
These steps can be summarized in an ANOVA table, as shown below:
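A standard layout for the one-way ANOVA table, for k groups and N observations in total:

Source of variation   Sum of Squares   df      Mean Square         F
Between groups        SSB              k − 1   MSB = SSB/(k − 1)   F = MSB/MSW
Within groups         SSW              N − k   MSW = SSW/(N − k)
Total                 SST              N − 1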
Levene's Test
If Levene's statistic is significant, you should re-run the analysis selecting the
option for "Equal Variances Not Assumed".
Multiple comparisons
If the means are significantly different (Ho rejected), we are interested in which
pairs of means differ. For this we use methods called multiple
comparisons.
Ho: each pair of treatment means is equal (μi = μj for i ≠ j)
Reject Ho if the p-value < 0.05 or zero is not included in the confidence interval.
To do this in SPSS, click the Post Hoc button and select a method according to whether
equal variances are assumed or not (see Levene's test of homogeneity of variance).
Hypotheses
Ho: µ1 = µ2 = µ3 = µ4 = µ5 = µ6
Descriptives
ANOVA
Correlation
Correlation can be used to measure the degree of association between two variables.
By selecting the Analyze > Correlate menu item, you will see that there are three
options for correlating variables:
Bivariate,
Partial, and
Distances
The bivariate correlation is for situations where you are interested only in the
relationship between two variables
Analyze
Correlate
Bivariate...
Move the variables of interest into the Variables box; a syntax sketch follows.
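A minimal syntax sketch, using illustrative employee-data variable names salbegin and salary:

* Pearson correlation between beginning and current salary.
CORRELATIONS
/VARIABLES=salbegin salary
/PRINT=TWOTAIL NOSIG.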
The partial correlation measures an association between two variables with the
effects of one or more other variables factored out
Analyze
Correlate
Partial...
Hypotheses
Ho: ρ = 0
HA: ρ ≠ 0
Correlations
Partial Correlation in SPSS
The partial correlation measures the strength of the association between two variables
while controlling for the effects of one or more other variables; for example, current and
beginning salary while controlling for the effect of previous experience.
Analyze
Correlate
Partial...
Example: Let us compare the strength of the relationship between current salary and
beginning salary after controlling for the effect of previous experience; a syntax sketch follows.
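A minimal syntax sketch for this example, assuming the variables are named salary, salbegin and prevexp:

* Partial correlation of current and beginning salary, controlling for previous experience.
PARTIAL CORR
/VARIABLES=salary salbegin BY prevexp
/SIGNIFICANCE=TWOTAIL.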
Fitting a simple linear regression model to the data allows us to explain or predict
the values of one variable (the dependent or outcome or response variable or y)
given the values of a second variable (called the independent or exposure or
explanatory variable or x).
The basic idea of simple linear regression is to find the straight line which best
fits the data.
For example if we are interested in predicting under-five mortality rate from percentage
of children immunized against DPT we would treat immunization as independent
variable and mortality rate as dependent variable.
y = a + bx.
To conduct a regression analysis, select the following from the Analyze menu
Analyze
Regression
Linear...
This will produce the following dialog box:
R is the multiple correlation coefficient between all of the predictor variables and the
dependent variable.
Move the dependent variable into the ‘Dependent’ box and the independent
variable into the ‘Independent(s)’ box.
After clicking ‘Statistics’, choose ‘Estimates’, ‘Model fit’, ‘Confidence
intervals’ and ‘R squared change’, then click ‘Continue’ and ‘OK’.
This will give you the between-group and within-group sums of squares,
whose significance is assessed with an F-test.
It also gives you the regression coefficients (the intercept and the slope); an equivalent syntax sketch follows.
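A syntax sketch matching those dialog choices, with illustrative variable names fluency (dependent) and marital (independent); depending on your SPSS version, the confidence-interval keyword is written CI or CI(95):

* Simple linear regression with model fit, confidence intervals and R-squared change.
REGRESSION
/STATISTICS COEFF OUTS CI(95) R ANOVA CHANGE
/DEPENDENT fluency
/METHOD=ENTER marital.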
OUTPUT
Model Summary
The model summary shows the R², which tells us how much of the variation in the
outcome variable the predictor explains; in this example, it is 3.7%.
ANOVA(b)
Model           Sum of Squares   df     Mean Square   F        Sig.
1  Regression   1578.905         1      1578.905      52.271   .000(a)
   Residual     40597.181        1344   30.206
   Total        42176.086        1345
a. Predictors: (Constant), marital status
b. Dependent Variable: verbal fluency - animal naming score
The ANOVA statistics also tell us, via an F-test, whether the explanatory variable
predicts the outcome variable well.
Coefficients(a)
Model               B        Std. Error   Beta    t        Sig.   95% CI Lower Bound   95% CI Upper Bound
1  (Constant)       17.779   .344                 51.718   .000   17.105               18.454
   marital status   -.808    .112         -.193   -7.230   .000   -1.027               -.589
a. Dependent Variable: verbal fluency - animal naming score
The standardized coefficient can be useful because it expresses the effect in
standard-deviation units, making predictors comparable.
The t-statistic assesses significance; a coefficient is significant at the 5% level
when the lower and upper 95% confidence limits are both negative or both
positive (i.e., the interval excludes zero).
Linearity - the relationships between the predictors and the outcome variable
should be linear
Normality - the errors should be normally distributed - technically normality is
necessary only for the t-tests to be valid, estimation of the coefficients only
requires that the errors be identically and independently distributed
Homogeneity of variance (homoscedasticity) - the error variance should be
constant
Independence - the errors associated with one observation are not correlated
with the errors of any other observation
Model specification - the model should be properly specified (including all
relevant variables, and excluding irrelevant variables)
Additionally, there are issues that can arise during the analysis that, while strictly
speaking not assumptions of regression, are nonetheless of great concern to
regression analysts.
Many graphical methods and numerical tests have been developed over the years for
regression diagnostics and SPSS makes many of these methods easy to access and use.
In this chapter, we will explore these methods and show how to verify regression
assumptions and detect potential problems using SPSS.
A single observation that is substantially different from all other observations can make
a large difference in the results of your regression analysis. If a single observation (or
small group of observations) substantially changes your results, you would want to
know about this and investigate further. There are three ways that an observation can
be unusual: it can be an outlier (an unusual outcome value given its predictor values,
i.e., a large residual), a high-leverage point (an extreme predictor value), or an
influential point (one whose removal substantially changes the coefficient estimates).
Leverage: An observation with an extreme value on a predictor variable is called a
point with high leverage. Leverage is a measure of how far an observation deviates
from the mean of that variable. These leverage points can have an unusually large effect
on the estimate of regression coefficients.
How can we identify these three types of observations? Let's look at an example dataset
called crime. This dataset appears in Statistical Methods for Social Sciences, Third
Edition by Alan Agresti and Barbara Finlay (Prentice Hall, 1997). The variables are
state id (sid), state name (state), violent crimes per 100,000 people (crime), murders
per 1,000,000 (murder), the percent of the population living in metropolitan areas
(pctmetro), the percent of the population that is white (pctwhite), percent of
population with a high school education or above (pcths), percent of population living
under poverty line (poverty), and percent of population that are single parents
(single). Below we read in the file and do some descriptive statistics on these
variables. You can click crime.sav to access this file, or see the Regression with SPSS
page to download all of the data files used in this book.
Descriptive Statistics
                     N    Minimum   Maximum   Mean      Std. Deviation
PCTWHITE             51   31.80     98.50     84.1157   13.25839
Valid N (listwise)   51
Let's say that we want to predict crime by pctmetro, poverty, and single. That is to
say, we want to build a linear regression model between the response variable crime
and the independent variables pctmetro, poverty and single. We will first look at the
scatter plots of crime against each of the predictor variables before the regression
analysis so we will have some ideas about potential problems. We can create a scatter
plot matrix of these variables as shown below.
graph
/scatterplot(matrix)=crime murder pctmetro pctwhite pcths poverty single.
The graphs of crime with other variables show some potential problems. In every plot,
we see a data point that is far away from the rest of the data points. Let's make
individual graphs of crime with pctmetro and poverty and single so we can get a
better view of these scatterplots. We will use BY state(name) to plot the state name
instead of a point.
GRAPH /SCATTERPLOT(BIVAR)=single WITH crime BY state(name).
All the scatter plots suggest that the observation for state = "dc" is a point that requires
extra attention since it stands out away from all of the other points. We will keep it in
mind when we do our regression analysis.
Now let's try the regression command predicting crime from pctmetro poverty and
single. We will go step-by-step to identify all the potentially unusual or influential
points afterwards.
regression
/dependent crime
/method=enter pctmetro poverty single.
Variables Entered/Removed(b)

Model Summary(b)
Model   R   R Square   Adjusted R Square   Std. Error of the Estimate

ANOVA(b)
Model      Sum of Squares   df   Mean Square   F   Sig.
   Total   9728474.745      50

Coefficients(a)
Model   B   Std. Error   Beta   t   Sig.
Let's examine the standardized residuals as a first means for identifying outliers. Below
we use the /residuals=histogram subcommand to request a histogram for the
standardized residuals. As you see, we get the standard output that we got above, as
well as a table with information about the smallest and largest residuals, and a
histogram of the standardized residuals. The histogram indicates a couple of extreme
residuals worthy of investigation.
regression
/dependent crime
/method=enter pctmetro poverty single
/residuals=histogram.
Variables Entered/Removed(b)
Model   Variables Entered   Variables Removed   Method

Model Summary(b)

ANOVA(b)
Model           Sum of Squares   df   Mean Square   F        Sig.
1  Regression   8170480.211      3    2723493.404   82.160   .000(a)
   Residual     1557994.534      47   33148.820
   Total        9728474.745      50
a Predictors: (Constant), SINGLE, PCTMETRO, POVERTY
Coefficients(a)
Model   B   Std. Error   Beta   t   Sig.

Residuals Statistics(a)
Let's now request the same kind of information, except for the studentized deleted
residual. The studentized deleted residual is the residual that would be obtained if the
regression was re-run omitting that observation from the analysis. This is useful
because some points are so influential that when they are included in the analysis they
can pull the regression line close to that observation making it appear as though it is not
an outlier -- however when the observation is deleted it then becomes more obvious
how outlying it is. To save space, below we show just the output related to the residual
analysis.
regression
/dependent crime
/method=enter pctmetro poverty single
/residuals=histogram(sdresid).
Residuals Statistics(a)
The histogram shows some possible outliers. We can use the outliers (sdresid) and
id(state) options to request the 10 most extreme values for the studentized deleted
residual to be displayed labeled by the state from which the observation
originated. Below we show the output generated by this option, omitting all of the rest
of the output to save space. You can see that "dc" has the largest value (3.766)
followed by "ms" (-3.571) and "fl" (2.620).
regression
/dependent crime
/method=enter pctmetro poverty single
/residuals=histogram (sdresid) id (state) outliers (sdresid).
Outlier Statistics(a)
                               Case Number   STATE   Statistic
Stud. Deleted Residual    1    51            dc      3.766
                          2    25            ms      -3.571
                          3    9             fl      2.620
                          4    18            la      -1.839
                          5    39            ri      -1.686
                          6    12            ia      1.590
                          7    47            wa      -1.304
                          8    13            id      1.293
                          9    14            il      1.152
                          10   35            oh      -1.148
a Dependent Variable: CRIME
We can use the /casewise subcommand below to request a display of all observations
where the sdresid exceeds 2. To save space, we show just the new output generated by
the /casewise subcommand. This shows us that Florida, Mississippi and Washington
DC have sdresid values exceeding 2.
regression
/dependent crime
/method=enter pctmetro poverty single
/residuals=histogram(sdresid) id(state) outliers(sdresid)
/casewise=plot(sdresid) outliers(2) .
Casewise Diagnostics(a)
Now let's look at the leverage values to identify observations that will have potentially
great influence on the regression coefficient estimates. We can include lever with the
histogram() and outliers() options to get more information about observations
with high leverage. We show just the new output generated by these additional
subcommands below. Generally, a point with leverage greater than (2k+2)/n should be
carefully examined. Here k is the number of predictors and n is the number of
observations, so a value exceeding (2*3+2)/51 = .1568 would be worthy of further
investigation. As you see, there are 4 observations that have leverage values higher
than .1568.
regression
/dependent crime
/method=enter pctmetro poverty single
/residuals=histogram(sdresid lever) id(state) outliers(sdresid lever)
/casewise=plot(sdresid) outliers(2).
Outlier Statistics(a)
                                Case Number   STATE   Statistic
Stud. Deleted Residual     1    51            dc      3.766
                           2    25            ms      -3.571
                           3    9             fl      2.620
                           4    18            la      -1.839
                           5    39            ri      -1.686
                           6    12            ia      1.590
                           7    47            wa      -1.304
                           8    13            id      1.293
                           9    14            il      1.152
                           10   35            oh      -1.148
Centered Leverage Value    1    51            dc      .517
                           2    1             ak      .241
                           3    25            ms      .171
                           4    49            wv      .161
                           5    18            la      .146
                           6    46            vt      .117
                           7    9             fl      .083
                           8    26            mt      .080
                           9    31            nj      .075
                           10   17            ky      .072
a Dependent Variable: CRIME
As we have seen, DC is an observation that both has a large residual and large
leverage. Such points are potentially the most influential. We can make a plot that
shows the leverage by the residual and look for observations that are high in leverage
and have a high residual. We can do this using the /scatterplot subcommand as shown
below. This is a quick way of checking potential influential observations and outliers at
the same time. Both types of points are of great concern for us. As we see, "dc" is both
a high residual and high leverage point, and "ms" has an extremely negative residual
but does not have such a high leverage.
regression
/dependent crime
/method=enter pctmetro poverty single
/residuals=histogram(sdresid lever) id(state) outliers(sdresid, lever)
/casewise=plot(sdresid) outliers(2)
/scatterplot(*lever, *sdresid).
Now let's move on to overall measures of influence, specifically let's look at Cook's D,
which combines information on the residual and leverage. The lowest value that Cook's
D can assume is zero, and the higher the Cook's D is, the more influential the point is.
The conventional cut-off point is 4/n, or in this case 4/51 or .078. Below we add the
cook keyword to the outliers( ) option and also on the /casewise subcommand and
below we see that for the 3 outliers flagged in the "Casewise Diagnostics" table, the
value of Cook's D exceeds this cutoff. And, in the "Outlier Statistics" table, we see that
"dc", "ms", "fl" and "la" are the 4 states that exceed this cutoff, all others falling below
this threshold.
regression
/dependent crime
/method=enter pctmetro poverty single
/residuals=histogram(sdresid lever) id(state) outliers(sdresid, lever, cook)
/casewise=plot(sdresid) outliers(2) cook dffit
/scatterplot(*lever, *sdresid).
Casewise Diagnostics(a)

Outlier Statistics(a)
                                Case Number   STATE   Statistic   Sig. F
Stud. Deleted Residual     1    51            dc      3.766
                           2    25            ms      -3.571
                           3    9             fl      2.620
                           4    18            la      -1.839
                           5    39            ri      -1.686
                           6    12            ia      1.590
                           7    47            wa      -1.304
                           8    13            id      1.293
                           9    14            il      1.152
                           10   35            oh      -1.148
Cook's Distance            1    51            dc      3.203       .021
                           2    25            ms      .602        .663
                           3    9             fl      .174        .951
                           4    18            la      .159        .958
                           5    39            ri      .041        .997
                           6    12            ia      .041        .997
                           7    13            id      .037        .997
                           8    20            md      .020        .999
                           9    6             co      .018        .999
                           10   49            wv      .016        .999
Centered Leverage Value    1    51            dc      .517
                           2    1             ak      .241
                           3    25            ms      .171
                           4    49            wv      .161
                           5    18            la      .146
                           6    46            vt      .117
                           7    9             fl      .083
                           8    26            mt      .080
                           9    31            nj      .075
                           10   17            ky      .072
a Dependent Variable: CRIME
Cook's D can be thought of as a general measure of influence. You can also consider
more specific measures of influence that assess how each coefficient is changed by
including the observation. Imagine that you compute the regression coefficients for the
regression model with a particular case excluded, then recompute the model with the
case included, and observe the change in the regression coefficients due to
including that case in the model. This measure is called DFBETA, and a DFBETA
value can be computed for each observation for each predictor. As shown below, we
use the /save sdbeta(sdfb) subcommand to save the standardized DFBETA values for each of the
predictors. This saves 4 variables into the current data file, sdfb1, sdfb2, sdfb3 and
sdfb4, corresponding to the DFBETA for the intercept and for pctmetro, poverty and
single, respectively. We could replace sdfb with anything we like, and the
variables created would start with the prefix that we provide.
regression
/dependent crime
/method=enter pctmetro poverty single
/residuals=histogram(sdresid lever) id(state) outliers(sdresid, lever, cook)
/casewise=plot(sdresid) outliers(2) cook dffit
/scatterplot(*lever, *sdresid)
/save sdbeta(sdfb).
The /save sdbeta(sdfb) subcommand does not produce any new output, but we can see
the variables it created for the first 10 cases using the list command below. For
example, by including the case for "ak" in the regression analysis (as compared to
excluding it), the coefficient for pctmetro changes by −.106 standard
errors. Likewise, including "ak" changes the coefficient for poverty by
−.131 standard errors and the coefficient for single by +.145 standard errors
(as compared to a model excluding "ak").
either contribute to an increase or decrease in a regression coefficient, DFBETAs can
be either positive or negative. A DFBETA value in excess of 2/sqrt(n) merits further
investigation. In this example, we would be concerned about absolute values in excess
of 2/sqrt(51) or .28.
list
/variables state sdfb1 sdfb2 sdfb3
/cases from 1 to 10.
STATE SDFB1 SDFB2 SDFB3
ak -.10618 -.13134 .14518
al .01243 .05529 -.02751
ar -.06875 .17535 -.10526
az -.09476 -.03088 .00124
ca .01264 .00880 -.00364
co -.03705 .19393 -.13846
ct -.12016 .07446 .03017
de .00558 -.01143 .00519
fl .64175 .59593 -.56060
ga .03171 .06426 -.09120
Number of cases read: 10 Number of cases listed: 10
We can plot the three DFBETA values against the state id in one
graph, shown below, to help us see potentially troublesome observations. We
changed the variable labels for sdfb1, sdfb2 and sdfb3 so they would be shorter and more
clearly displayed in the graph. We can see that the DFBETA for single for "dc" is about
3, indicating that by including "dc" in the regression model, the coefficient for single is
3 standard errors larger than it would have been if "dc" had been omitted. This is yet
another bit of evidence that the observation for "dc" is very problematic.
The following table summarizes the general rules of thumb we use for the measures we
have discussed for identifying observations worthy of further investigation (where k is
the number of predictors and n is the number of observations).
Measure Value
leverage >(2k+2)/n
abs(rstu) >2
Cook's D > 4/n
abs(DFBETA) > 2/sqrt(n)
We have shown a few examples of the variables that you can refer to in the /residuals,
/casewise, /scatterplot and /save sdbeta() subcommands. Here is a list of all of the
variables that can be used on these subcommands; however, not all variables can be
used on each subcommand.
In addition to the numerical measures we have shown above, there are also several
graphs that can be used to search for unusual and influential observations. The partial-
regression plot is very useful in identifying influential points. For example below we
add the /partial plot subcommand to produce partial-regression plots for all of the
predictors. For example, in the 3rd plot below you can see the partial-regression plot
showing crime by single after both crime and single have been adjusted for all other
predictors in the model. The line plotted has the same slope as the coefficient for
single. This plot shows how the observation for DC influences the coefficient. You
can see how the regression line is tugged upwards trying to fit through the extreme
value of DC. Alaska and West Virginia may also exert substantial leverage on the
coefficient of single as well. These plots are useful for seeing how a single point may
be influencing the regression line, while taking other variables in the model into
account.
Note that the regression line is not automatically produced in the graph. We double-clicked
on the graph, then chose "Chart", then "Options", and then chose "Fit
Line Total" to add a regression line to each of the graphs below.
regression
/dependent crime
/method=enter pctmetro poverty single
/residuals=histogram (sdresid lever) id(state) outliers(sdresid, lever, cook)
/casewise=plot(sdresid) outliers(2) cook dffit
/scatterplot(*lever, *sdresid)
/partialplot.
DC has appeared as an outlier as well as an influential point in every analysis. Since
DC is really not a state, we can use this to justify omitting it from the analysis saying
that we really wish to just analyze states. First, let's repeat our analysis including DC
below.
regression
/dependent crime
/method=enter pctmetro poverty single.
Coefficients(a)
Model           B           Std. Error   Beta    t         Sig.
1  (Constant)   -1666.436   147.852              -11.271   .000
   PCTMETRO     7.829       1.255        .390    6.240     .000
   POVERTY      17.680      6.941        .184    2.547     .014
   SINGLE       132.408     15.503       .637    8.541     .000
a Dependent Variable: CRIME
Now, let's run the analysis omitting DC by using the filter command to exclude "dc" from
the analysis. As we expect, deleting DC makes a large change in the coefficient for
single, which drops from 132.4 to 89.4. After deleting DC, we would repeat the
process we have illustrated in this section to search for any other
outlying and influential observations.
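One way to set up that filter, assuming state is a string variable (the filter variable name notdc is illustrative):

* Exclude the dc case, re-run the regression, then turn the filter off.
compute notdc = (state ne "dc").
filter by notdc.
regression
/dependent crime
/method=enter pctmetro poverty single.
filter off.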
Coefficients(a)
Model           B           Std. Error   Beta    t        Sig.
1  (Constant)   -1197.538   180.487              -6.635   .000
   PCTMETRO     7.712       1.109        .565    6.953    .000
   POVERTY      18.283      6.136        .265    2.980    .005
   SINGLE       89.401      17.836       .446    5.012    .000
a Dependent Variable: CRIME
Summary
In this section, we explored a number of methods of identifying outliers and influential
points. In a typical analysis, you would probably use only some of these
methods. Generally speaking, there are two types of methods for assessing outliers:
statistics such as residuals, leverage, and Cook's D, which assess the overall impact of
an observation on the regression results, and statistics such as DFBETA that assess the
specific impact of an observation on the regression coefficients. In our example, we
found out that DC was a point of major concern. We performed a regression with it
and without it and the regression equations were very different. We can justify
removing it from our analysis by reasoning that our model is to predict crime rate for
states not for metropolitan areas.
One of the assumptions of linear regression analysis is that the residuals are normally
distributed. It is important to meet this assumption for the p-values for the t-tests to be
valid. Let's use the elemapi2 data file we saw in Chapter 1 for these analyses. Let's
predict academic performance (api00) from percent receiving free meals (meals),
percent of English language learners (ell), and percent of teachers with emergency
credentials (emer). We then use the /save command to generate residuals.
get file="c:\spssreg\elemapi2.sav".
regression
/dependent api00
/method=enter meals ell emer
/save resid(apires).
Variables Entered/Removed(b)
Model   Variables Entered     Variables Removed   Method
1       EMER, ELL, MEALS(a)   .                   Enter
a All requested variables entered.
b Dependent Variable: API00

Model Summary(b)
Model   R   R Square   Adjusted R Square   Std. Error of the Estimate
ANOVA(b)
Model           Sum of Squares   df    Mean Square   F         Sig.
1  Regression   6749782.747      3     2249927.582   672.995   .000(a)
   Residual     1323889.251      396   3343.155
   Total        8073671.997      399
a Predictors: (Constant), EMER, ELL, MEALS
b Dependent Variable: API00

Coefficients(a)
Model   B   Std. Error   Beta   t   Sig.

Casewise Diagnostics(a)
Case Number   Std. Residual   API00
93            3.087           604
Residuals Statistics(a)
We now use the examine command to look at the normality of these residuals. All of
the results from the examine command suggest that the residuals are normally
distributed -- the skewness and kurtosis are near 0, the "tests of normality" are not
significant, the histogram looks normal, and the Q-Q plot looks normal. Based on these
results, the residuals from this regression appear to conform to the assumption of being
normally distributed.
examine
variables=apires
/plot boxplot stemleaf histogram npplot.
Case Processing Summary

Descriptives
Tests of Normality
Kolmogorov-Smirnov(a) Shapiro-Wilk
Statistic df Sig. Statistic df Sig.
APIRES .033 400 .200(*) .996 400 .510
* This is a lower bound of the true significance.
a Lilliefors Significance Correction
Unstandardized Residual Stem-and-Leaf Plot
Frequency Stem & Leaf
1.00 Extremes (=<-185)
2.00 -1 . 4
3.00 -1 . 2&
7.00 -1 . 000
15.00 -0 . 8888899
35.00 -0 . 66666666667777777
37.00 -0 . 444444444555555555
49.00 -0 . 222222222222223333333333
61.00 -0 . 000000000000000011111111111111
48.00 0 . 000000111111111111111111
49.00 0 . 222222222222233333333333
28.00 0 . 4444445555555
31.00 0 . 666666666677777
16.00 0 . 88888899
9.00 1 . 0011
3.00 1 . 2&
1.00 1 .&
5.00 Extremes (>=152)
Stem width: 100.0000
Each leaf: 2 case(s); & denotes fractional leaves.
Heteroscedasticity
Another assumption of ordinary least squares regression is that the variance of the
residuals is homogeneous across levels of the predicted values, also known as
homoscedasticity. If the model is well-fitted, there should be no pattern to the residuals
plotted against the fitted values. If the variance of the residuals is non-constant then the
residual variance is said to be "heteroscedastic." Below we illustrate graphical methods
for detecting heteroscedasticity. A commonly used graphical method is to use the
residual versus fitted plot to show the residuals versus fitted (predicted) values. Below
we use the /scatterplot subcommand to plot *zresid (standardized residuals) by *pred
(the predicted values). We see that the pattern of the data points is getting a little
narrower towards the right end, an indication of mild heteroscedasticity.
regression
/dependent api00
/method=enter meals ell emer
/scatterplot(*zresid *pred).
Let's run a model where we include just enroll as a predictor and show the residual vs.
predicted plot. As you can see, this plot shows serious heteroscedasticity. The
variability of the residuals when the predicted value is around 700 is much larger than
when the predicted value is 600 or when the predicted value is 500.
regression
/dependent api00
/method=enter enroll
/scatterplot(*zresid *pred).
As we saw in Chapter 1, the variable enroll was skewed considerably to the right, and
we found that by taking a log transformation, the transformed variable was more
normally distributed. Below we transform enroll, run the regression and show the
residual versus fitted plot. The distribution of the residuals is much
improved. Certainly, this is not a perfect distribution of residuals, but it is much better
than the distribution with the untransformed variable.
compute lenroll = ln(enroll).
regression
/dependent api00
/method=enter lenroll
/scatterplot(*zresid *pred).
Variables Entered/Removed(b)
Model   Variables Entered   Variables Removed   Method
1       LENROLL(a)          .                   Enter
a. All requested variables entered.
b. Dependent Variable: API00

Model Summary(b)

ANOVA(b)
Model           Sum of Squares   df    Mean Square   F        Sig.
1  Regression   609460.408       1     609460.408    32.497   .000(a)
   Residual     7464211.589      398   18754.300
   Total        8073671.997      399
a Predictors: (Constant), LENROLL
b Dependent Variable: API00
Coefficients(a)
Model           B          Std. Error   Beta    t        Sig.
1  (Constant)   1170.429   91.966               12.727   .000
   LENROLL      -86.000    15.086       -.275   -5.701   .000
a Dependent Variable: API00

Residuals Statistics(a)
                       Minimum   Maximum   Mean   Std. Deviation   N
Std. Predicted Value   -2.816    2.666     .000   1.000            400
Finally, let's revisit the model we used at the start of this section, predicting api00 from
meals, ell and emer. Using this model, the distribution of the residuals looked very
nice and even across the fitted values. What if we add enroll to this model? Will this
automatically ruin the distribution of the residuals? Let's add it and see.
regression
/dependent api00
/method=enter meals ell emer enroll
/scatterplot(*zresid *pred).
Variables Entered/Removed(b)
Model   Variables Entered             Variables Removed   Method
1       ENROLL, MEALS, EMER, ELL(a)   .                   Enter
a All requested variables entered.
b Dependent Variable: API00

Model Summary(b)
Model   R         R Square   Adjusted R Square   Std. Error of the Estimate
1       .915(a)   .838       .836                57.552
a Predictors: (Constant), ENROLL, MEALS, EMER, ELL
b Dependent Variable: API00

ANOVA(b)
Model           Sum of Squares   df    Mean Square   F         Sig.
1  Regression   6765344.050      4     1691336.012   510.635   .000(a)
   Residual     1308327.948      395   3312.223
   Total        8073671.997      399
a Predictors: (Constant), ENROLL, MEALS, EMER, ELL
b Dependent Variable: API00
Coefficients(a)
Model           B            Std. Error   Beta    t         Sig.
1  (Constant)   899.147      8.472                106.128   .000
   MEALS        -3.222       .152         -.723   -21.223   .000
   ELL          -.768        .195         -.134   -3.934    .000
   EMER         -1.418       .300         -.117   -4.721    .000
   ENROLL       -3.126E-02   .014         -.050   -2.168    .031
a Dependent Variable: API00

Casewise Diagnostics(a)
Case Number   Std. Residual   API00
93            3.004           604

Residuals Statistics(a)
                       Minimum   Maximum   Mean   Std. Deviation   N
Std. Predicted Value   -1.665    1.847     .000   1.000            400
As you can see, the distribution of the residuals looks fine, even after we added the
variable enroll. When we had just the variable enroll in the model, we did a log
transformation to improve the distribution of the residuals, but when enroll was part of
a model with other variables, the residuals looked good so no transformation was
needed. This illustrates how the distribution of the residuals, not the distribution of the
predictor, was the guiding factor in determining whether a transformation was needed.
7.3.3.3. Collinearity
When there is a perfect linear relationship among the predictors, the estimates for a
regression model cannot be uniquely computed. The term collinearity implies that two
variables are near perfect linear combinations of one another. When more than two
variables are involved it is often called multicollinearity, although the two terms are
often used interchangeably.
The primary concern is that as the degree of multicollinearity increases, the regression
model estimates of the coefficients become unstable and the standard errors for the
coefficients can get wildly inflated. In this section, we will explore some SPSS
commands that help to detect multicollinearity.
We can use the /statistics=defaults tol subcommand to request the display of "tolerance" and "VIF"
values for each predictor as a check for multicollinearity. The "tolerance" is an
indication of the percent of variance in the predictor that cannot be accounted for by the
other predictors, hence very small values indicate that a predictor is redundant, and
values that are less than .10 may merit further investigation. The VIF, which stands for
variance inflation factor, is (1 / tolerance) and as a rule of thumb, a variable whose VIF
values is greater than 10 may merit further investigation. Let's first look at the
regression we did from the last section, the regression model predicting api00 from
meals, ell and emer using the /statistics=defaults tol subcommand. As you can see,
the "tolerance" and "VIF" values are all quite acceptable.
regression
/statistics=defaults tol
/dependent api00
/method=enter meals ell emer .
<some output deleted to save space>
Coefficients(a)
Now let's consider another example where the "tolerance" and "VIF" values are more
worrisome. In the regression analysis below, we use acs_k3, avg_ed, grad_sch, col_grad
and some_col as predictors of api00. As you see, the "tolerance" values for
avg_ed, grad_sch and col_grad are below .10; for avg_ed it is about 0.02, indicating that
only about 2% of the variance in avg_ed is not predictable given the other predictors in
the model. All of these variables measure the education of the parents, and the very low
"tolerance" values indicate that they contain redundant information. For
example, after you know grad_sch and col_grad, you can probably predict avg_ed
very well. In this example, multicollinearity arises because we have put in too many
variables that measure the same thing, parent education.
We also include the collin option which produces the "Collinearity Diagnostics" table
below. The very low eigenvalue for the 5th dimension (since there are 5 predictors) is
another indication of problems with multicollinearity. Likewise, the very high
"Condition Index" for dimension 5 similarly indicates problems with multicollinearity
with these predictors.
regression
/statistics=defaults tol collin
/dependent api00
/method=enter acs_k3 avg_ed grad_sch col_grad some_col.
<some output deleted to save space>
Coefficients(a)

Collinearity Diagnostics(a)
Nonlinearity

The output below checks linearity using a simple regression of birth rate (birth) on
per-capita GNP (gnpcap).

Variables Entered/Removed(b)
Model   Variables Entered   Variables Removed   Method
1       GNPCAP(a)           .                   Enter
a All requested variables entered.
b Dependent Variable: BIRTH

Model Summary(b)
Model   R         R Square   Adjusted R Square   Std. Error of the Estimate
1       .626(a)   .392       .387                10.679
a Predictors: (Constant), GNPCAP
b Dependent Variable: BIRTH

ANOVA(b)
Model           Sum of Squares   df    Mean Square   F        Sig.
1  Regression   7873.995         1     7873.995      69.047   .000(a)
   Residual     12202.152        107   114.039
   Total        20076.147        108
a Predictors: (Constant), GNPCAP
b Dependent Variable: BIRTH

Coefficients(a)
Model   B   Std. Error   Beta   t   Sig.

Residuals Statistics(a)
We modified the above scatter plot, changing the fit line from linear regression to
"lowess" by choosing "Chart", then "Options", then "Fit Options", and choosing
"Lowess" with the default smoothing parameters. As you can see, the
"lowess" smoothed curve fits substantially better than the linear regression line, further
suggesting that the relationship between gnpcap and birth is not linear.
We can see that the gnpcap scores are quite skewed, with most values near 0 and
a handful of values of 10,000 and higher. This suggests that some transformation
of the variable may be necessary. One commonly used transformation is a log
transformation, so let's try that. As you see, the scatter plot between lgnpcap and birth
looks much better, with the regression line going through the heart of the data. Also, the
plot of the residuals by predicted values looks much more reasonable.
compute lgnpcap = ln(gnpcap).
regression
/dependent birth
/method=enter lgnpcap
/scatterplot(*zresid *pred) /scat(birth lgnpcap)
/save resid(bres2).
Variables Entered/Removed(b)
Model   Variables Entered   Variables Removed   Method
1       LGNPCAP(a)          .                   Enter
a All requested variables entered.
b Dependent Variable: BIRTH

Model Summary(b)

ANOVA(b)
Model   Sum of Squares   df   Mean Square   F   Sig.

Coefficients(a)
Model   B   Std. Error   Beta   t   Sig.
Residuals Statistics(a)
                       Minimum   Maximum   Mean    Std. Deviation   N
Predicted Value        12.86     50.25     32.79   10.305           109
Residual               -24.75    24.98     .00     8.927            109
Std. Predicted Value   -1.934    1.695     .000    1.000            109
Std. Residual          -2.760    2.786     .000    .995             109
a Dependent Variable: BIRTH
This section has shown how you can use scatter plots to diagnose problems of
nonlinearity, both by looking at the scatter plot of the predictor against the outcome
variable and by examining the residuals by predicted values. These examples have
focused on simple regression; similar techniques are useful in multiple regression,
although there it is more informative to examine partial regression plots rather than
the simple scatter plots between each predictor variable and the outcome variable.
Model Specification

A model specification error can occur when one or more relevant variables are omitted
from the model or one or more irrelevant variables are included in the model. If
relevant variables are omitted from the model, the common variance they share with
included variables may be wrongly attributed to those variables, and the error term can
be inflated. On the other hand, if irrelevant variables are included in the model, the
common variance they share with included variables may be wrongly attributed to
them. Model specification errors can substantially affect the estimate of regression
coefficients.
Consider the model below. This regression suggests that as class size increases,
academic performance increases, with p = 0.053. Before we publish results saying that
increased class size is associated with higher academic performance, let's check the
model specification.

regression
/dependent api00
/method=enter acs_k3 full
/save pred(apipred).
Coefficients(a)
Model   B   Std. Error   Beta   t   Sig.
SPSS does not have any tools that directly support finding specification errors,
but you can check for omitted variables using the procedure below. As you
noticed above, when we ran the regression we saved the predicted value, calling it
apipred. If we use the predicted value and the predicted value squared as predictors of
the dependent variable, apipred should be significant since it is the predicted value, but
apipred squared shouldn't be a significant predictor because, if our model is specified
correctly, the squared predictions should not have much explanatory power above
and beyond the predicted value. That is, we wouldn't expect apipred squared to be a
significant predictor if our model is specified correctly. Below we compute apipred2 as
the squared value of apipred and then include apipred and apipred2 as predictors in
our regression model, and we hope to find that apipred2 is not significant.
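A sketch of that compute-and-regress step, mirroring the preda2 example shown later in this section:

compute apipred2 = apipred**2.
regression
/dependent api00
/method=enter apipred apipred2.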
Coefficients(a)
Model           B           Std. Error   Beta     t        Sig.
1  (Constant)   858.873     283.460               3.030    .003
   APIPRED      -1.869      .937         -1.088   -1.994   .047
   APIPRED2     2.344E-03   .001         1.674    3.070    .002
a Dependent Variable: API00
The above results show that apipred2 is significant, suggesting that we may have
omitted important variables in our regression. We therefore should consider whether we
should add any other variables to our model. Let's try adding the variable meals to the
above model. We see that meals is a significant predictor, and we save the predicted
value calling it preda for inclusion in the next analysis for testing to see whether we
have any additional important omitted variables.
regression
/dependent api00
/method=enter acs_k3 full meals
/save pred(preda).
<some output omitted to save space>
Coefficients(a)
Model           B         Std. Error   Beta    t         Sig.
1  (Constant)   771.658   48.861                15.793    .000
   ACS_K3       -.717     2.239        -.007    -.320     .749
   FULL         1.327     .239         .139     5.556     .000
   MEALS        -3.686    .112         -.828    -32.978   .000
a Dependent Variable: API00
We now create preda2 which is the square of preda, and include both of these as
predictors in our model.
compute preda2 = preda**2.
regression
/dependent api00
/method=enter preda preda2.
<some output omitted to save space>
Coefficients(a)
Model           B            Std. Error   Beta    t        Sig.
1  (Constant)   -136.510     95.059               -1.436   .152
   PREDA        1.424        .293         1.293   4.869    .000
   PREDA2       -3.172E-04   .000         -.386   -1.455   .146
a Dependent Variable: API00
We now see that preda2 is not significant, so this test does not suggest there are any
other important omitted variables. Note that after including meals and full, the
coefficient for class size is no longer significant. acs_k3 does have a positive
relationship with api00 when only full is included in the model, but when we also
include (and hence control for) meals, acs_k3 is no longer significantly related to
api00 and its relationship with api00 is no longer positive.
Independence

Another way in which the assumption of independence can be broken is when data are
collected on the same variables over time. Let's say that we collect truancy data every
semester for 12 years. In this situation it is likely that the errors for observations
between adjacent semesters will be more highly correlated than for observations more
separated in time -- this is known as autocorrelation. When you have data that can be
considered to be time-series you can use the Durbin-Watson statistic to test for
correlated residuals.
We don't have any time-series data, so we will use the elemapi2 dataset and pretend
that snum indicates the time at which the data were collected. We will sort the data on
snum to order the data according to our fake time variable and then we can run the
regression analysis with the durbin option to request the Durbin-Watson test. The
Durbin-Watson statistic has a range from 0 to 4 with a midpoint of 2. The observed
value in our example is less than 2, which is not surprising since our data are not truly
time-series.
sort cases by snum .
regression
/dependent api00
/method=enter enroll
/residuals = durbin .
Model Summary
Model   R   R Square   Adjusted R Square   Std. Error of the Estimate   Durbin-Watson
7.3.3.7. Summary
This chapter has covered a variety of topics in assessing the assumptions of regression
using SPSS, and the consequences of violating these assumptions. As we have seen, it
is not sufficient simply to run a regression analysis; it is important to verify that the
assumptions have been met. If this verification stage is omitted and your data do not
meet the assumptions of linear regression, your results could be misleading and your
interpretation of them could be in doubt. Without thoroughly checking your data
for problems, it is possible that another researcher could analyze your data, uncover
such problems, and question your results by showing an improved analysis that
contradicts them and undermines your conclusions.
Minitab does not explicitly produce partial regression plots. Fortunately, they can be
created easily (if tediously, for large models): to make the partial regression plot for a
predictor xk, regress y on all of the other predictors and store the residuals; regress xk
on the other predictors and store those residuals; then plot the first set of residuals
against the second.
Residuals                the ordinary residuals (observed minus fitted values)
Standardized residuals   each residual divided by its standard error
Deleted t residuals      each studentized deleted residual, i.e. the deleted residual
                         (the residual for observation i when the model is fit without
                         observation i) divided by its standard error
In the Graphs… window in the regression procedure, these three kinds of residuals are
called Regular, Standardized, and Deleted, respectively. The standardized residuals are
what Minitab uses to flag unusually large residuals (any observations with standardized
residual greater than 2 in absolute value).
DFBETAS
Minitab does not explicitly produce DFBETAS statistics of influence on particular
coefficients. A DFBETAS value can be calculated for a particular suspect observation i
(perhaps flagged by the preceding measures) and coefficient k as follows:
(1) from the regression on the full data set, obtain
• the estimated coefficient bk
• the matrix (X'X)^-1 (this is "X'X inverse" in the Storage… window)
(2) from the regression without observation i, obtain
• the estimated coefficient bk(i)
• the residual standard error s(i)
(3) find ckk, the kth diagonal element of (X'X)^-1
(4) calculate DFBETAS = (bk − bk(i)) / (s(i) · sqrt(ckk))
REFERENCES
Minitab Inc. (2007). Meet Minitab 15. State College, PA: Minitab Inc.