Factor Analysis
A non-technical analogy
A mother sees various bumps and shapes under a blanket at the
bottom of a bed. When one shape moves toward the top of the bed,
all the other bumps and shapes move toward the top also, so the
mother concludes that what is under the blanket is a single thing,
most likely her child. Similarly, factor analysis takes as input a
number of measures and tests, analogous to the bumps and
shapes. Those that move together are considered a single thing,
which it labels a factor. That is, in factor analysis the researcher is
assuming that there is a "child" out there in the form of an underlying
factor, and he or she takes simultaneous movement (correlation) as
evidence of its existence. If correlation is spurious for some reason,
this inference will be mistaken, of course, so it is important when
conducting factor analysis that possible variables which might
introduce spuriousness, such as anteceding causes, be included in
the analysis and taken into account.
Initial Considerations
Sample Size
Correlation coefficients fluctuate from sample to sample, much more
so in small samples than in large.
Therefore, the reliability of factor analysis is also dependent on
sample size.
Much has been written about the necessary sample size for factor
analysis resulting in many rules-of-thumb.
The common rule of thumb is that a researcher should have at least 10-15
subjects per variable.
Although I've heard this rule bandied about on numerous occasions, its
empirical basis is unclear (although Nunnally, 1978, did recommend
having 10 times as many subjects as variables).
Kass and Tinsley (1979) recommended having between 5 and 10
subjects per variable up to a total of 300 (beyond which test parameters
tend to be stable regardless of the subject to variable ratio).
In fact, Tabachnick and Fidell (1996) agree that 'it is comforting to have
at least 300 cases for factor analysis' (p. 640), and Comrey and Lee
(1992) class 300 as a good sample size, 100 as poor and 1000 as
excellent.
So, what's clear from this work is that a sample of 300 or more
will probably provide a stable factor solution, but a wise
researcher will measure enough variables to adequately
measure all of the factors that they would theoretically expect
to find.
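These rules of thumb are easy to check in code. The sketch below is a hypothetical helper, not anything SPSS provides; the figure of 23 variables is assumed purely for illustration.

```python
# Hypothetical helper illustrating the rules of thumb above:
# Nunnally's (1978) 10-subjects-per-variable rule, and the finding
# that samples of 300+ tend to give stable factor solutions.

def sample_size_check(n_subjects, n_variables, ratio=10, stable_n=300):
    """Return a short verdict on sample adequacy for factor analysis."""
    if n_subjects >= stable_n:
        return "likely stable (N >= 300)"
    if n_subjects >= ratio * n_variables:
        return "meets the %d-per-variable rule of thumb" % ratio
    return "probably too small"

# 2571 subjects (the sample used later); 23 variables is an assumption
print(sample_size_check(2571, 23))  # -> likely stable (N >= 300)
print(sample_size_check(90, 20))    # -> probably too small
```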
Data Screening
If we find any variables that do not correlate with
any other variables (or with very few), we should
consider excluding them before the
factor analysis is run.
In such cases, a variable correlates only with itself and
all its other correlation coefficients are close to zero.
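This screening step is easy to sketch in plain Python. The correlation matrix and the 0.3 threshold below are invented for illustration; in practice you would inspect the R-matrix SPSS prints.

```python
# Flag any variable whose correlations with every other variable are
# near zero. The matrix and threshold are made up for illustration.

R = [
    [1.00, 0.52, 0.47, 0.03],
    [0.52, 1.00, 0.61, 0.05],
    [0.47, 0.61, 1.00, 0.02],
    [0.03, 0.05, 0.02, 1.00],  # variable 3 barely correlates with anything
]

def weak_variables(R, threshold=0.3):
    """Indices of variables whose every off-diagonal |r| is below threshold."""
    flagged = []
    for i, row in enumerate(R):
        if all(abs(r) < threshold for j, r in enumerate(row) if j != i):
            flagged.append(i)
    return flagged

print(weak_variables(R))  # -> [3]: a candidate for exclusion
```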
Our Example: The SPSS Stress Test
There are several options available. The first can be accessed by clicking
on _____ to open the dialog box shown in the preceding Figure.
The Univariate descriptives option provides means and standard deviations for each
variable.
The choice of which of the two variables to eliminate will be fairly arbitrary and finding
multicollinearity in the data should raise questions about the choice of items within your
questionnaire.
KMO and Bartlett's test of sphericity produces the Kaiser-Meyer-Olkin measure of
sampling adequacy and Bartlett's test.
With a sample of 2571 we shouldn't have cause to worry about the sample size.
The value of KMO should be greater than 0.5 if the sample is adequate.
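The KMO statistic compares the sizes of the observed correlations with the sizes of the partial correlations. As a rough sketch (not SPSS's own code), it can be computed from a correlation matrix like so; the 3×3 matrix is invented for illustration, and NumPy stands in for the matrix algebra SPSS does internally.

```python
import numpy as np

def kmo(R):
    """Kaiser-Meyer-Olkin measure of sampling adequacy from a
    correlation matrix R."""
    R = np.asarray(R, dtype=float)
    inv = np.linalg.inv(R)
    # partial correlations from the inverse of the correlation matrix
    d = np.sqrt(np.outer(np.diag(inv), np.diag(inv)))
    partial = -inv / d
    off = ~np.eye(R.shape[0], dtype=bool)   # off-diagonal mask
    r2 = (R[off] ** 2).sum()
    p2 = (partial[off] ** 2).sum()
    return r2 / (r2 + p2)

R = [[1.0, 0.6, 0.5],
     [0.6, 1.0, 0.4],
     [0.5, 0.4, 1.0]]
print(round(kmo(R), 3))  # -> 0.658, above the 0.5 adequacy threshold
```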
When you have finished with this dialog box click on _____ to return to the main dialog
box.
Rotation Techniques
The interpretability of factors can be improved through rotation.
Rotation maximizes the loading of each variable on one of the extracted factors while minimizing
the loading on all other factors.
This process makes it much clearer which variables relate to which factors.
Rotation works by changing the absolute values of the factor loadings while keeping their
differential values constant.
Click on _
to access the dialog box.
Varimax, quartimax and equamax are all orthogonal rotations while direct oblimin and promax are oblique
rotations.
Quartimax rotation attempts to maximize the spread of factor loadings for a variable across all factors.
Varimax is the opposite in that it attempts to maximize the dispersion of loadings within factors.
Therefore, it tries to load a smaller number of variables highly onto each factor resulting in more interpretable clusters of
factors.
Equamax is a hybrid of the other two approaches and is reported to behave fairly erratically.
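To make the varimax criterion concrete, here is a minimal sketch of the standard iterative algorithm (Kaiser's method, implemented via the SVD). This is an illustration of the technique, not SPSS's internal routine; the loading matrix at the bottom is invented.

```python
import numpy as np

def varimax(L, max_iter=25, tol=1e-6):
    """Orthogonally rotate a loading matrix L to maximize the variance
    of squared loadings within each factor (the varimax criterion)."""
    n, k = L.shape
    T = np.eye(k)               # cumulative rotation matrix
    var_old = 0.0
    for _ in range(max_iter):   # 25 iterations matches SPSS's default
        B = L @ T
        G = L.T @ (B ** 3 - B @ np.diag((B ** 2).sum(axis=0)) / n)
        U, s, Vt = np.linalg.svd(G)
        T = U @ Vt              # orthogonal, so communalities are preserved
        var_new = s.sum()
        if var_new - var_old < tol:
            break
        var_old = var_new
    return L @ T

# invented loadings: four variables on two factors
L = np.array([[0.7, 0.3], [0.6, 0.4], [0.2, 0.8], [0.3, 0.7]])
print(np.round(varimax(L), 2))  # rotated loading matrix
```

Because the rotation is orthogonal, each variable's communality (its row sum of squared loadings) is unchanged; only the distribution of loadings across factors moves toward simple structure.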
In most circumstances the default of 25 is more than adequate for SPSS to find a solution for a given
data set.
However, if you have a large data set (like we have here) then the computer might have difficulty finding a
solution (especially for oblique rotation). To allow for the large data set we are using, change the value to 30.
Scores
The factor scores dialog box can be accessed by clicking _____ in the
main dialog box.
This option allows you to save factor scores for each subject in the data
editor.
SPSS creates a new column for each factor extracted and then places
the factor score for each subject within that column.
These scores can then be used for further analysis, or simply to identify
groups of subjects who score highly on particular factors.
Options
This set of options can be obtained by clicking on _____ in the main dialog box.
Missing data are a problem for factor analysis just like most other procedures and SPSS
provides a choice of excluding cases or estimating a value for a case.
You should consider the distribution of missing data.
If the missing data are non-normally distributed, or the sample size after exclusion is
too small, then estimation is necessary.
SPSS uses the mean as an estimate (Replace with mean).
These procedures lower the standard deviation of variables and so can lead to
significant results that would otherwise be non-significant.
Therefore, if missing data are random, you might consider excluding cases. SPSS
allows you either to Exclude cases listwise, in which case any subject with missing
data for any variable is excluded, or to Exclude cases pairwise, in which case a
subject's data are excluded only from calculations for which a datum is missing.
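The difference between these strategies can be sketched in plain Python (the data are invented; None marks a missing datum). Pairwise exclusion is omitted here because it happens per calculation rather than per subject.

```python
# Listwise exclusion versus mean replacement, on invented data.
data = [
    [2.0, 3.0, 4.0],
    [1.0, None, 5.0],
    [3.0, 4.0, None],
    [2.0, 5.0, 3.0],
]

def listwise(rows):
    """Exclude any subject with a missing value on any variable."""
    return [r for r in rows if None not in r]

def replace_with_mean(rows):
    """Replace each missing datum with its variable's mean
    (SPSS's 'Replace with mean' option; note this shrinks the SD)."""
    cols = list(zip(*rows))
    means = [sum(v for v in c if v is not None) /
             sum(v is not None for v in c) for c in cols]
    return [[means[j] if v is None else v for j, v in enumerate(r)]
            for r in rows]

print(listwise(data))           # only the two complete subjects remain
print(replace_with_mean(data))  # gaps filled with column means
```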
The final two options relate to how coefficients are displayed.
By default SPSS will list variables in the order in which they are entered into the data editor. Usually, this
format is most convenient.
However, when interpreting factors it is sometimes useful to list variables by size.
By selecting Sorted by size, SPSS will order the variables by their factor loadings.
In fact, it does this sorting fairly intelligently so that all of the variables that load highly onto the same
factor are displayed together.
The second option is to Suppress absolute values less than a specified value (by default 0.1).
This option ensures that factor loadings within ±0.1 are not displayed in the output.
Again, this option is useful for assisting in interpretation.
The default value is not that useful and I recommend changing it either to 0.4 (for
interpretation purposes) or to a value reflecting the expected value of a significant
factor loading given the sample size.
For this example set the value at .40.
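These two display options can be mimicked in a few lines of Python. The loadings below are invented, and the sort is a simplification: SPSS groups variables by the factor they load on, whereas this sketch just orders each variable by its largest absolute loading.

```python
# Sketch of 'Sorted by size' and 'Suppress absolute values less than',
# using invented loadings for four questions on two factors.

loadings = {
    "Q1": [0.82, 0.10],
    "Q2": [0.35, 0.15],
    "Q3": [0.12, 0.76],
    "Q4": [0.91, 0.05],
}

def formatted(loadings, cutoff=0.40):
    """Sort variables by their largest absolute loading and blank out
    any loading below the cutoff."""
    rows = sorted(loadings.items(),
                  key=lambda kv: max(abs(x) for x in kv[1]), reverse=True)
    return [(name, ["%.2f" % x if abs(x) >= cutoff else "" for x in row])
            for name, row in rows]

for name, shown in formatted(loadings):
    print(name, shown)   # Q2's loadings never reach 0.40, so both blank
```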
Preliminary Analysis
The first body of output concerns data screening, assumption testing and sampling adequacy.
You'll find several large tables (or matrices) that tell us interesting things about our data.
If you selected the Univariate descriptives option then the first table will contain descriptive
statistics for each variable (the mean, standard deviation and number of cases).
The table also includes the number of missing cases; this summary is a useful way to
determine the extent of missing data.
In summary, all questions in the SPSS Stress Test correlate fairly well with all others (this is partly
because of the large sample) and none of the correlation coefficients are particularly large; therefore,
there is no need to consider eliminating any questions at this stage.
Factor Extraction
The first part of the factor extraction
process is to determine the linear
components within the data set (the
eigenvectors) by calculating the
eigenvalues of the R-matrix.
We know that there are as many
components (eigenvectors) in the R-matrix as there are variables, but most
will be unimportant.
To determine the importance of a
particular vector we look at the
magnitude of the associated
eigenvalue.
We can then apply criteria to determine
which factors to retain and which to
discard.
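As a sketch of this extraction step, the code below computes the eigenvalues of an invented 4×4 R-matrix and applies Kaiser's criterion (retain components with eigenvalues greater than 1, SPSS's default). NumPy's `eigvalsh` stands in for the extraction SPSS performs.

```python
import numpy as np

# Invented R-matrix with two clear clusters of correlated variables.
R = np.array([[1.0, 0.7, 0.1, 0.1],
              [0.7, 1.0, 0.1, 0.1],
              [0.1, 0.1, 1.0, 0.7],
              [0.1, 0.1, 0.7, 1.0]])

eigenvalues = np.linalg.eigvalsh(R)[::-1]   # largest first
retained = int((eigenvalues > 1).sum())     # Kaiser's criterion
# eigenvalues are 1.9, 1.5, 0.3, 0.3 -> retain 2 factors
print(np.round(eigenvalues, 3), "-> retain", retained, "factor(s)")
```

Note that the eigenvalues sum to the number of variables (here 4), since each standardized variable contributes one unit of variance.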
Factor Rotation
The first analysis to run was an orthogonal rotation
(Varimax).
However, we could have run the analysis using oblique
rotation.
Sometimes the results differ depending on the rotation method.
First, factor loadings less than 0.4 have not been displayed because we asked
for these loadings to be suppressed using the Suppress absolute values less than option.
If you didn't select this option, or didn't adjust the criterion value to 0.4, then your
output will differ.
Second, the variables are listed in the order of size of their factor loadings.
By default, SPSS orders the variables as they are in the data editor; however, we
asked for the output to be Sorted by size.
If this option was not selected your output will look different.
I have allowed the variable labels to be printed to aid interpretation.
The original logic behind suppressing loadings less than 0.4 was based on
Stevens' (1992; a stats guru) suggestion that this cut-off point is appropriate
for interpretative purposes (i.e. loadings greater than 0.4 represent substantive
values).
The next step is to look at the content of questions that load onto the same
factor to try to identify common themes.
If the mathematical factor produced by the analysis represents some real-world
construct then common themes among highly loading questions can help us
identify what the construct might be.
The questions that load highly on factor 1 all seem to relate to using computers or
SPSS.
Therefore, we might label this factor fear of computers.
The questions that load highly on factor 2 all seem to relate to different aspects of
statistics; therefore, we might label this factor fear of statistics.
The three questions that load highly on factor 3 all seem to relate to mathematics;
therefore, we might label this factor fear of mathematics.
Finally, the questions that load highly on factor 4 all contain some component of social
evaluation from friends; therefore, we might label this factor peer evaluation.
This analysis seems to reveal that the initial questionnaire, in reality, is composed
of four sub-scales: fear of computers, fear of statistics, fear of mathematics, and
fear of negative peer evaluation.
There are two possibilities here.
The first is that the SPSS Stress Test failed to measure what it set out to (namely
SPSS anxiety) but does measure some related constructs.
The second is that these four constructs are sub-components of SPSS anxiety;
however, the factor analysis does not indicate which of these possibilities is true.
It should be pretty clear that subject 9 scored highly on the first three factors,
and so this person is very anxious about statistics, computing and
maths, but less so about peer evaluation (factor 4).
Factor scores can be used in this way to assess the relative fear of
one person compared to another, or we could add the scores up to
obtain a single score for each subject (that we might assume
represents SPSS anxiety as a whole).
We can also use factor scores in regression when groups of
predictors correlate so highly that there is multicollinearity.