Assignment On Factor Analysis - Zainullah
Assignment # 01
Factor Analysis
Using SPSS Statistics 21
Submitted to:
Submitted by:
Mohammad Zainullah
Contents
Introduction
FA Equation
Sample Size
Data Screening
Dataset (wiscsem.sav)
Utilising SPSS
Variable View
Data View
Further Steps
Extraction
Rotation
The Output
Correlation Matrix
Communalities
Scree Plot
Factor Matrix
Revised Output
Scree Plot
Conclusion
References
Factor Analysis
Introduction
Factor analysis (FA) identifies "invisible" factors that represent the hidden organization, or
"organizing principle", of whatever is being measured by a set of observable measures or
scales (Navarro, 2006). In the illustrative example, “Verbal IQ” and “Performance IQ” are the
hidden factors, and the 11 subtest scores are the observable measures. Practitioners may use FA
for a variety of purposes, such as reducing a large number of items from a questionnaire or survey
instrument to a smaller number of components, uncovering latent dimensions underlying a data set,
or examining which items have the strongest association with a given factor (DiStefano, Zhu &
Mîndrilă, 2009). Once a researcher has used FA to identify the number of factors or components
underlying a data set, information about the factors can be used in subsequent analyses
(Gorsuch, 1983).
Factor analysis is thus a method of data reduction. Data reduction is achieved by seeking underlying
unobservable (latent) variables that are reflected in the observed (manifest) variables.
Several extraction methods are available, including principal axis factoring, maximum likelihood,
generalized least squares and unweighted least squares. Similarly, several types of rotation
can be applied after the initial extraction of factors, including orthogonal rotations (e.g., varimax
and equimax), which require the factors to be uncorrelated, and oblique rotations (e.g., promax),
which allow the factors to be correlated with one another. Different factor analysis methods may
lead to different results for the same data set.
In this assignment, an exploratory factor analysis has been conducted using the maximum likelihood
extraction method with Varimax rotation.
FA Equation:
FA is a variable-focused multivariate technique for dimensionality reduction, i.e., FA represents
the original variables $X_1, X_2, X_3, \dots, X_n$ in terms of a smaller number of underlying
factors $F_1, F_2, F_3, \dots, F_m$, where $m \ll n$. The underlying factors are latent (hidden,
unobservable) variables.
Unlike Principal Component Analysis (PCA), FA is based on a proper statistical model, in which the
$i$th original variable $X_i$ is given by
$$X_i - \mu_i = l_{i1}F_1 + l_{i2}F_2 + \dots + l_{im}F_m + \varepsilon_i$$
where
$\mu_i$ = mean of the $i$th variable
$l_{ij}$ = factor loading of the $i$th variable on the $j$th factor, i.e., the influence of $F_j$
on $X_i$ (so $l_{im}$ is the loading on the $m$th factor). It can be positive or negative
(values range from $-1$ to $+1$).
$\varepsilon_i$ = $i$th error term
$h_i^2 = \sum_{j=1}^{m} l_{ij}^2$ = communality of the $i$th variable, the part of its variance
contributed by the factors. It is like $R^2$: the higher the communality, the better the model
represents that variable.
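The link between the communality and $R^2$ can be made explicit with a short derivation. This is a sketch under the usual assumptions of the orthogonal factor model, which the text above does not state: the factors have unit variance and are uncorrelated with one another and with the error term. The symbol $\psi_i$ for the error (unique) variance is notation introduced here.

```latex
\begin{aligned}
\operatorname{Var}(X_i)
  &= \operatorname{Var}\Big(\sum_{j=1}^{m} l_{ij}F_j + \varepsilon_i\Big) \\
  &= \sum_{j=1}^{m} l_{ij}^{2}\,\operatorname{Var}(F_j) + \operatorname{Var}(\varepsilon_i) \\
  &= h_i^{2} + \psi_i
\end{aligned}
```

For a standardized variable $\operatorname{Var}(X_i) = 1$, so $h_i^2 = 1 - \psi_i$ is the share of that variable's variance reproduced by the common factors.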
Sample Size
Field (2005) reviews many suggestions about the sample size necessary for factor analysis and
concludes that it depends on many things. In general, over 300 cases is probably adequate, but
communalities after extraction should probably be above 0.5 (Field, 2005). Tabachnick and Fidell
(2001, p. 588) cite Comrey and Lee's (1992) guide to sample size: 50 cases is very poor, 100 is
poor, 200 is fair, 300 is good, 500 is very good, and 1000 or more is excellent. As a rule of thumb,
a minimum of 10 observations per variable is necessary to avoid computational difficulties.
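The two rules of thumb above can be written as a small check. This is only an illustrative sketch: the function names are hypothetical, and rating a sample by the highest band it reaches is an assumption about how to read the Comrey and Lee figures for values that fall between the quoted points.

```python
def comrey_lee_rating(n_cases: int) -> str:
    """Rate a sample size using the Comrey and Lee (1992) bands quoted above.

    Assumption: a sample is rated by the highest band whose threshold it reaches.
    """
    bands = [(1000, "excellent"), (500, "very good"), (300, "good"),
             (200, "fair"), (100, "poor"), (50, "very poor")]
    for threshold, label in bands:
        if n_cases >= threshold:
            return label
    return "below the lowest band"


def meets_ratio_rule(n_cases: int, n_variables: int, ratio: int = 10) -> bool:
    """Check the rule of thumb of at least `ratio` observations per variable."""
    return n_cases >= ratio * n_variables


n_variables = 11                      # the 11 subtests analysed in this report
min_cases = 10 * n_variables          # smallest sample allowed by the 10:1 rule
print(min_cases, comrey_lee_rating(min_cases))
```

With 11 variables the 10:1 rule asks for at least 110 cases, which still falls in the "poor" band of the Comrey and Lee guide, so the two rules do not always agree.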
Data Screening
It is important to look first at the correlations between variables. If the test questions
measure the same underlying dimension(s), then the questions are expected to correlate with
each other within reasonable limits. Variables represent questions, so any variable that correlates
with none, or with very few, of the other variables should be excluded before conducting the factor
analysis. The correlations between variables can be inspected in a correlation matrix of all the
variables. The opposite problem arises when variables correlate too highly, so extreme
multicollinearity and perfect correlation have to be avoided.
Dataset (wiscsem.sav)
The following example demonstrates factor analysis (FA) of 11 subtests of the Wechsler Intelligence
Scale for Children (WISC). The model assesses the relationship between the indicators of IQ and the
two potential underlying constructs, or factors, representing IQ, i.e., Verbal IQ and
Performance IQ.
The related Wechsler Adult Intelligence Scale (WAIS) is a test designed to measure intelligence in
adults and older adolescents. It is currently in its fourth edition (WAIS-IV), released in 2008 by
Pearson. A revised form of the test, the WAIS-R, was released in 1981 and consisted of six verbal
and five performance subtests. The verbal subtests are “Information, Comprehension, Arithmetic,
Digit Span, Similarities, and Vocabulary” and the performance subtests are “Picture Arrangement,
Picture Completion, Block Design, Object Assembly, and Digit Symbol”. The corresponding 11 subtest
scores are used as variables in the factor analysis to follow (in the SPSS output, “Paragraph
Arrangement” and “Coding” appear in place of Picture Arrangement and Digit Symbol). The question
was whether “Verbal” and “Non-verbal” factors can be distinctly produced using factor analysis,
with the appropriate subtests grouping into each category (Verbal IQ, Performance IQ). In the
illustrative factor analysis, a “verbal IQ” and a “performance IQ” were obtained as the two finally
extracted factors.
The dataset “wiscsem.sav”, containing subscale scores for the Wechsler Intelligence Scale for
Children, was downloaded from the following website:
http://psych.colorado.edu/~carey/Courses/PSYC7291/ClassDataSets.htm
Utilising SPSS
After the dataset file “wiscsem.sav” was placed in the folder C:\Documents and
Settings\Administrator\Desktop\FA June 23 Important\WAIS-R, it was opened in the SPSS 21
environment. The Variable View, the Data View and the subsequent steps are shown below:
Variable View
Data View
Further Steps
To conduct the FA in SPSS 21, I opened the “Analyze” menu and chose “Data Reduction”, since FA is
intended to reduce the complexity of a set of data, and then picked “Factor”, i.e.,
Analyze > Data Reduction > Factor, as shown in the figure given below:
An “extraction method” and a “rotation method” then have to be selected. The “Extraction” button
is used to specify the extraction method.
Extraction Selection:
In this dialog box, the box labeled “Unrotated factor solution” was left at its default setting,
while the “Scree plot” checkbox was ticked to obtain a scree diagram, which is one visual way of
deciding how many factors to extract.
In the “Extract” section, the default setting is to use the Kaiser stopping criterion (i.e., retain
all factors with eigenvalues greater than 1) to decide how many factors to extract. A stricter
cut-off can be applied by entering a higher eigenvalue in the specified field. Alternatively, if the
number of factors to extract is already known, that number can be entered in the box.
After clicking “Continue”, the main dialog box is in focus again for the rotation selection.
Rotation Selection:
Clicking the “Rotation” button opens a dialog for choosing a “rotation method” for the factor
analysis. A rotation method makes the factors as distinct from each other as possible, and thus
helps to interpret the factors by placing each variable primarily on one of them. I had to decide
whether I wanted an “orthogonal” solution (e.g., Quartimax, Equimax, Varimax), in which the factors
are not correlated with each other, or an “oblique” solution (e.g., Direct Oblimin, Promax), in
which the factors may be correlated with one another. The Varimax method was chosen for factor
rotation.
The “Rotated solution” checkbox was also ticked to obtain the factor loadings for each individual
variable in the dataset, which make it possible to give names to the different factors.
Clicking “Continue” in the sub-dialog and then “OK” in the main dialog produces the output:
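To give a feel for what an orthogonal rotation does, the sketch below implements a commonly used SVD-based varimax iteration with NumPy. It is not SPSS's implementation: it omits the Kaiser normalization that SPSS applies, and the small `unrotated` loading matrix is made-up illustrative data, not output from this analysis.

```python
import numpy as np


def varimax(loadings, max_iter=100, tol=1e-8):
    """Rotate a (variables x factors) loading matrix towards the varimax criterion.

    Returns the rotated loadings and the orthogonal rotation matrix.
    Kaiser normalization is NOT applied in this sketch.
    """
    L = np.asarray(loadings, dtype=float)
    p, k = L.shape
    R = np.eye(k)                      # start from "no rotation"
    previous = 0.0
    for _ in range(max_iter):
        rotated = L @ R
        # Direction in which the varimax criterion increases
        target = rotated**3 - rotated * (np.sum(rotated**2, axis=0) / p)
        u, s, vt = np.linalg.svd(L.T @ target)
        R = u @ vt                     # orthogonal matrix built from the SVD
        if s.sum() < previous * (1 + tol):
            break                      # criterion no longer improving
        previous = s.sum()
    return L @ R, R


# Hypothetical unrotated loadings for four variables on two factors
unrotated = np.array([[0.70, 0.30],
                      [0.60, 0.40],
                      [0.50, -0.50],
                      [0.40, -0.60]])
rotated, R = varimax(unrotated)
print(np.round(rotated, 3))
```

Because the rotation matrix is orthogonal, each variable's communality (row sum of squared loadings) is the same before and after rotation; only the way the variance is shared among the factors changes.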
SPSS Output:
Correlation Matrix
(columns are numbered in the same order as the rows)
                                1      2      3      4      5      6      7      8      9     10     11
 1. Information             1.000   .467   .494   .513   .625   .345   .230   .202   .229   .185   .007
 2. Comprehension            .467  1.000   .392   .510   .531   .236   .407   .187   .369   .322   .061
 3. Arithmetic               .494   .392  1.000   .369   .387   .269   .155   .227   .272   .043   .090
 4. Similarities             .513   .510   .369  1.000   .538   .260   .369   .298   .261   .269  -.041
 5. Vocabulary               .625   .531   .387   .538  1.000   .294   .285   .132   .297   .185   .100
 6. Digit Span               .345   .236   .269   .260   .294  1.000   .075   .148   .073   .035   .173
 7. Picture Completion       .230   .407   .155   .369   .285   .075  1.000   .249   .382   .363  -.072
 8. Paragraph Arrangement    .202   .187   .227   .298   .132   .148   .249  1.000   .351   .253   .038
 9. Block Design             .229   .369   .272   .261   .297   .073   .382   .351  1.000   .399   .107
10. Object Assembly          .185   .322   .043   .269   .185   .035   .363   .253   .399  1.000   .053
11. Coding                   .007   .061   .090  -.041   .100   .173  -.072   .038   .107   .053  1.000
The correlations between the variables are examined first, using the correlation matrix of all
variables shown above. Collinearity is a problem if the correlation between two variables is 0.9
or higher, and extreme multicollinearity and perfect correlation are to be avoided. Inspection of
the correlation matrix shows no such problems: the largest correlation between two different
variables is .625 (Information with Vocabulary).
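The screening described above can be repeated outside SPSS with plain Python. The sketch below uses the correlation matrix copied from the output; the 0.9 limit comes from the text, while the 0.3 cut-off for a "weakly correlated" variable and the function name `screen` are assumptions made for illustration.

```python
names = ["Information", "Comprehension", "Arithmetic", "Similarities",
         "Vocabulary", "Digit Span", "Picture Completion",
         "Paragraph Arrangement", "Block Design", "Object Assembly", "Coding"]

# Correlation matrix copied from the SPSS output above
R = [
    [1.000, 0.467, 0.494, 0.513, 0.625, 0.345, 0.230, 0.202, 0.229, 0.185, 0.007],
    [0.467, 1.000, 0.392, 0.510, 0.531, 0.236, 0.407, 0.187, 0.369, 0.322, 0.061],
    [0.494, 0.392, 1.000, 0.369, 0.387, 0.269, 0.155, 0.227, 0.272, 0.043, 0.090],
    [0.513, 0.510, 0.369, 1.000, 0.538, 0.260, 0.369, 0.298, 0.261, 0.269, -0.041],
    [0.625, 0.531, 0.387, 0.538, 1.000, 0.294, 0.285, 0.132, 0.297, 0.185, 0.100],
    [0.345, 0.236, 0.269, 0.260, 0.294, 1.000, 0.075, 0.148, 0.073, 0.035, 0.173],
    [0.230, 0.407, 0.155, 0.369, 0.285, 0.075, 1.000, 0.249, 0.382, 0.363, -0.072],
    [0.202, 0.187, 0.227, 0.298, 0.132, 0.148, 0.249, 1.000, 0.351, 0.253, 0.038],
    [0.229, 0.369, 0.272, 0.261, 0.297, 0.073, 0.382, 0.351, 1.000, 0.399, 0.107],
    [0.185, 0.322, 0.043, 0.269, 0.185, 0.035, 0.363, 0.253, 0.399, 1.000, 0.053],
    [0.007, 0.061, 0.090, -0.041, 0.100, 0.173, -0.072, 0.038, 0.107, 0.053, 1.000],
]


def screen(corr, labels, high=0.9, low=0.3):
    """Return (pairs correlated at or above `high`, variables whose strongest
    correlation with any other variable is below `low`)."""
    too_high, weak = [], []
    for i, row in enumerate(corr):
        others = [abs(r) for j, r in enumerate(row) if j != i]
        if max(others) < low:
            weak.append(labels[i])
        for j in range(i + 1, len(row)):
            if abs(row[j]) >= high:
                too_high.append((labels[i], labels[j]))
    return too_high, weak


too_high, weak = screen(R, names)
largest = max(abs(R[i][j]) for i in range(len(R)) for j in range(i + 1, len(R)))
print("largest off-diagonal correlation:", largest)
print("pairs at or above 0.9:", too_high)
print("weakly correlated variables:", weak)
```

With these cut-offs no pair comes near the 0.9 limit, and only “Coding” fails to correlate at 0.3 or more with any other subtest, which fits its weak loadings later in the report.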
Bartlett's Test of Sphericity: It tests the null hypothesis that the correlation matrix is an
identity matrix. The reported significance is .000, i.e., p < .001, so the null hypothesis that the
correlation matrix is an identity matrix is rejected and factor analysis can be carried out.
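For reference, the usual form of Bartlett's test statistic is given below. The notation is introduced here and is not taken from the SPSS output: $n$ is the number of cases, $p$ the number of variables and $|R|$ the determinant of the correlation matrix.

```latex
\chi^{2} = -\left[(n - 1) - \frac{2p + 5}{6}\right]\ln\lvert R\rvert ,
\qquad
df = \frac{p(p-1)}{2}
```

With the $p = 11$ subtests used here, the test has $11 \times 10 / 2 = 55$ degrees of freedom. When $R$ is close to an identity matrix, $|R|$ is close to 1 and the statistic is close to zero, so a large value is evidence against the null hypothesis.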
Communalities
Extraction: The values in this column indicate the proportion of each variable's variance that can
be explained by the retained factors (F1, F2, F3). Variables with high values are well represented in
the common factor space, while variables with low values are not well represented. They are the
reproduced variances from the factors that I have extracted. I can find these values on the diagonal
of the reproduced correlation matrix.
The communality for the $i$th variable is computed by taking the sum of the squared loadings for
that variable. This can be expressed as:
$$h_i^2 = \sum_{j=1}^{m} l_{ij}^2$$
For example, to compute the communality for the original variable “Information”, I squared
the factor loadings for “Information” (from the Factor Matrix) and added them:
$$h_{\text{Information}}^2 = (0.779)^2 + (0.156)^2 + (0.073)^2 \approx 0.637$$
These values act like multiple $R^2$ values for regression models predicting the variables of
interest from the 3 factors. In other words, if I regress the original variable “Information” on
the three common factors, I obtain $R^2 = 0.637$, indicating that about 64% of the variation in
“Information” is explained by the factor model. The results in the table given above suggest that
the factor analysis does better at explaining the variation in the variables “Information,
Comprehension, Similarities, and Vocabulary”.
So one assessment of how well this model is doing can be obtained from the communalities: values
closer to one indicate that the model explains most of the variation in those variables. In this
case, the model does better for some variables than it does for others. The model explains “Block
Design” best, and it also does well for “Information, Comprehension, Similarities and Vocabulary”.
However, for variables such as “Digit Span” and “Paragraph Arrangement”, the model does not do a
good job, explaining only about one quarter of the variation.
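The communalities discussed above can be recomputed by hand from the three-factor loadings. The sketch below copies the loadings from the Factor Matrix in this report; because those loadings are rounded to three decimals, the results may differ slightly in the last digit from the SPSS Communalities table.

```python
# Loadings on Factors 1-3, copied from the Factor Matrix in this report
loadings = {
    "Information":           (0.779, 0.156, 0.073),
    "Comprehension":         (0.551, 0.449, -0.032),
    "Arithmetic":            (0.556, 0.140, 0.269),
    "Similarities":          (0.620, 0.366, -0.160),
    "Vocabulary":            (0.721, 0.252, 0.035),
    "Digit Span":            (0.431, -0.003, 0.134),
    "Picture Completion":    (0.202, 0.605, -0.194),
    "Paragraph Arrangement": (0.154, 0.392, 0.135),
    "Block Design":          (0.118, 0.713, 0.379),
    "Object Assembly":       (0.084, 0.573, -0.050),
    "Coding":                (0.054, 0.005, 0.290),
}

# Communality = sum of squared loadings across the retained factors
communalities = {name: sum(l * l for l in row) for name, row in loadings.items()}

for name, h2 in sorted(communalities.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<22} {h2:.3f}")

best = max(communalities, key=communalities.get)
```

The value for “Information” comes out at 0.637, matching the $R^2$ quoted above, and “Block Design” has the highest communality of the eleven variables.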
Factor: The initial number of factors is the same as the number of variables (11) used in the factor
analysis. However, not all 11 factors will be retained; only the important factors, identified with
the help of Kaiser’s rule or the scree plot, will be extracted and retained.
Initial Eigenvalues: Eigenvalues are the variances of the factors. Because the variables are
standardized, each variable has a variance of 1, and the total variance is equal to the number of
variables used in the analysis, which is 11.
Total: This column contains the eigenvalues. The first factor always accounts for the most
variance and hence has the highest eigenvalue; each successive factor accounts for less and less
variance.
% of Variance: This column contains the percent of total variance accounted for by each factor. So
34.806% of the total variance is accounted for by Factor 1, 13.109% by Factor 2, and 10.147% by
Factor 3.
Cumulative %: This column contains the cumulative percentage of variance accounted for by the
current and all preceding factors. For example, the third row shows a value of 58.062. This means
that the first three factors together account for 58.062% of the total variance.
Extraction Sums of Squared Loadings: This section has one row for each retained factor. The values
are based on the common variance and not on the total variance.
Rotation Sums of Squared Loadings: This section shows the distribution of the variance after the
Varimax rotation. Varimax rotation maximizes the variance of the squared loadings within each
factor, so the total amount of variance accounted for is redistributed over the three extracted
factors.
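The relationship between the “Total”, “% of Variance” and “Cumulative %” columns can be checked with a few lines of arithmetic. The sketch below starts from the three percentages quoted above and works back to the eigenvalues; the variable names are illustrative only.

```python
n_variables = 11  # total variance of 11 standardized variables

# "% of Variance" for the first three factors, as reported above
pct_of_variance = [34.806, 13.109, 10.147]

# Eigenvalue = share of the total variance (11) carried by the factor
eigenvalues = [pct * n_variables / 100 for pct in pct_of_variance]  # about 3.829, 1.442, 1.116

# Cumulative % = running total of the percentages
cumulative = []
running = 0.0
for pct in pct_of_variance:
    running += pct
    cumulative.append(round(running, 3))

# Kaiser's rule keeps factors whose eigenvalue exceeds 1
retained = [ev for ev in eigenvalues if ev > 1]

print([round(ev, 3) for ev in eigenvalues])
print(cumulative)
print(len(retained), "factors retained by Kaiser's rule")
```

All three eigenvalues implied by the reported percentages are above 1, which is why Kaiser's rule keeps three factors, and the running total reproduces the 58.062% shown in the third row.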
Scree Plot:
The scree plot is the graph of the eigenvalues against the factor number. In the scree plot, the
slope of the curve seems to level out after two factors, whereas Kaiser’s rule (eigenvalues > 1)
points to 3 factors. Beyond the second factor the line is almost flat, meaning that each successive
factor accounts for a smaller and smaller amount of the total variance, so retaining two factors is
recommended (Cattell, 1966).
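The values behind a scree plot are the eigenvalues of the correlation matrix, so they can be approximated directly from the matrix printed earlier in this report. The sketch below assumes NumPy is available and draws a rough text version of the plot. Because the correlations are rounded to three decimals, the eigenvalues will only approximate those in the SPSS output.

```python
import numpy as np

# Correlation matrix copied from the SPSS output in this report
R = np.array([
    [1.000, 0.467, 0.494, 0.513, 0.625, 0.345, 0.230, 0.202, 0.229, 0.185, 0.007],
    [0.467, 1.000, 0.392, 0.510, 0.531, 0.236, 0.407, 0.187, 0.369, 0.322, 0.061],
    [0.494, 0.392, 1.000, 0.369, 0.387, 0.269, 0.155, 0.227, 0.272, 0.043, 0.090],
    [0.513, 0.510, 0.369, 1.000, 0.538, 0.260, 0.369, 0.298, 0.261, 0.269, -0.041],
    [0.625, 0.531, 0.387, 0.538, 1.000, 0.294, 0.285, 0.132, 0.297, 0.185, 0.100],
    [0.345, 0.236, 0.269, 0.260, 0.294, 1.000, 0.075, 0.148, 0.073, 0.035, 0.173],
    [0.230, 0.407, 0.155, 0.369, 0.285, 0.075, 1.000, 0.249, 0.382, 0.363, -0.072],
    [0.202, 0.187, 0.227, 0.298, 0.132, 0.148, 0.249, 1.000, 0.351, 0.253, 0.038],
    [0.229, 0.369, 0.272, 0.261, 0.297, 0.073, 0.382, 0.351, 1.000, 0.399, 0.107],
    [0.185, 0.322, 0.043, 0.269, 0.185, 0.035, 0.363, 0.253, 0.399, 1.000, 0.053],
    [0.007, 0.061, 0.090, -0.041, 0.100, 0.173, -0.072, 0.038, 0.107, 0.053, 1.000],
])

# eigvalsh returns eigenvalues of a symmetric matrix in ascending order
eigenvalues = np.linalg.eigvalsh(R)[::-1]

# Kaiser's rule: count eigenvalues greater than 1
n_kaiser = int(np.sum(eigenvalues > 1.0))

# Text "scree plot": one bar per factor, ten characters per unit of eigenvalue
for k, ev in enumerate(eigenvalues, start=1):
    print(f"Factor {k:2d}  {ev:6.3f}  {'#' * int(round(ev * 10))}")
print("Eigenvalues above 1:", n_kaiser)
```

Reading the bars from top to bottom gives the same picture as the scree plot: the point where the bars stop shrinking quickly is the "elbow" that Cattell's scree test looks for.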
Factor Matrix
                           Factor
                           1       2       3
Information              .779    .156    .073
Comprehension            .551    .449   -.032
Arithmetic               .556    .140    .269
Similarities             .620    .366   -.160
Vocabulary               .721    .252    .035
Digit Span               .431   -.003    .134
Picture Completion       .202    .605   -.194
Paragraph Arrangement    .154    .392    .135
Block Design             .118    .713    .379
Object Assembly          .084    .573   -.050
Coding                   .054    .005    .290
Extraction Method: Maximum Likelihood.
Rotation Method: Varimax with Kaiser Normalization.
Rotation converged in 5 iterations.
The Rotated Factor Matrix shows the factor loadings for each variable, i.e., the correlations
between the variables and the factors, which indicate how the variables are weighted on each
factor. It highlights the factor on which each variable loads most strongly (high positive
loadings). Based on these factor loadings, the three factors are identified as follows:
1. The first six variables, “Information, Comprehension, Arithmetic, Similarities, Vocabulary and
Digit Span”, have high positive loadings on Factor 1, which can be termed “Verbal IQ”.
2. The variables “Picture Completion, Paragraph Arrangement, Block Design and Object Assembly”
have high positive loadings on Factor 2, which can be termed “Performance IQ”.
3. The variable “Coding” loads positively on Factor 3. Factor 3 is probably “Freedom from
Distraction”, because the tasks involved are concentration-intensive. However, the loadings of
Coding (.290) and Digit Span (.134) on Factor 3 are below 0.3, which is too weak to be
meaningful, so a 2-factor solution is preferred and the factor analysis is re-run with a
pre-set 2-factor solution.
Revised Output
It was important to know whether “verbal” tasks can be differentiated from “nonverbal” tasks, i.e.,
Verbal IQ from Performance IQ. Kaiser’s rule (eigenvalues > 1) gave a 3-factor solution, but the
weak positive loadings of “Digit Span” and “Coding” on Factor 3 created some confusion, so SPSS
was forced manually to extract two factors, F1 and F2.
To obtain the pre-set number of factors, I went back to the main dialog and then to the
“Extraction” sub-dialog. Under “Extract”, I entered “Number of factors = 2”, clicked “Continue”
and then “OK” to run the analysis.
The revised output has two extracted factors, which together account for 37.458% of the total
variability in the variables.
Scree Plot:
Rotated Factor Matrix
                           Factor
                           1       2
Information              .783    .172
Comprehension            .534    .471
Arithmetic               .560    .153
Similarities             .584    .386
Vocabulary               .727    .255
Digit Span               .430    .022
Picture Completion       .176    .601
Paragraph Arrangement    .146    .407
Block Design             .168    .614
Object Assembly          .056    .610
Coding                   .069    .020
Extraction Method: Maximum Likelihood.
Rotation Method: Varimax with Kaiser Normalization.
Rotation converged in 3 iterations.
In the Rotated Factor Matrix, I have the revised factor loadings (the correlations between the
variables and the factors, which show how the variables are weighted on each factor). The
variable “Coding” does not load strongly on either of the extracted factors, but the variables
belonging to “Verbal” and “Performance” IQ have relatively high positive loadings on their
respective factors, so the two factors have emerged more clearly. These factors can be used as
variables for further analysis.
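Both the 37.458% figure and the grouping of the subtests can be checked from the two-factor loadings. The sketch below copies the loadings from the revised Rotated Factor Matrix in this report; the 0.30 cut-off for a "meaningful" loading is an assumption borrowed from the earlier discussion of Factor 3, and the grouping rule (assign each variable to the factor with its largest absolute loading) is an illustrative convention, not something SPSS does automatically.

```python
# Loadings on Factors 1-2, copied from the revised Rotated Factor Matrix
loadings = {
    "Information":           (0.783, 0.172),
    "Comprehension":         (0.534, 0.471),
    "Arithmetic":            (0.560, 0.153),
    "Similarities":          (0.584, 0.386),
    "Vocabulary":            (0.727, 0.255),
    "Digit Span":            (0.430, 0.022),
    "Picture Completion":    (0.176, 0.601),
    "Paragraph Arrangement": (0.146, 0.407),
    "Block Design":          (0.168, 0.614),
    "Object Assembly":       (0.056, 0.610),
    "Coding":                (0.069, 0.020),
}
n_variables = len(loadings)

# Sum of squared loadings per factor, then as a percentage of total variance (11)
ss = [sum(row[f] ** 2 for row in loadings.values()) for f in range(2)]
pct = [100 * s / n_variables for s in ss]
total_pct = sum(pct)
print(f"Factor 1: {pct[0]:.3f}%  Factor 2: {pct[1]:.3f}%  Total: {total_pct:.3f}%")

# Assign each variable to its strongest factor if that loading reaches the cut-off
cutoff = 0.30  # assumed threshold for a meaningful loading
groups = {"Factor 1": [], "Factor 2": [], "unassigned": []}
for name, row in loadings.items():
    strongest = max(range(2), key=lambda f: abs(row[f]))
    if abs(row[strongest]) >= cutoff:
        groups[f"Factor {strongest + 1}"].append(name)
    else:
        groups["unassigned"].append(name)

for label, members in groups.items():
    print(label, "->", ", ".join(members))
```

The total reproduces the 37.458% reported for the two-factor solution. The six verbal subtests fall under Factor 1, the four performance subtests under Factor 2, and “Coding” is left unassigned.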
Factor scores FAC1_1 and FAC2_1 are composite variables which provide information about an
individual’s placement on the factor(s). Once a researcher has used FA and has identified the number
of factors or components underlying a data set, he or she may wish to use the information about the
factors in subsequent analyses (Gorsuch, 1983). To use FA information in follow-up studies, the
researcher must create scores to represent each individual’s placement on the factor(s) identified
from the FA. These factor scores may then be used to investigate the research questions of interest
(DiStefano, Zhu & Mîndrilă, 2009).
In the current factor analysis, the factor scores thus indicate where each case stands on each of
the "hidden" factors (F1 and F2) associated with the "observable" variables used in the analysis.
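The report does not say which scoring method produced FAC1_1 and FAC2_1. Assuming the "Regression" option in the SPSS Scores dialog was used, the scores are weighted sums of the standardized variables, which can be sketched as follows. The notation is introduced here: $Z$ is the cases-by-variables matrix of standardized scores, $R$ the correlation matrix and $\Lambda$ the matrix of rotated loadings.

```latex
B = R^{-1}\Lambda ,
\qquad
\hat{F} = Z\,B
```

Each column of $B$ holds the factor score coefficients for one factor, and each row of $\hat{F}$ holds one case's estimated scores on F1 and F2.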
Conclusion:
I have eleven (11) observable variables, from which two hidden factors, F1 and F2, were identified
by conducting factor analysis. The factor loadings of the first six variables on hidden Factor 1 are
0.783, 0.534, 0.560, 0.584, 0.727 and 0.430. These factor loadings indicate that observable
measures 1 through 6 can be used to "describe" hidden Factor 1; in other words, Factor 1 has
characteristics very similar to what observable measures 1 through 6 measure. Observable variables
7 through 11 are not useful for describing hidden Factor 1 because their factor loadings on it are
too small (all well below .50).
Similarly, the factor loadings of the next four variables on hidden Factor 2 are 0.601, 0.407,
0.614 and 0.610. These factor loadings indicate that observable variables 7 through 10 can be used
to "describe" hidden Factor 2; in other words, Factor 2 has characteristics very similar to what
observable variables 7 through 10 measure. Observable variables 1 through 6 and 11 are not useful
for describing hidden Factor 2 because their factor loadings on Factor 2 are too small (all below
0.50, and lower than their loadings on Factor 1).
Factor analysis has thus identified the "invisible" factors F1 and F2, which represent the hidden
organization, or "organizing principle", of Verbal IQ and Performance IQ behind a number of
observable measures or scales (Navarro, 2006).
References:
Cattell, R. B. (1966). The scree test for the number of factors. Multivariate Behavioral Research,
1(2), 245-276.
DiStefano, C., Zhu, M., & Mîndrilă, D. (2009). Understanding and using factor scores:
Considerations for the applied researcher. Practical Assessment, Research & Evaluation, 14(20),
1-11.
Field, A. (2005). Discovering statistics using SPSS (2nd ed.). London: Sage.
Gorsuch, R. (1983). Factor analysis. Hillsdale, NJ: L. Erlbaum Associates.
Hutcheson, G. D., & Sofroniou, N. (1999). The multivariate social scientist: Introductory statistics
using generalized linear models. Sage.
Tabachnick, B. G., & Fidell, L. S. (2001). Using multivariate statistics (4th ed.). Needham
Heights, MA: Allyn & Bacon.