STEP 3
Data Screening
Watkins, Marley. A Step-By-Step Guide to Exploratory Factor Analysis with R and RStudio, Taylor & Francis Group, 2020.
ProQuest Ebook Central, http://ebookcentral.proquest.com/lib/ed/detail.action?docID=6413870.
Created from ed on 2024-08-08 21:26:53.
outlier, and the x4–y4 data show x remaining constant except for one (off the
chart) outlier.
Assumptions
All multivariate statistics are based on assumptions that will bias results if they are
violated. The assumptions of exploratory factor analysis (EFA) are mostly conceptual:
It is assumed that some underlying structure exists, that the relationship
between measured variables and the underlying common factors is linear, and
that the linear coefficients are invariant across participants (Fabrigar & Wegener,
2012; Hair et al., 2019).
Copyright © 2020. Taylor & Francis Group. All rights reserved.
However, EFA is based on Pearson product-moment correlations that also
rely on statistical assumptions. Specifically, it is assumed that a linear relationship
exists between the variables and that there is an underlying normal distribution.
To meet these assumptions, variables must be measured on a continuous scale
(Bandalos, 2018; Puth et al., 2015; Walsh, 1996). Violation of the assumptions
that underlie the Pearson product-moment correlation may bias EFA results. As
suggested by Carroll (1961), “there is no particular point in making a factor
analysis of a matrix of raw correlation coefficients when these coefficients represent
manifest relationships which mask and distort latent relationships” (p. 356).
More broadly, anything that influences the correlation matrix can potentially
affect EFA results (Carroll, 1985; Onwuegbuzie & Daniel, 2002). As noted by
Warner (2007), “because the input to factor analysis is a matrix of correlations,
any problems that make Pearson r misleading as a description of the strength of
the relationship between pairs of variables will also lead to problems in factor
analysis” (p. 765). Accordingly, the data must be carefully screened before
conducting an EFA to ensure that some untoward influence has not biased the
results (Flora et al., 2012; Goodwin & Leech, 2006; Hair et al., 2019; Walsh,
1996). Potential influences include restricted score range, linearity, data distributions,
outliers, and missing data. “Consideration and resolution of these issues
before the main analysis are fundamental to an honest analysis of the data”
(Tabachnick & Fidell, 2019, p. 52).
Linearity
Pearson coefficients are measures of the linear relationship between two variables.
That is, their relationship is best approximated by a straight line. Curvilinear or
nonlinear relationships will not be accurately estimated by Pearson coefficients.
Although subjective, visual inspection of scatterplots can be used to assess linearity.
The built-in graphics package offers a simple scatterplot for multiple variables. After
loading the iq data, a scatterplot for two measured variables can be created.
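A minimal sketch of the scatterplot step, using only base R graphics. Because the iq data file is not reproduced here, the example simulates a stand-in data frame. The variable names vocab1, designs1, and matrix1 echo names used in the text, but the object sim_iq, the sample size, and all values are assumptions made for illustration.

```r
# Simulated stand-in for the iq data (values are NOT the book's data)
set.seed(1)
n <- 150
g <- rnorm(n)                                   # shared general factor
make_score <- function() 100 + 15 * (0.7 * g + sqrt(1 - 0.49) * rnorm(n))
sim_iq <- data.frame(
  vocab1   = make_score(),
  designs1 = make_score(),
  matrix1  = make_score()
)

# Scatterplot for two measured variables
plot(sim_iq$vocab1, sim_iq$designs1,
     xlab = "vocab1", ylab = "designs1",
     main = "Simulated scores")

# Scatterplot matrix for every pair of variables in the data frame
pairs(sim_iq)
```

A roughly elliptical cloud of points with no bend suggests a linear relationship; a curved band would signal that Pearson r understates the association.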
After reviewing the scatterplots, it appears that the iq variables are linearly
related.
Data Distributions
Pearson correlation coefficients theoretically range from -1.00 to +1.00.
However, that is only possible when the two variables have exactly the same
distribution. If, for example, one variable is normally distributed and the other
distribution is skewed, the maximum value of the Pearson correlation is less than
1.00. The more the distribution shapes differ, the greater the restriction of r.
Consequently, it is important to understand the distributional characteristics of
the measured variables to be included in an EFA. For example, it has long been
known that dichotomous items that are skewed in opposite directions may
produce what are known as difficulty factors when submitted to EFA (Bernstein
& Teng, 1989; Greer et al., 2006). That is, a factor may appear that is an artifact of
variable distributions rather than the effect of their content. Using procedures
from the psych package, statistics can be computed to describe variable
distributions.
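To make the two statistics concrete, the following sketch computes moment-based skew and excess kurtosis in base R. The helper names skew_stat and kurt_stat and the simulated data are assumptions for illustration. The describe function in the psych package reports comparable statistics, although its defaults may apply a slightly different small-sample adjustment.

```r
# Moment-based skew and excess kurtosis (both are 0 for a normal distribution)
skew_stat <- function(x) {
  x <- x[!is.na(x)]
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}
kurt_stat <- function(x) {
  x <- x[!is.na(x)]
  d <- x - mean(x)
  mean(d^4) / mean(d^2)^2 - 3
}

set.seed(2)
sim <- data.frame(
  normal_like = rnorm(500, mean = 100, sd = 15),
  right_skew  = rexp(500)            # exponential: strongly right-skewed
)
round(sapply(sim, function(v) c(skew = skew_stat(v), kurtosis = kurt_stat(v))), 2)
```

The simulated normal variable should give values near zero on both statistics, whereas the exponential variable should show clearly positive skew.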
Skew > 2.0 or kurtosis > 7.0 would indicate severe univariate nonnormality
(Curran et al., 1996). These univariate statistics seem to indicate that all eight
measured variables are relatively normally distributed (skew < 1.0 and kurtosis <
2.0) so there should not be much concern about correlations being restricted due
to variable distributions. Skew (departures from symmetry) and kurtosis (distributions
with heavier or lighter tails and higher or flatter peaks) of all variables
seem to be close to normal (normal distributions have expected values of zero).
The histograms displayed in the scatterplot matrix support that assumption.
box is the third quartile, the “whiskers” show the range of the data (excluding
outliers), and the circles identify outliers (defined as any value more than 1.5 times
the interquartile range beyond the box).
A group of measured variables might exhibit univariate normality and yet be
multivariate nonnormal. That is, the joint distribution of all the variables might
be nonnormal. The psych package offers an implementation of Mardia's (1970) multivariate
tests and an accompanying Q–Q plot. The mardia output is
somewhat confusing because its descriptions are based on the notation found in
the 1970 paper by Mardia. For this case, we would report multivariate skew =
6.15 (p = .015) and multivariate kurtosis = 83.02 (p = .14).
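The quantities behind those two numbers can be sketched in base R. This is an illustrative reimplementation of Mardia's coefficients under stated assumptions, not the psych package's code: the function name mardia_sketch and the simulated data are hypothetical, and the p-values use the standard large-sample approximations (chi-square for skew, normal for kurtosis).

```r
# Mardia's (1970) multivariate skew (b1p) and kurtosis (b2p), base R only
mardia_sketch <- function(X) {
  X  <- as.matrix(X[complete.cases(X), , drop = FALSE])
  n  <- nrow(X); p <- ncol(X)
  Xc <- scale(X, center = TRUE, scale = FALSE)    # center each column
  S  <- crossprod(Xc) / n                         # ML covariance (divisor n)
  D  <- Xc %*% solve(S) %*% t(Xc)                 # n x n generalized cross-products
  b1p <- sum(D^3) / n^2                           # multivariate skew
  b2p <- sum(diag(D)^2) / n                       # multivariate kurtosis
  skew_chi2 <- n * b1p / 6
  skew_df   <- p * (p + 1) * (p + 2) / 6
  kurt_z    <- (b2p - p * (p + 2)) / sqrt(8 * p * (p + 2) / n)
  list(b1p = b1p,
       p_skew = pchisq(skew_chi2, skew_df, lower.tail = FALSE),
       b2p = b2p,
       p_kurt = 2 * pnorm(abs(kurt_z), lower.tail = FALSE))
}

set.seed(3)
X <- matrix(rnorm(150 * 4), ncol = 4)   # simulated multivariate normal data
str(mardia_sketch(X))
```

Under multivariate normality the expected kurtosis is p(p + 2), where p is the number of variables. With eight variables that benchmark is 8 × 10 = 80, which is the reference point for judging the value of 83.02 reported above.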
The diagonal line in the Q–Q plot, displayed in Figure 8.5, represents a
theoretical normal distribution whereas the circles represent scores on the measured
variables in the iq data set. A linear trend in the measured variables visually
illustrates that it is plausible that the iq data came from a normal distribution.
As with many other procedures in R, the same results might be obtained from
a different package. For example, multivariate normality tests are also available in
the QuantPsyc package.
multivariate kurtosis values > 3.0 to 5.0 might bias factor analysis results (Bentler, 2005;
Finney & DiStefano, 2013; Mueller & Hancock, 2019). Spearman or other types of
correlation coefficients might be more accurate in those instances (Bishara & Hittner,
2015; Onwuegbuzie & Daniel, 2002; Puth et al., 2015).
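A small simulation may help show why a rank-based coefficient can be preferable when distribution shapes differ. In this sketch the two variables are perfectly monotonically related, yet one is symmetric and the other heavily skewed. The data and object names are assumptions for illustration only.

```r
set.seed(4)
x <- rnorm(1000)        # symmetric, roughly normal
y <- exp(2 * x)         # strictly increasing function of x, but heavily skewed

pearson  <- cor(x, y, method = "pearson")
spearman <- cor(x, y, method = "spearman")
round(c(pearson = pearson, spearman = spearman), 2)
```

Because y is a strictly increasing function of x, the Spearman coefficient equals 1, whereas the Pearson coefficient falls well short of 1 solely because the two distributions have different shapes.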
Outliers
As described by Tabachnick and Fidell (2019), “an outlier is a case with such an
extreme value on one variable (a univariate outlier) or such a strange combination
of scores on two or more variables (multivariate outlier) that it distorts statistics”
(p. 62). Outliers are, therefore, questionable members of the data set. Outliers
may have been caused by data collection errors, data entry errors, a participant
not understanding the instructions, a participant deliberately entering invalid
responses, or a valid but extreme value. Not all outliers will influence the size of
correlation coefficients and subsequent factor analysis results but some may have a
major effect (Liu et al., 2012). For example, the correlation between the vocab1
and designs1 variables in the iq data set is .58. That correlation drops to -.04
when the final value in the matrix variable is entered as -999 rather than the
correct value of 113. A data entry error like this might be the result of a typo or
of treating a missing data indicator as real data.
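The size of this effect is easy to reproduce with simulated scores. The sketch below does not use the iq data; the variables x and y, the sample size, and the population correlation of .64 are assumptions chosen only to mimic the situation described.

```r
set.seed(5)
n <- 150
g <- rnorm(n)
x <- 100 + 15 * (0.8 * g + 0.6 * rnorm(n))
y <- 100 + 15 * (0.8 * g + 0.6 * rnorm(n))   # population correlation = .64

r_clean <- cor(x, y)

y_bad <- y
y_bad[n] <- -999        # missing-data code mistakenly left in as a real score
r_bad <- cor(x, y_bad)

round(c(clean = r_clean, with_error = r_bad), 2)
```

A single miscoded value among 150 cases is enough to pull a substantial correlation toward zero, because the errant score dominates the variance of its variable.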
Obviously, some outliers can be detected by reviewing descriptive statistics. The
minimum and maximum values might reveal data that exceeds the possible values
that the data can take. For example, it is known that the values of the iq variables can
reasonably range from around 40 to 160. Any value outside that range is improbable
and must be addressed. One way to address such illegal values is to replace them with
a missing value indicator. In R, missing data are indicated by the characters NA. This
replacement can be automated by a procedure within the psych package:
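The original listing is not reproduced here. The psych package provides a scrub function for this kind of recoding; as a stand-in, the same idea can be sketched in base R. The data frame scores and its values are hypothetical, and the 40 to 160 limits are taken from the plausible range given above.

```r
# Simulated scores with two impossible values (NOT the book's data)
scores <- data.frame(
  vocab1   = c(102, 95, -999, 110, 88),
  designs1 = c(99, 250, 105, 97, 113)
)

summary(scores)                       # min/max reveal out-of-range values

# Replace anything outside the plausible 40-160 range with NA
cleaned <- scores
cleaned[cleaned < 40 | cleaned > 160] <- NA
cleaned
colSums(is.na(cleaned))               # count of replaced values per variable
```

Keeping the original data frame untouched and writing the recoded values to a new object makes it possible to document exactly which values were changed.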
Other outliers might be detected with plots as illustrated with the boxplot in
Figure 8.4. That plot clearly displays data points that lie more than 1.5 times the
interquartile range beyond the quartiles. That might be a somewhat liberal standard given that some
experts suggest that 2.2 times the interquartile range be used (Streiner, 2018).
Nevertheless, those values are within plausible ranges and their cause is not clear.
Additionally, they are univariate outliers and EFA is a multivariate procedure that
necessitates that the multidimensional position of each data point be considered.
The Mahalanobis distance (D²) is a measure of the distance of each data point
from the mean of all data points in multidimensional space. Higher D² values
represent observations farther removed from the general distribution of observations
in multidimensional space and high values are potential multivariate
outliers. D² values can be tested for statistical significance but “it is suggested that
conservative levels of significance (e.g., .005 or .001) be used as the threshold
value for designation as an outlier” (Hair et al., 2019, p. 89). Unfortunately,
extreme outliers may negatively influence the accuracy of D² values so robust D²
estimation techniques have been recommended (DeSimone et al., 2015) and are
available in the faoutlier package.
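As a point of reference, the classical (non-robust) version of this screening can be sketched with the mahalanobis function in base R. The simulated matrix X and the planted aberrant case are assumptions for illustration; a robust covariance estimate could be substituted for cov(X) to approximate the robust procedure mentioned above.

```r
set.seed(6)
n <- 150; p <- 4
g <- rnorm(n)
X <- sapply(1:p, function(j) 100 + 15 * (0.8 * g + 0.6 * rnorm(n)))
colnames(X) <- paste0("v", 1:p)
X[n, ] <- c(45, 155, 45, 155)         # planted case with an implausible profile

d2     <- mahalanobis(X, center = colMeans(X), cov = cov(X))
cutoff <- qchisq(0.999, df = p)       # chi-square critical value for p < .001
which(d2 > cutoff)                    # row numbers of potential outliers
```

The squared distances are compared with a chi-square distribution whose degrees of freedom equal the number of variables, which is how the p < .001 threshold is translated into a cutoff.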
Using p < .001 as the threshold (as suggested by Hair et al., 2019 as well as
Tabachnick & Fidell, 2019), cases 121 and 142 are potential outliers. Those cases
are also visible in the Robust MD graph (Figure 8.6) and a Q-Q plot
(Figure 8.7).
An examination of case 121 shows that it contains the previously identified
aberrant value of 37 for the vocab2 variable. However, all of the variable values
for this case are very low (45 to 58) and consistent with impaired intellectual
functioning. Given this consistency, there is no good reason to delete or modify
the value of this case. Case 142 is not as easily understood. Some of its values are
lower than average (e.g., 85) and others are higher than average (e.g., 127). There
is no obvious explanation for why these values are discrepant.
The psych package also contains an outlier function that can compute and
display Mahalanobis distance (D²) measures.
were no errors of this type and there is no obvious explanation for the outlier—the
outlier cannot be explained by a third variable affecting the person’s
score—the outlier should not be removed” (p. 260). Hair et al. (2019) expressed
similar sentiments about outliers: “they should be retained unless demonstrable
proof indicates that they are truly aberrant and not representative of any observations
in the population” (p. 91). Alternative suggestions for identifying and
reducing the effect of outliers have been offered (e.g., Tabachnick & Fidell,
2019). Regardless, extreme values might drastically influence EFA results so it is
incumbent upon the researcher to perform a sensitivity analysis. That is, conduct
EFAs with and without outlier data to verify that the results are robust
(Bandalos & Finney, 2019; Leys et al., 2018; Tabachnick & Fidell, 2019;
Thompson, 2004).
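One way such a sensitivity analysis might look is sketched below with the factanal function from base R's stats package, which performs maximum likelihood factor analysis. It is a simplified stand-in for the EFA procedures used elsewhere in this book. The simulated one-factor data, the planted outlier, and the helper one_factor are all assumptions for illustration.

```r
set.seed(7)
n <- 200
g <- rnorm(n)
loadings_true <- c(0.8, 0.7, 0.6, 0.7, 0.8)
X <- sapply(loadings_true, function(l) l * g + sqrt(1 - l^2) * rnorm(n))
colnames(X) <- paste0("v", 1:5)
X[n, ] <- c(6, -6, 6, -6, 6)          # planted multivariate outlier

# Flag cases whose Mahalanobis distance exceeds the p < .001 cutoff
flag <- mahalanobis(X, colMeans(X), cov(X)) > qchisq(0.999, df = ncol(X))

one_factor <- function(dat) {
  l <- factanal(dat, factors = 1)$loadings[, 1]
  l * sign(sum(l))                    # fix the arbitrary sign of the factor
}
with_all    <- one_factor(X)
without_out <- one_factor(X[!flag, ])

round(cbind(with_all, without_out, change = without_out - with_all), 2)
```

If the loadings change only trivially when flagged cases are dropped, the solution can be described as robust to those cases; large changes would need to be reported and explained.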
FIGURE 8.7 Mahalanobis Distance (D²) Plot to Identify Potential Outliers in IQ Data
Missing Data
Ideally, there will be no missing data. In practice, there often are: people
sometimes skip items on tests or surveys, are absent on exam day, etc. First described
by Rubin (1976), it is now well accepted that the treatment of missing
data is contingent on the mechanism that caused the data to be missing. Data that
is missing completely at random (MCAR) is entirely unsystematic and not related
to any other value on any measured variable. For example, a person may accidentally
skip one question on a test. Data missing at random (MAR), contrary to its
label, is not missing at random. Rather, it is a situation where the missingness can
be fully accounted for by the remainder of the data. For instance, nonresponse to
self-esteem questions might be related to other questions, such as gender and age.
Finally, missing not at random (MNAR) applies when the missing data is related
to the reason that it is missing. For example, an anxious person might not respond
to survey questions dealing with anxiety.
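The three mechanisms can be illustrated with a small simulation. Everything in this sketch is assumed for illustration: the variables age and anxiety, their relationship, and the 20% missingness rate do not come from the iq data or from any cited study.

```r
set.seed(8)
n <- 1000
age     <- rnorm(n, mean = 40, sd = 10)
anxiety <- 50 + 0.5 * (age - 40) + rnorm(n, sd = 8)

# MCAR: every score has the same 20% chance of being missing
mcar <- ifelse(runif(n) < 0.20, NA, anxiety)

# MAR: missingness depends only on an observed variable (age)
mar <- ifelse(age > quantile(age, 0.80), NA, anxiety)

# MNAR: missingness depends on the unobserved score itself
mnar <- ifelse(anxiety > quantile(anxiety, 0.80), NA, anxiety)

round(c(complete = mean(anxiety),
        MCAR = mean(mcar, na.rm = TRUE),
        MAR  = mean(mar,  na.rm = TRUE),
        MNAR = mean(mnar, na.rm = TRUE)), 1)
```

In this simulation the MAR missingness can be accounted for by age, which remains fully observed, whereas the MNAR missingness removes the highest anxiety scores and no observed variable fully explains the loss.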
It is useful to look for patterns at both variable and participant levels when
considering missing data (Fernstad, 2019). For example, the first variable in the
first illustration in Figure 8.8 seems to be problematic whereas the final participant
is problematic in the second illustration. The third illustration reveals relatively
random missingness. Generally, randomly scattered missing data is less
problematic than other patterns (Tabachnick & Fidell, 2019).
Researchers tend to rely on two general approaches to dealing with missing
data: discard a portion of the data or replace missing values with estimated or
imputed values. When discarding data, the entire case can be discarded if one or
more of its data values are missing (listwise deletion). Alternatively, only the
actual missing values can be discarded (pairwise deletion). Most statistical programs
offer these two missing data methods. Both methods will reduce power
and may result in statistical estimation problems and biased parameter estimates
(Zygmont & Smith, 2014).
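In base R the two deletion methods correspond to the use argument of the cor function. The small data frame below is hypothetical and serves only to show how many cases each method retains.

```r
dat <- data.frame(
  a = c(10, 12, NA, 15, 11, 14, 13, 16),
  b = c(20, NA, 23, 27, 21, 26, 24, 30),
  c = c(5, 7, 6, NA, 4, 8, 7, 9)
)

sum(complete.cases(dat))                       # cases kept by listwise deletion
r_listwise <- cor(dat, use = "complete.obs")
r_pairwise <- cor(dat, use = "pairwise.complete.obs")

# Number of cases behind each pairwise correlation
observed   <- ifelse(is.na(as.matrix(dat)), 0, 1)
n_pairwise <- crossprod(observed)
n_pairwise
```

Listwise deletion bases every correlation on the same reduced set of cases, whereas pairwise deletion lets each coefficient rest on a different subsample, which is one reason a pairwise correlation matrix can cause estimation problems.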
A wide variety of methods have been developed to estimate or impute missing
data values (Hair et al., 2019; Roth, 1994; Tabachnick & Fidell, 2019) that range
from simple (replace missing values with the mean value of that variable) to more
complex (predict the missing data value using nonmissing values via regression
analysis) to extremely complex (multiple imputation and maximum likelihood
estimation). Baraldi and Enders (2013) suggested that “researchers must formulate
logical arguments that support a particular missing data mechanism and choose an
analysis method that is most defensible, given their assumptions about missingness”
(p. 639).
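The simplest of these methods, mean replacement, takes one line of base R, and a short sketch shows its main side effect. The vector x is hypothetical.

```r
x <- c(98, 105, NA, 110, 87, NA, 121, 93, 102, 99)

# Replace each missing value with the mean of the observed values
x_mean_imputed <- ifelse(is.na(x), mean(x, na.rm = TRUE), x)

c(sd_observed = sd(x, na.rm = TRUE),
  sd_imputed  = sd(x_mean_imputed))   # smaller: imputed cases add no spread
```

Because every imputed case sits exactly at the mean, the variable's variability shrinks, and its correlations with other variables tend to be attenuated as a result.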
Unfortunately, there is no infallible way to statistically verify the missing data
mechanism and most methods used to deal with missing data values rely, at a
minimum, on the assumption of MAR. Considerable simulation research has
suggested that the amount of missing data may be a practical guide to dealing
with missing data. In general, if less than 5% to 10% of the data are missing in a
random pattern across variables and participants then any method of deletion or
imputation will be acceptable (Chen et al., 2012; Hair et al., 2019; Lee &
Ashton, 2007; Roth, 1994; Tabachnick & Fidell, 2019; Xiao et al., 2019).
When more than 10% of the data are missing, Newman (2014) suggested that
complex multivariate techniques, such as multiple imputation or maximum
likelihood estimation, be used. As with outliers, extensive missing data requires
a sensitivity analysis where the EFA results from different methods of dealing
with missing data are compared for robustness (Goldberg & Velicer, 2006; Hair
et al., 2019; Tabachnick & Fidell, 2019). Additionally, the amount and location
of missing data at variable and participant levels should be transparently
reported.
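The amount and location of missing data can be tabulated with a few base R calls. The data frame in this sketch is hypothetical; only the variable names are borrowed from the text.

```r
dat <- data.frame(
  vocab1   = c(101, NA, 95, 110, 88, 104),
  designs1 = c(99, 97, 105, NA, 113, 92),
  vocab2   = c(100, 96, 98, 108, 90, 101)
)

pct_missing_overall  <- 100 * mean(is.na(dat))       # all values
pct_missing_variable <- 100 * colMeans(is.na(dat))   # per variable
n_missing_case       <- rowSums(is.na(dat))          # per participant

round(pct_missing_overall, 2)
round(pct_missing_variable, 2)
table(n_missing_case)                 # how many cases miss 0, 1, 2, ... values
```

Comparing the overall percentage with the 5% to 10% guideline above, and checking whether the missingness is concentrated in particular variables or participants, gives a first indication of how much care the missing data will require.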
Missing Data in R. R is relatively inflexible in its use of missing data
indicators in data files. The only recognizable missing data indicator is the
It may be easier to delete all missing values from the data file before conducting
an analysis.
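A sketch of that workflow in base R follows. It assumes, purely for illustration, a small comma-separated file in which -999 was used as the missing data code; the na.strings argument converts that code to NA on import and na.omit then drops incomplete cases.

```r
# A small data file in which -999 was used as the missing-data code
raw <- "id,vocab1,designs1
1,102,99
2,-999,104
3,95,-999
4,110,97"

dat <- read.csv(text = raw, na.strings = c("NA", "-999"))
dat
complete <- na.omit(dat)              # keep only cases with no missing values
nrow(complete)
```

Note that na.omit performs listwise deletion, so the earlier cautions about reduced power apply to the resulting data frame.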
The psych package contains a variety of missing data procedures that may be
more useful for EFA than the native R procedures for handling missing data.
If desired, the imputed correlation matrix may be used in future EFA analyses
in place of the original raw data. Alternatively, the raw data can be submitted to
the EFA procedure with the specification that mean or median values be imputed
for missing values.
A visual depiction of the data set may also be useful in recognizing the extent
and pattern of missing data values. This can be accomplished with the Amelia
package (Figure 8.9).
More complex presentations of missing data can be obtained from the VIM
package.
As illustrated in the VIM output in Figure 8.10, 93.42% of the values are NOT
missing. At the measured variable level, vocab1 is missing 1.32% of its values
whereas veranal2 is missing .66% of its values, and vocab2 is missing 0% of its
values.
Currently, maximum likelihood and multiple imputation are the most appropriate
methods to apply when there is more than a trivial amount of missing
data (Enders, 2017). The maximum likelihood estimation method within the
psych package, as illustrated previously, is one option available in R. There are
several other R packages that might be employed if a multiple imputation
method is desired. These include the Amelia, mice, and mitml packages. See
Enders (2017) for a tutorial on multiple imputation with the mitml package.
Report
Scatterplots revealed that linear relationships exist between the variables.
Measures of univariate and multivariate normality indicated an underlying normal
data distribution (Curran et al., 1996; Finney & DiStefano, 2013; Mardia, 1970).
There was no evidence that restriction of range or outliers substantially affected
the scores and there was no missing data. Therefore, a Pearson product-moment
correlation matrix was submitted for EFA.