8 A_Step-By-Step_Guide_to_Exploratory_Factor_Analysi..._----_(8._Step_3_Data_Screening)

Effective data screening combines statistical inspection and graphical analysis to identify underlying relationships in data, as demonstrated by Anscombe's quartet. Exploratory factor analysis (EFA) relies on assumptions such as linear relationships and normal distributions, which must be validated to avoid biased results. Key considerations for screening include restricted score range, linearity, data distributions, and the presence of outliers, all of which can significantly impact EFA outcomes.


8

STEP 3
Data Screening

Effective data screening involves inspection of both statistics and graphics
(Hoelzle & Meyer, 2013; Malone & Lubansky, 2012). Either alone is insufficient.
This was famously demonstrated by Anscombe (1973), who created four x-y data
sets with nearly identical summary statistics (Box 8.1). A quick scan of those
statistics seems to indicate relatively normal distributions with no obvious problem.

BOX 8.1 DESCRIPTIVE STATISTICS AND CORRELATION MATRIX FOR ANSCOMBE QUARTET DATA
Copyright © 2020. Taylor & Francis Group. All rights reserved.

However, there is danger in relying on summary statistics alone. When this
“Anscombe quartet” is graphed, the real relationships in the data emerge
(Figure 8.1). Specifically, the x1–y1 data appear to follow a rough linear
relationship with some variability, the x2–y2 data display a curvilinear rather than a
linear relationship, the x3–y3 data depict a linear relationship except for one large
outlier, and the x4–y4 data show x remaining constant except for one (off the
chart) outlier.

FIGURE 8.1 Scatterplots for Anscombe Quartet Data

Watkins, Marley. A Step-By-Step Guide to Exploratory Factor Analysis with R and RStudio, Taylor & Francis Group, 2020.
ProQuest Ebook Central, http://ebookcentral.proquest.com/lib/ed/detail.action?docID=6413870.
Created from ed on 2024-08-08 21:26:53.
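The quartet ships with base R as the anscombe data frame, so the contrast between its statistics and its graphs can be reproduced directly. The following is a minimal sketch rather than the book's own listing; the object name stats and the 2 x 2 plot layout are choices made here for illustration.

```r
# anscombe is part of base R (datasets package): columns x1..x4 and y1..y4
data(anscombe)

# Summary statistics for each of the four x-y pairs
stats <- sapply(1:4, function(i) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  c(mean_x = mean(x), mean_y = mean(y),
    sd_x = sd(x), sd_y = sd(y), r = cor(x, y))
})
colnames(stats) <- paste0("set", 1:4)
print(round(stats, 2))

# The same four pairs as scatterplots with fitted regression lines
op <- par(mfrow = c(2, 2))
for (i in 1:4) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  plot(x, y, main = paste("Set", i),
       xlab = paste0("x", i), ylab = paste0("y", i))
  abline(lm(y ~ x))
}
par(op)
```

The printed table shows four nearly indistinguishable columns, whereas the four panels differ in exactly the ways described above.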

Assumptions
All multivariate statistics are based on assumptions that will bias results if they are
violated. The assumptions of exploratory factor analysis (EFA) are mostly
conceptual: It is assumed that some underlying structure exists, that the relationships
between measured variables and the underlying common factors are linear, and
that the linear coefficients are invariant across participants (Fabrigar & Wegener,
2012; Hair et al., 2019).
However, EFA is based on Pearson product-moment correlations that also
rely on statistical assumptions. Specifically, it is assumed that a linear relationship
exists between the variables and that there is an underlying normal distribution.
To meet these assumptions, variables must be measured on a continuous scale
(Bandalos, 2018; Puth et al., 2015; Walsh, 1996). Violation of the assumptions
that underlie the Pearson product-moment correlation may bias EFA results. As
suggested by Carroll (1961), “there is no particular point in making a factor
analysis of a matrix of raw correlation coefficients when these coefficients
represent manifest relationships which mask and distort latent relationships”
(p. 356).


More broadly, anything that influences the correlation matrix can potentially
affect EFA results (Carroll, 1985; Onwuegbuzie & Daniel, 2002). As noted by
Warner (2007), “because the input to factor analysis is a matrix of correlations,
any problems that make Pearson r misleading as a description of the strength of
the relationship between pairs of variables will also lead to problems in factor
analysis” (p. 765). Accordingly, the data must be carefully screened before
conducting an EFA to ensure that some untoward influence has not biased the
results (Flora et al., 2012; Goodwin & Leech, 2006; Hair et al., 2019; Walsh,
1996). Potential influences include restricted score range, linearity, data
distributions, outliers, and missing data. “Consideration and resolution of these
issues before the main analysis are fundamental to an honest analysis of the data”
(Tabachnick & Fidell, 2019, p. 52).

Restricted Score Range


The range of scores on the measured variables must be considered. If the sample is
more homogeneous than the population, restriction of range in the measured
variables can result and thereby attenuate correlations among the variables. This
attenuation can result in biased EFA estimates. For example, in the 1949 applicant
pool of the U.S. Coast Guard Academy, the correlation between quantitative and
verbal test scores dropped from .50 for all 2,253
applicants to only .12 for the 128 students who entered the Academy (French et al.,
1952). In such cases, a “factor cannot emerge with any clarity” (Kline, 1991, p. 16).
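The attenuation can be illustrated with a small simulation. This is a sketch under stated assumptions, not a reanalysis of the Coast Guard data: the scores are simulated standard normal values with a population correlation of .50, and the selection rule (admit the 128 highest quantitative scores) is hypothetical.

```r
set.seed(1949)                        # arbitrary seed for reproducibility
n <- 2253                             # size of the simulated applicant pool

# Two simulated test scores with a population correlation of .50
quant  <- rnorm(n)
verbal <- 0.5 * quant + sqrt(1 - 0.5^2) * rnorm(n)

# Hypothetical selection rule: admit the 128 highest quantitative scores
admitted <- rank(-quant) <= 128

r_all      <- cor(quant, verbal)
r_admitted <- cor(quant[admitted], verbal[admitted])
print(round(c(all = r_all, admitted = r_admitted), 2))
```

Because the admitted group varies far less on the quantitative score than the full pool does, the correlation computed within that group is noticeably smaller than the correlation among all applicants.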

Linearity
Pearson coefficients are measures of the linear relationship between two variables.
That is, they describe how well the relationship is approximated by a straight line. Curvilinear or
nonlinear relationships will not be accurately estimated by Pearson coefficients.
Although subjective, visual inspection of scatterplots can be used to assess linearity.
The built-in graphics package offers a simple scatterplot for multiple variables. After
loading the iq data, a scatterplot for two measured variables can be created.

BOX 8.2 SCATTERPLOT WITH REGRESSION LINE FROM GRAPHICS PACKAGE


FIGURE 8.2 Scatterplot for Two Variables

RStudio implements this command and automatically displays the scatterplot
in the Plots tab as demonstrated in Figure 8.2. The scatterplot can be exported as
an image by selecting Export > Save as Image in the Plots tab. That
selection brings up a new window that allows the type of image file (TIFF, PNG,
JPEG, etc.) to be specified as well as the image’s dimensions and file location.
It may be more efficient to review a scatterplot matrix rather than individual
scatterplots (Figure 8.3). This is easily accomplished in R using the pairs.panels
function from the psych package. A scatterplot matrix can also be exported as a
graphics file.
After reviewing the scatterplots, it appears that the iq variables are linearly
related.
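A single scatterplot with a regression line and a scatterplot matrix can both be drawn with the built-in graphics package. The sketch below is not the listing from the boxes: the iq data are not available here, so it simulates a stand-in data frame (iq_sim, a hypothetical name) that borrows four of the iq variable names, and it uses the base pairs function rather than pairs.panels from the psych package.

```r
set.seed(8)                                   # arbitrary seed
n <- 150                                      # arbitrary sample size
g <- rnorm(n)                                 # component shared by all scores

# Simulated standard scores (mean 100, SD 15) that share the component g
make_score <- function(g) {
  round(100 + 15 * (0.7 * g + sqrt(1 - 0.7^2) * rnorm(length(g))))
}
iq_sim <- data.frame(vocab1 = make_score(g), designs1 = make_score(g),
                     veranal2 = make_score(g), vocab2 = make_score(g))

# One scatterplot with its least-squares regression line
plot(designs1 ~ vocab1, data = iq_sim, main = "designs1 by vocab1")
abline(lm(designs1 ~ vocab1, data = iq_sim))

# Scatterplot matrix with a smooth curve in each panel to reveal nonlinearity
pairs(iq_sim, panel = panel.smooth)
```

A smooth curve that stays close to a straight line in every panel is consistent with linear relationships; a clearly bent curve would suggest otherwise.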

BOX 8.3 SCATTERPLOT MATRIX FROM GRAPHICS PACKAGE


FIGURE 8.3 Scatterplot Matrix

Data Distributions
Pearson correlation coefficients theoretically range from -1.00 to +1.00.
However, that is only possible when the two variables have exactly the same
distribution. If, for example, one variable is normally distributed and the other
distribution is skewed, the maximum value of the Pearson correlation is less than
1.00. The more the distribution shapes differ, the greater the restriction of r.
Consequently, it is important to understand the distributional characteristics of
the measured variables to be included in an EFA. For example, it has long been
known that dichotomous items that are skewed in opposite directions may
produce what are known as difficulty factors when submitted to EFA (Bernstein
& Teng, 1989; Greer et al., 2006). That is, a factor may appear that is an artifact of
variable distributions rather than the effect of their content. Using procedures
from the psych package, statistics can be computed to describe variable
distributions.
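The ceiling that differing distribution shapes place on r can be demonstrated with simulated data. In this sketch (simulated values, arbitrary seed), both variables are sorted so that they are paired in identical rank order, which yields the largest Pearson correlation attainable for those two sets of values.

```r
set.seed(42)                      # arbitrary seed
n <- 1000
normal_var <- rnorm(n)            # approximately normal
skewed_var <- rexp(n)             # strongly right-skewed

# Pairing the sorted values gives the maximum possible Pearson r
r_max_different <- cor(sort(normal_var), sort(skewed_var))

# Same exercise with a second normal sample (same shape, different values)
r_max_same <- cor(sort(normal_var), sort(rnorm(n)))

print(round(c(different_shapes = r_max_different,
              same_shape = r_max_same), 3))
```

Even with perfectly matched rank orders, the normal and skewed variables cannot reach a correlation of 1.00, whereas two samples with the same shape come very close to it.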


BOX 8.4 SUMMARY STATISTICS FROM PSYCH PACKAGE

Skew > 2.0 or kurtosis > 7.0 would indicate severe univariate nonnormality
(Curran et al., 1996). These univariate statistics seem to indicate that all eight
measured variables are relatively normally distributed (skew < 1.0 and kurtosis <
2.0) so there should not be much concern about correlations being restricted due
to variable distributions. Skew (departures from symmetry) and kurtosis
(distributions with heavier or lighter tails and higher or flatter peaks) of all variables
seem to be close to normal (normal distributions have expected values of zero).
The histograms displayed in the scatterplot matrix support that assumption.
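For readers who want to see what lies behind such statistics, skew and kurtosis can be computed from the central moments in base R. This is a sketch, not the psych procedure: the function names are hypothetical, the formulas are the simple moment-based versions (packaged functions may apply small-sample corrections and so differ slightly), and comparing absolute values with the 2.0 and 7.0 guidelines is an assumption made here.

```r
# Moment-based skew; 0 for a symmetric distribution
skew_g1 <- function(x) {
  x <- x[!is.na(x)]
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

# Moment-based excess kurtosis; 0 for a normal distribution
kurt_g2 <- function(x) {
  x <- x[!is.na(x)]
  d <- x - mean(x)
  mean(d^4) / mean(d^2)^2 - 3
}

# Flag variables that exceed the skew and kurtosis guidelines
screen_distribution <- function(dat, skew_cut = 2, kurt_cut = 7) {
  out <- data.frame(skew = sapply(dat, skew_g1),
                    kurtosis = sapply(dat, kurt_g2))
  out$severe <- abs(out$skew) > skew_cut | abs(out$kurtosis) > kurt_cut
  out
}

set.seed(3)                                   # arbitrary seed
demo <- data.frame(symmetric = rnorm(500),    # simulated, roughly normal
                   skewed = rexp(500)^2)      # simulated, heavily skewed
screened <- screen_distribution(demo)
print(screened)
```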

BOX 8.5 BOXPLOT FROM GRAPHICS PACKAGE

Graphs can be useful for visual verification of this conclusion. A
multicolored boxplot that displays the distributional statistics of the measured
variables can be generated with the R code presented in Box 8.5. Boxplots, as
depicted in Figure 8.4, have the following characteristics: the thick line in the
box is the median, the bottom of the box is the first quartile, the top of the
box is the third quartile, the “whiskers” show the range of the data (excluding
outliers), and the circles identify outliers (defined as any value more than 1.5 times the
interquartile range beyond the edges of the box).
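The outlier rule can be checked by hand against R's own boxplot statistics. The scores below are made up for illustration; fivenum and boxplot.stats are base R functions, and the box edges they use are Tukey's hinges, which can differ slightly from other quartile definitions.

```r
scores <- c(10, 12, 13, 14, 15, 16, 17, 18, 19, 60)   # made-up values

hinges <- fivenum(scores)[c(2, 4)]   # lower and upper edges of the box
spread <- diff(hinges)               # height of the box
fences <- c(hinges[1] - 1.5 * spread, hinges[2] + 1.5 * spread)

# Values beyond the fences are the circles drawn on a boxplot
flagged_by_hand <- scores[scores < fences[1] | scores > fences[2]]
flagged_by_r    <- boxplot.stats(scores, coef = 1.5)$out

print(fences)
print(flagged_by_hand)
boxplot(scores, horizontal = TRUE)
```

Here the box runs from 13 to 18, so the fences fall at 5.5 and 25.5 and only the value 60 is flagged.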
A group of measured variables might exhibit univariate normality and yet be
multivariate nonnormal. That is, the joint distribution of all the variables might
be nonnormal. The psych package offers an implementation of Mardia's (1970)
multivariate tests and an accompanying Q–Q plot. The mardia output is
somewhat confusing because its descriptions are based on the notation found in
the 1970 paper by Mardia. For this case, we would report multivariate skew =
6.15 (p = .015) and multivariate kurtosis = 83.02 (p = .14).
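The two Mardia statistics can also be computed in a few lines of base R, which may make the packaged output easier to read. This is a sketch of the standard formulas with a hypothetical function name, not the psych implementation, and it is applied to simulated data. Under multivariate normality the kurtosis statistic is expected to be near p(p + 2), where p is the number of variables; with eight variables that is 80, which puts the reported value of 83.02 in context.

```r
mardia_sketch <- function(x) {
  x  <- as.matrix(x[complete.cases(x), , drop = FALSE])
  n  <- nrow(x)
  p  <- ncol(x)
  xc <- scale(x, center = TRUE, scale = FALSE)   # center each column
  S  <- crossprod(xc) / n                        # ML covariance (divides by n)
  D  <- xc %*% solve(S) %*% t(xc)                # n x n matrix of cross-products

  b1p <- sum(D^3) / n^2                          # multivariate skew
  b2p <- sum(diag(D)^2) / n                      # multivariate kurtosis

  skew_chi2 <- n * b1p / 6                       # chi-square test of skew
  skew_df   <- p * (p + 1) * (p + 2) / 6
  kurt_z    <- (b2p - p * (p + 2)) / sqrt(8 * p * (p + 2) / n)

  list(skew = b1p,
       p_skew = pchisq(skew_chi2, skew_df, lower.tail = FALSE),
       kurtosis = b2p,
       p_kurtosis = 2 * pnorm(abs(kurt_z), lower.tail = FALSE))
}

set.seed(1970)                                   # arbitrary seed
sim <- matrix(rnorm(500 * 3), ncol = 3)          # simulated normal data, p = 3
res <- mardia_sketch(sim)
print(res)
```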


FIGURE 8.4 Boxplot

BOX 8.6 MARDIA’S MULTIVARIATE SKEW AND KURTOSIS FROM PSYCH PACKAGE

The diagonal line in the Q–Q plot, displayed in Figure 8.5, represents a
theoretical normal distribution whereas the circles represent scores on the
measured variables in the iq data set. A linear trend in the measured variables visually
illustrates that it is plausible that the iq data came from a normal distribution.
As with many other procedures in R, the same results might be obtained from
a different package. For example, multivariate normality tests are also available in
the QuantPsyc package.

BOX 8.7 MARDIA’S MULTIVARIATE SKEW AND KURTOSIS FROM QUANTPSYC PACKAGE


FIGURE 8.5 Normal Q–Q Plot

Nonnormality, especially kurtosis, can bias Pearson correlation estimates and
thereby bias EFA results (Cain et al., 2017; DeCarlo, 1997; Greer et al., 2006). The
extent to which variables can be nonnormal and not substantially affect EFA results has
been addressed by several researchers. Curran et al. (1996) opined that univariate skew
should not exceed 2.0 and univariate kurtosis should not exceed 7.0. Other
measurement specialists have agreed with those guidelines (Bandalos, 2018; Fabrigar et al.,
1999; Wegener & Fabrigar, 2000). In terms of multinormality, statistically significant
multivariate kurtosis values > 3.0 to 5.0 might bias factor analysis results (Bentler, 2005;
Finney & DiStefano, 2013; Mueller & Hancock, 2019). Spearman or other types of
correlation coefficients might be more accurate in those instances (Bishara & Hittner,
2015; Onwuegbuzie & Daniel, 2002; Puth et al., 2015).

Outliers
As described by Tabachnick and Fidell (2019), “an outlier is a case with such an
extreme value on one variable (a univariate outlier) or such a strange combination
of scores on two or more variables (multivariate outlier) that it distorts statistics”
(p. 62). Outliers are, therefore, questionable members of the data set. Outliers
may have been caused by data collection errors, data entry errors, a participant
not understanding the instructions, a participant deliberately entering invalid
responses, or a valid but extreme value. Not all outliers will influence the size of
correlation coefficients and subsequent factor analysis results but some may have a
major effect (Liu et al., 2012). For example, the correlation between the vocab1
and designs1 variables in the iq data set is .58. That correlation drops to -.04
when the final value in the matrix variable is entered as -999 rather than the
correct value of 113. A data entry error like this might result from a typo or
from treating a missing data indicator as real data.
Some outliers can be detected by reviewing descriptive statistics. The
minimum and maximum values might reveal scores that fall outside the possible values
that the data can take. For example, it is known that the values of the iq variables can
reasonably range from around 40 to 160. Any value outside that range is improbable
and must be addressed. One way to address such illegal values is to replace them with
a missing value indicator. In R, missing data are indicated by the characters NA. This
replacement can be automated by a procedure within the psych package:

BOX 8.8 SUMMARY STATISTICS WITH REPLACEMENT VALUES FROM PSYCH PACKAGE
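The effect of a single miscoded value, and its repair, can be sketched with base R alone. The example below does not use the iq data or the psych procedure: the two score vectors are simulated, the variable names are hypothetical, and the 40 to 160 limits are the plausible range mentioned in the text.

```r
set.seed(113)                        # arbitrary seed
n <- 150
vocab  <- round(rnorm(n, mean = 100, sd = 15))               # simulated scores
design <- round(100 + 0.6 * (vocab - 100) + rnorm(n, 0, 12)) # correlated scores

r_clean <- cor(vocab, design)

# A missing data code is mistaken for a real score on the final case
design_bad    <- design
design_bad[n] <- -999
r_bad <- cor(vocab, design_bad)

# Replace values outside the plausible range with NA, then use complete pairs
design_fixed <- design_bad
design_fixed[design_fixed < 40 | design_fixed > 160] <- NA
r_fixed <- cor(vocab, design_fixed, use = "complete.obs")

print(round(c(clean = r_clean, miscoded = r_bad, recoded = r_fixed), 2))
print(range(design_bad))             # the minimum exposes the impossible value
```

One impossible value out of 150 is enough to pull the correlation far from its original size, and recoding that value to NA restores it almost exactly.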

Other outliers might be detected with plots as illustrated with the boxplot in
Figure 8.4. That plot clearly displays data points that lie more than 1.5 times the
interquartile range beyond the box. That might be a somewhat liberal standard given that some
experts suggest that 2.2 times the interquartile range be used (Streiner, 2018).
Nevertheless, those values are within plausible ranges and their cause is not clear.
Additionally, they are univariate outliers and EFA is a multivariate procedure that
necessitates that the multidimensional position of each data point be considered.
The Mahalanobis distance (D²) is a measure of the distance of each data point
from the mean of all data points in multidimensional space. Higher D² values
represent observations farther removed from the general distribution of
observations in multidimensional space and high values are potential multivariate
outliers. D² values can be tested for statistical significance but “it is suggested that
conservative levels of significance (e.g., .005 or .001) be used as the threshold
value for designation as an outlier” (Hair et al., 2019, p. 89). Unfortunately,
extreme outliers may negatively influence the accuracy of D² values so robust D²
estimation techniques have been recommended (DeSimone et al., 2015) and are
available in the faoutlier package.

BOX 8.9 ROBUST MAHALANOBIS DISTANCE FROM FAOUTLIER PACKAGE

Using p < .001 as the threshold (as suggested by Hair et al., 2019 as well as
Tabachnick & Fidell, 2019), cases 121 and 142 are potential outliers. Those cases
are also visible in the Robust MD graph (Figure 8.6) and a Q-Q plot
(Figure 8.7).
An examination of case 121 shows that it contains the previously identified
aberrant value of 37 for the vocab2 variable. However, all of the variable values
for this case are very low (45 to 58) and consistent with impaired intellectual
functioning. Given this consistency, there is no good reason to delete or modify
the value of this case. Case 142 is not as easily understood. Some of its values are
lower than average (e.g., 85) and others are higher than average (e.g., 127). There
is no obvious explanation for why these values are discrepant.
The psych package also contains an outlier function that can compute and
display Mahalanobis distance (D²) measures.
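Classical (non-robust) D² values can also be obtained with the mahalanobis function in base R and compared with a chi-square critical value. The following sketch uses simulated scores with one planted case; it illustrates the p < .001 threshold and is neither the faoutlier nor the psych listing.

```r
set.seed(2019)                                   # arbitrary seed
p <- 3                                           # number of variables
x <- matrix(rnorm(150 * p, mean = 100, sd = 15), ncol = p)   # simulated scores
x <- rbind(x, c(160, 40, 160))                   # planted, inconsistent case

# Squared Mahalanobis distance of every case from the centroid
d2 <- mahalanobis(x, center = colMeans(x), cov = cov(x))

# D-squared is compared with a chi-square distribution with p df
cutoff  <- qchisq(1 - 0.001, df = p)
flagged <- which(d2 > cutoff)

print(round(cutoff, 2))
print(flagged)
print(round(d2[flagged], 2))
```

Because the planted case is included when the means and covariances are estimated, its own D² is somewhat understated, which is the motivation for the robust alternatives mentioned above.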

BOX 8.10 MAHALANOBIS DISTANCE FROM PSYCH PACKAGE


FIGURE 8.6 Robust MD Graph

It is important to articulate an outlier policy prior to data analysis (Leys et al.,
2018). Failing to do so leaves the researcher vulnerable to interpreting this
ambiguous information in ways inconsistent with best statistical practice (Simmons et al.,
2011). Although there is considerable debate among statisticians as to the
advisability of deleting outliers, Goodwin and Leech (2006) suggested that “the
researcher should first check for data collection or data entry errors. If there
were no errors of this type and there is no obvious explanation for the
outlier—the outlier cannot be explained by a third variable affecting the person’s
score—the outlier should not be removed” (p. 260). Hair et al. (2019) expressed
similar sentiments about outliers: “they should be retained unless demonstrable
proof indicates that they are truly aberrant and not representative of any
observations in the population” (p. 91). Alternative suggestions for identifying and
reducing the effect of outliers have been offered (e.g., Tabachnick & Fidell,
2019). Regardless, extreme values might drastically influence EFA results so it is
incumbent upon the researcher to perform a sensitivity analysis. That is,
conduct EFAs with and without outlier data to verify that the results are robust
(Bandalos & Finney, 2019; Leys et al., 2018; Tabachnick & Fidell, 2019;
Thompson, 2004).
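A sensitivity analysis of this kind might be organized as follows. This is only a sketch under several assumptions: the six items are simulated from a single factor, one aberrant case is planted, cases are flagged with classical D² at p < .001, and the factanal function from base R (maximum likelihood extraction) stands in for whichever EFA procedure is actually used.

```r
set.seed(10)                                   # arbitrary seed
n <- 200
f <- rnorm(n)                                  # one simulated common factor
items <- as.data.frame(
  sapply(1:6, function(j) 0.7 * f + sqrt(1 - 0.7^2) * rnorm(n))
)
names(items) <- paste0("v", 1:6)
items[n, ] <- c(4, -4, 4, -4, 4, -4)           # planted multivariate outlier

# Flag potential multivariate outliers at p < .001
d2      <- mahalanobis(items, colMeans(items), cov(items))
suspect <- d2 > qchisq(0.999, df = ncol(items))

# Run the same one-factor EFA with and without the flagged cases
fit_all     <- factanal(items, factors = 1)
fit_trimmed <- factanal(items[!suspect, ], factors = 1)

# Absolute loadings side by side (the sign of a factor is arbitrary)
comparison <- cbind(all = abs(fit_all$loadings[, 1]),
                    trimmed = abs(fit_trimmed$loadings[, 1]))
print(round(comparison, 2))
```

If the two columns lead to the same interpretation, the results can be reported as robust to the flagged cases; if they diverge, both sets of results deserve to be reported.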


FIGURE 8.7 Mahalanobis Distance (D²) Plot to Identify Potential Outliers in IQ Data

Missing Data
Ideally, there will be no missing data. In practice, there often are: people
sometimes skip items on tests or surveys, are absent on exam day, etc. As first
described by Rubin (1976), it is now well accepted that the treatment of missing
data is contingent on the mechanism that caused the data to be missing. Data that
is missing completely at random (MCAR) is entirely unsystematic and not related
to any other value on any measured variable. For example, a person may
accidentally skip one question on a test. Data missing at random (MAR), contrary to its
label, is not missing at random. Rather, it is a situation where the missingness can
be fully accounted for by the remainder of the data. For instance, nonresponse to
self-esteem questions might be related to other questions, such as gender and age.
Finally, missing not at random (MNAR) applies when the missingness is related
to the very values that are missing. For example, an anxious person might not respond
to survey questions dealing with anxiety.
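The practical difference between these mechanisms can be seen in a short simulation. Everything in this sketch is an assumption made for illustration: the anxiety scores are simulated, the MCAR condition deletes 20% of scores at random, and the MNAR condition uses a logistic rule in which higher scores are more likely to be missing.

```r
set.seed(1976)                         # arbitrary seed
n <- 10000
anxiety <- rnorm(n, mean = 50, sd = 10)   # simulated complete scores

# MCAR: every score has the same 20% chance of being missing
mcar <- anxiety
mcar[runif(n) < 0.20] <- NA

# MNAR: the higher the anxiety score, the more likely it is to be missing
mnar <- anxiety
mnar[runif(n) < plogis((anxiety - 55) / 5)] <- NA

means <- c(complete = mean(anxiety),
           mcar = mean(mcar, na.rm = TRUE),
           mnar = mean(mnar, na.rm = TRUE))
print(round(means, 2))
```

The mean of the observed scores stays close to the complete-data mean under MCAR but is clearly too low under MNAR, which is why the mechanism matters when choosing a treatment.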
It is useful to look for patterns at both variable and participant levels when
considering missing data (Fernstad, 2019). For example, the first variable in the
first illustration in Figure 8.8 seems to be problematic whereas the final
participant is problematic in the second illustration. The third illustration reveals
relatively random missingness. Generally, randomly scattered missing data is less
problematic than other patterns (Tabachnick & Fidell, 2019).
FIGURE 8.8 Missing Data Patterns

Researchers tend to rely on two general approaches to dealing with missing
data: discard a portion of the data or replace missing values with estimated or
imputed values. When discarding data, the entire case can be discarded if one or
more of its data values are missing (listwise deletion). Alternatively, only the
actual missing values can be discarded (pairwise deletion). Most statistical
programs offer these two missing data methods. Both methods will reduce power
and may result in statistical estimation problems and biased parameter estimates
(Zygmont & Smith, 2014).
A wide variety of methods have been developed to estimate or impute missing
data values (Hair et al., 2019; Roth, 1994; Tabachnick & Fidell, 2019) that range
from simple (replace missing values with the mean value of that variable) to more
complex (predict the missing data value using nonmissing values via regression
analysis) to extremely complex (multiple imputation and maximum likelihood
estimation). Baraldi and Enders (2013) suggested that “researchers must formulate
logical arguments that support a particular missing data mechanism and choose an
analysis method that is most defensible, given their assumptions about
missingness” (p. 639).
Unfortunately, there is no infallible way to statistically verify the missing data
mechanism and most methods used to deal with missing data values rely, at a
minimum, on the assumption of MAR. Considerable simulation research has
suggested that the amount of missing data may be a practical guide to dealing
with missing data. In general, if less than 5% to 10% of the data are missing in a
random pattern across variables and participants then any method of deletion or
imputation will be acceptable (Chen et al., 2012; Hair et al., 2019; Lee &
Ashton, 2007; Roth, 1994; Tabachnick & Fidell, 2019; Xiao et al., 2019).
When more than 10% of the data are missing, Newman (2014) suggested that
complex multivariate techniques, such as multiple imputation or maximum
likelihood estimation, be used. As with outliers, extensive missing data requires
a sensitivity analysis where the EFA results from different methods of dealing
with missing data are compared for robustness (Goldberg & Velicer, 2006; Hair
et al., 2019; Tabachnick & Fidell, 2019). Additionally, the amount and location
of missing data at variable and participant levels should be transparently
reported.
Missing Data in R. R is relatively inflexible in its use of missing data
indicators in data files. The only recognizable missing data indicator is the
two-letter NA combination. Other statistical packages allow the user to select
one or more indicators and many researchers have developed habits in this
regard. For example, some researchers assign -9 to missing values without apparent cause,
-99 to missing values where the survey respondent refused to answer, and
-999 when the question did not apply. Data with missing data indicators other
than NA can be edited before they are analyzed in R. This can be
accomplished in EXCEL, SPSS, SAS, Stata, etc. or by using software such as
StatTransfer (https://stattransfer.com) that automatically “translates” between
more than 30 file formats. Alternatively, the import data window in RStudio
allows the specification of a missing data indicator (that specification was
illustrated with -9 in the Importing Raw Data section).
Given this inflexibility, it is important that data be carefully screened to verify
that missing values are appropriately indicated and handled. The iq data set does
not contain any missing data, but a version of that data set was created with 10
random missing values (all indicated with -999) and imported via the menu
sequence of File > Import Dataset > From Excel to illustrate the use of missing data
commands in R. Note that the -999 indicators have been converted to NA in the
new data frame object.
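The same conversion can be done in code rather than through the menus. The sketch below is an illustration with a tiny made-up file written to a temporary location (the file, its values, and the object names are hypothetical); it shows two routes, declaring the indicator while reading a text file and recoding it after the data are already loaded.

```r
# A small made-up CSV file that uses -999 as its missing data indicator
csv_path <- tempfile(fileext = ".csv")
writeLines(c("vocab1,designs1", "98,101", "-999,97", "105,-999"), csv_path)

# Route 1: declare the indicator when the file is read
iq_declared <- read.csv(csv_path, na.strings = c("NA", "-999"))

# Route 2: read the file as is, then recode the indicator to NA
iq_recoded <- read.csv(csv_path)
iq_recoded[iq_recoded == -999] <- NA

print(iq_declared)
print(colSums(is.na(iq_recoded)))    # missing values per variable
```

Either route leaves two NA values in the data frame, one in each variable.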

BOX 8.11 AS.DATA.FRAME AND ATTACH COMMANDS

The optional missing data argument in many R functions allows either
pairwise or listwise deletion. For comparison, the mean of the vocab1 variable in
the original data was 97.500.

BOX 8.12 LISTWISE AND PAIRWISE DELETION OF MISSING DATA VALUES

It may be easier to delete all missing values from the data file before
conducting an analysis.
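The base R tools for these tasks can be sketched on a small made-up data frame. The values and object names below are hypothetical and the listing is not the content of the boxes; na.rm, the use argument of cor, complete.cases, and na.omit are all standard base R.

```r
# Made-up scores with three missing values
iq_miss <- data.frame(
  vocab1   = c(98, NA, 105, 110, 92, 101),
  designs1 = c(101, 97, NA, 108, 95, 99),
  vocab2   = c(100, 96, 104, NA, 90, 103)
)

mean(iq_miss$vocab1)                    # NA, because one value is missing
vocab1_mean <- mean(iq_miss$vocab1, na.rm = TRUE)   # drops the missing value

# Listwise deletion: only cases complete on every variable are used
r_listwise <- cor(iq_miss, use = "complete.obs")

# Pairwise deletion: each correlation uses all cases complete on that pair
r_pairwise <- cor(iq_miss, use = "pairwise.complete.obs")

# Count and display the cases that contain missing data
n_incomplete <- sum(!complete.cases(iq_miss))
print(iq_miss[!complete.cases(iq_miss), ])

# Delete every case with a missing value before analysis
iq_complete <- na.omit(iq_miss)
print(nrow(iq_complete))
```

With pairwise deletion, different correlations in the same matrix can rest on different subsets of cases, which is one source of the estimation problems noted earlier.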


BOX 8.13 COUNT AND DISPLAY CASES WITH MISSING DATA



The psych package contains a variety of missing data procedures that may be
more useful for EFA than the native R procedures for handling missing data.

BOX 8.14 IMPUTE MISSING DATA VALUES WITH MAXIMUM LIKELIHOOD METHOD FROM PSYCH PACKAGE

If desired, the imputed correlation matrix may be used in future EFA analyses
in place of the original raw data. Alternatively, the raw data can be submitted to
the EFA procedure with the specification that mean or median values be imputed
for missing values.
A visual depiction of the data set may also be useful in recognizing the extent
and pattern of missing data values. This can be accomplished with the Amelia
package (Figure 8.9).
FIGURE 8.9 Missingness Map


BOX 8.15 INSTALL AMELIA PACKAGE FOR MISSING DATA

More complex presentations of missing data can be obtained from the VIM
package.

BOX 8.16 INSTALL VIM PACKAGE FOR MISSING DATA

As illustrated in the VIM output in Figure 8.10, 93.42% of the cases have no
missing values. At the measured variable level, vocab1 is missing 1.32% of its values
whereas veranal2 is missing .66% of its values, and vocab2 is missing 0% of its
values.
FIGURE 8.10 Missing Data Display from VIM Package


Currently, maximum likelihood and multiple imputation are the most
appropriate methods to apply when there is more than a trivial amount of missing
data (Enders, 2017). The maximum likelihood estimation method within the
psych package, as illustrated previously, is one option available in R. There are
several other R packages that might be employed if a multiple imputation
method is desired. These include the Amelia, mice, and mitml packages. See
Enders (2017) for a tutorial on multiple imputation with the mitml package.

Report
Scatterplots revealed that linear relationships exist between the variables.
Measures of univariate and multivariate normality indicated an underlying normal
data distribution (Curran et al., 1996; Finney & DiStefano, 2013; Mardia, 1970).
There was no evidence that restriction of range or outliers substantially affected
the scores and there was no missing data. Therefore, a Pearson product-moment
correlation matrix was submitted for EFA.