Multiple Imputation Method
The MI Procedure
DETAILS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152
Descriptive Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . 152
EM Algorithm for Data with Missing Values . . . . . . . . . . . . . . . . . . 153
Statistical Assumptions for Multiple Imputation . . . . . . . . . . . . . . . 154
Missing Data Patterns . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155
Imputation Mechanisms . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156
Regression Method for Monotone Missing Data . . . . . . . . . . . . . . . . . 157
Propensity Score Method for Monotone Missing Data . . . . . . . . . . . . . . 158
MCMC Method for Arbitrary Missing Data . . . . . . . . . . . . . . . . . . . 159
Producing Monotone Missingness with the MCMC Method . . . . . . . . . . . . . 164
MCMC Method Specifications . . . . . . . . . . . . . . . . . . . . . . . . . 166
Convergence in MCMC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 167
Input Data Sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 170
Output Data Sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 171
Combining Inferences from Multiply Imputed Data Sets . . . . . . . . . . . . 173
Multiple Imputation Efficiency . . . . . . . . . . . . . . . . . . . . . . . 174
Imputer's Model Versus Analyst's Model . . . . . . . . . . . . . . . . . . . 174
Parameter Simulation Versus Multiple Imputation . . . . . . . . . . . . . . . 175
ODS Table Names . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 176
EXAMPLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 177
Example 9.1 EM Algorithm for MLE . . . . . . . . . . . . . . . . . . . . . 177
REFERENCES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 199
Chapter 9
The MI Procedure
Overview
The experimental MI procedure performs multiple imputation of missing data. Missing values are an issue in a substantial number of statistical analyses. Most SAS
statistical procedures exclude observations with any missing variable values from
the analysis. These observations are called incomplete cases. While analyzing only
complete cases has its simplicity, the information contained in the incomplete cases
is lost. This approach also ignores possible systematic differences between the complete cases and the incomplete cases, and the resulting inference may not be applicable to the population of all cases, especially with a smaller number of complete
cases.
Some SAS procedures use all the available cases in an analysis, that is, cases with
available information. For example, the CORR procedure estimates a variable mean
by using all cases with nonmissing values for this variable, ignoring the possible
missing values in other variables. PROC CORR also estimates a correlation by using
all cases with nonmissing values for this pair of variables. This makes better use of
the available data, but the resulting correlation matrix may not be positive definite.
Another strategy for handling missing data is simple imputation, which substitutes a
value for each missing value. Standard statistical procedures for complete data analysis can then be used with the filled-in data set. For example, each missing value
can be imputed with the variable mean of the complete cases, or it can be imputed
with the mean conditional on observed values of other variables. This approach treats
missing values as if they were known in the complete-data analysis. However, single imputation does not reflect the uncertainty about the predictions of the unknown
missing values, and the resulting estimated variances of the parameter estimates will
be biased toward zero (Rubin 1987, p. 13).
Instead of filling in a single value for each missing value, multiple imputation (Rubin
1976; 1987) replaces each missing value with a set of plausible values that represent
the uncertainty about the right value to impute. The multiply imputed data sets are
then analyzed by using standard procedures for complete data and combining the
results from these analyses. No matter which complete-data analysis is used, the
process of combining results from different data sets is essentially the same.
Multiple imputation does not attempt to estimate each missing value through simulated values but rather to represent a random sample of the missing values. This
process results in valid statistical inferences that properly reflect the uncertainty due
to missing values; for example, confidence intervals with the correct probability coverage.
Getting Started
Consider the following Fitness data set that has been altered to contain an arbitrary
pattern of missingness:
   *----------------- Data on Physical Fitness -----------------*
   | These measurements were made on men involved in a physical |
   | fitness course at N.C. State University.                   |
   | Only selected variables of                                 |
   | Oxygen (oxygen intake, ml per kg body weight per minute),  |
   | Runtime (time to run 1.5 miles in minutes), and            |
   | RunPulse (heart rate while running) are used.              |
   | Certain values were changed to missing for the analysis.   |
   *------------------------------------------------------------*;
   data FitMiss;
      input Oxygen RunTime RunPulse @@;
      datalines;
   44.609  11.37  178     45.313  10.07  185
   54.297   8.65  156     59.571    .      .
   49.874   9.22    .     44.811  11.63  176
     .     11.95  176       .     10.85    .
   39.442  13.08  174     60.055   8.63  170
   50.541    .      .     37.388  14.03  186
   44.754  11.12  176     47.273    .      .
   51.855  10.33  166     49.156   8.95  180
   40.836  10.95  168     46.672  10.00    .
   46.774  10.25    .     50.388  10.08  168
   39.407  12.63  174     46.080  11.17  156
   45.441   9.63  164       .      8.92    .
   45.118  11.08    .     39.203  12.88  168
   45.790  10.47  186     50.545   9.93  148
   48.673   9.40  186     47.920  11.50  170
   47.467  10.50  170
   ;
Suppose that the data are multivariate normally distributed and the missing data are
missing at random (MAR). That is, the probability that an observation is missing
can depend on the observed variable values of the individual, but not on the missing variable values of the individual. See the "Statistical Assumptions for Multiple
Imputation" section on page 154 for a detailed description of the MAR assumption.
The following statements invoke the MI procedure and impute missing values for the
FitMiss data set.
proc mi data=FitMiss seed=37851 mu0=50 10 180 out=outmi;
var Oxygen RunTime RunPulse;
run;
                      The MI Procedure

                     Model Information

   Data Set                             WORK.FITMISS
   Method                               MCMC
   Multiple Imputation Chain            Single Chain
   Initial Estimates for MCMC           EM Posterior Mode
   Start                                Starting Value
   Prior                                Jeffreys
   Number of Imputations                5
   Number of Burn-in Iterations         200
   Number of Iterations                 100
   Seed for random number generator     37851

Figure 9.1. Model Information
The "Model Information" table describes the method used in the multiple imputation
process. By default, the procedure uses the Markov chain Monte Carlo (MCMC)
method with a single chain to create five imputations. The posterior mode, the parameter
value with the highest observed-data posterior density, is computed with the EM algorithm
under a noninformative prior and is used as the starting value for the chain.
The MI procedure takes 200 burn-in iterations before the first imputation and 100
iterations between imputations. In a Markov chain, the state at the current
iteration influences the state at the next iteration. The burn-in iterations
at the beginning of each chain are used both to eliminate the dependence of the
series on the starting value of the chain and to reach the stationary distribution.
The between-imputation iterations in a single chain are used to eliminate the
dependence between successive imputations.
                      The MI Procedure

                   Missing Data Patterns

                       Run      Run
   Group    Oxygen     Time     Pulse     Freq    Percent

       1    X          X        X           21      67.74
       2    X          X        .            4      12.90
       3    X          .        .            3       9.68
       4    .          X        X            1       3.23
       5    .          X        .            2       6.45

            -----------------Group Means----------------
   Group         Oxygen         RunTime        RunPulse

       1      46.353810       10.809524      171.666667
       2      47.109500       10.137500               .
       3      52.461667               .               .
       4              .       11.950000      176.000000
       5              .        9.885000               .

Figure 9.2. Missing Data Patterns
The "Missing Data Patterns" table lists distinct missing data patterns with corresponding frequencies and percents. Here, an "X" means that the variable is observed
in the corresponding group and a "." means that the variable is missing. The table also
displays group-specific variable means. The MI procedure sorts the data into groups
based on whether an individual's value is observed or missing for each variable to be
analyzed. For a detailed description of missing data patterns, see the "Missing Data
Patterns" section on page 155.
                           The MI Procedure

              Multiple Imputation Variance Information

                 -----------------Variance-----------------
   Variable           Between         Within          Total        DF

   Oxygen            0.045321       0.937239       0.991624    26.113
   RunTime           0.005853       0.072217       0.079241     24.45
   RunPulse          0.611864       3.247163       3.981400    19.227

                      Relative       Fraction
                      Increase        Missing
   Variable        in Variance    Information

   Oxygen             0.058027       0.056263
   RunTime            0.097265       0.092202
   RunPulse           0.226116       0.197941

Figure 9.3. Variance Information
After the completion of m imputations, the "Multiple Imputation Variance Information" table displays the between-imputation variance, within-imputation variance,
and total variance for combining complete-data inferences. It also displays the degrees of freedom for the total variance. The relative increase in variance due to
missing values and the fraction of missing information for each variable are also
displayed. A detailed description of these statistics is provided in the "Combining
Inferences from Multiply Imputed Data Sets" section on page 173.
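As a quick consistency check, the Oxygen row of Figure 9.3 can be reproduced with the standard combining rules for multiple imputation (Rubin 1987). The symbols here are assumptions of this sketch rather than labels taken from the table: m = 5 is the number of imputations, B the between-imputation variance, W the within-imputation variance, T the total variance, and r the relative increase in variance.

```latex
\begin{align*}
T &= W + \Bigl(1 + \tfrac{1}{m}\Bigr) B
   = 0.937239 + 1.2 \times 0.045321 \approx 0.991624, \\
r &= \frac{(1 + 1/m)\,B}{W}
   = \frac{1.2 \times 0.045321}{0.937239} \approx 0.058027 .
\end{align*}
```

Both values agree with the table, and the square root of T, about 0.9958, is the standard error reported for Oxygen in Figure 9.4. The degrees of freedom and the fraction of missing information involve further adjustments that are not worked through here.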
The following "Multiple Imputation Parameter Estimates" table displays the estimated mean and standard error of the mean for each variable. The inferences are
based on the t distribution. The table also displays a 95% confidence interval for the
mean and a t statistic with the associated p-value for the hypothesis that the population
mean is equal to the value specified with the MU0= option. A detailed description
of these statistics is provided in the "Combining Inferences from Multiply Imputed
Data Sets" section on page 173.
                           The MI Procedure

               Multiple Imputation Parameter Estimates

                                             Upper
   Variable            Mean    Std Error    95% CL        DF

   Oxygen         47.126919     0.995803   49.1734    26.113
   RunTime        10.546494     0.281498   11.1269     24.45
   RunPulse      171.621676     1.995344  175.7946    19.227

                                                     t for H0:
   Variable         Minimum      Maximum         Mu0  Mean=Mu0   Pr > |t|

   Oxygen         46.849494    47.318758   50.000000     -2.89     0.0077
   RunTime        10.464123    10.669193   10.000000      1.94     0.0638
   RunPulse      170.623678   172.680679  180.000000     -4.20     0.0005

Figure 9.4. Parameter Estimates
In addition to the output tables, the procedure also creates a data set with imputed
values. The imputed data sets are stored in the outmi data set, with the index variable
_Imputation_ indicating the imputation numbers. The data set can now be analyzed
using standard statistical procedures with _Imputation_ as a BY variable.
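A minimal sketch of such a BY-group analysis follows. It assumes, purely for illustration, that the complete-data analysis of interest is a regression of Oxygen on RunTime and RunPulse; the model and the output data set name outreg are hypothetical choices, not part of the example above.

```sas
/* Fit the same regression separately within each imputed data set.   */
/* OUTEST= with COVOUT saves the estimates and their covariance       */
/* matrix for every value of _Imputation_.                            */
proc reg data=outmi outest=outreg covout noprint;
   model Oxygen = RunTime RunPulse;
   by _Imputation_;
run;
```

The per-imputation estimates and covariance matrices in outreg are the ingredients that the combining step needs.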
The following statements list the first ten observations of data set outmi.
proc print data=outmi (obs=10);
title 'First 10 Observations of the Imputed Data Set';
run;
                                               Run
   Obs    _Imputation_     Oxygen    RunTime    Pulse

     1         1          44.6090    11.3700    178.000
     2         1          45.3130    10.0700    185.000
     3         1          54.2970     8.6500    156.000
     4         1          59.5710     6.1569    138.583
     5         1          49.8740     9.2200    164.163
     6         1          44.8110    11.6300    176.000
     7         1          46.0264    11.9500    176.000
     8         1          42.3040    10.8500    182.486
     9         1          39.4420    13.0800    174.000
    10         1          60.0550     8.6300    170.000

Figure 9.5.
The table shows that the precision of the imputed values differs from the precision of
the observed values. You can use the ROUND= option to make the imputed values
consistent with the observed values.
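For instance, the earlier PROC MI call might be rerun with one roundoff unit per analysis variable. The units below are assumptions read off the precision of the observed data (three decimals for Oxygen, two for RunTime, integers for RunPulse), and outmi2 is a hypothetical data set name.

```sas
/* One ROUND= value per variable, in the order of the VAR statement. */
proc mi data=FitMiss seed=37851 mu0=50 10 180
        round=.001 .01 1 out=outmi2;
   var Oxygen RunTime RunPulse;
run;
```

Because more than one number is given for ROUND=, the VAR statement is required so that each unit can be matched to a variable.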
Syntax
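The following statements are available in PROC MI.
PROC MI < options > ;
BY variables ;
EM < options > ;
FREQ variable ;
MCMC < options > ;
MONOTONE < options > ;
TRANSFORM transform ( variables < / options >)
< ... transform ( variables < / options >) > ;
VAR variables ;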
PROC MI Statement
PROC MI < options > ;
The following table summarizes the options available in the PROC MI statement.
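Table 9.1.
Tasks                                                     Options
Name the input data set                                   DATA=
Create the output data set with imputation results        OUT=
Specify the number of imputations                         NIMPUTE=
Specify the seed for the random number generator          SEED=
Specify the units to round imputed values                 ROUND=
Specify maximum values for imputed variables              MAXIMUM=
Specify minimum values for imputed variables              MINIMUM=
Specify the singularity criterion                         SINGULAR=
Specify the confidence level for the mean estimates       ALPHA=
Specify the means under the null hypothesis               MU0=
Suppress all displayed output                             NOPRINT
Display univariate statistics and correlations            SIMPLE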
The following options can be used in the PROC MI statement (in alphabetical order):
ALPHA=$\alpha$
specifies that confidence limits be constructed for the mean estimates with confidence
level $100(1-\alpha)\%$, where $0 < \alpha < 1$. The default is ALPHA=0.05.
DATA=SAS-data-set
names the SAS data set to be analyzed by PROC MI. By default, the procedure uses
the most recently created SAS data set.
MAXIMUM=numbers
specifies maximum values for imputed variables. When an intended imputed value
is greater than the maximum, PROC MI redraws another value for imputation. If
only one number is specified, that number is used for all variables. If more than one
number is specified, you must use a VAR statement, and the specified numbers must
correspond to variables in the VAR statement. A missing value indicates no restriction on the maximum for the corresponding variable. The default is MAXIMUM=.,
which imposes no restriction on the maximum.
MINIMUM=numbers
specifies the minimum values for imputed variables. When an intended imputed value
is less than the minimum, PROC MI redraws another value for imputation. If only one
number is specified, that number is used for all variables. If more than one number is
specified, you must use a VAR statement, and the specified numbers must correspond
to variables in the VAR statement. A missing value indicates no restriction on the
minimum for the corresponding variable. The default is MINIMUM=., which imposes
no restriction on the minimum.
MU0=numbers
THETA0=numbers
specifies the parameter values $\mu_0$ under the null hypothesis $\mu = \mu_0$ for the population
means corresponding to the analysis variables. Each hypothesis is tested with a t test.
If only one number is specified, that number is used for all variables. If more than
one number is specified, you must use a VAR statement, and the specified numbers
must correspond to variables in the VAR statement. The default is MU0=0.
If a variable is transformed as specified in a TRANSFORM statement, then the same
transformation for that variable is also applied to its corresponding specified MU0=
value in the t test. If the parameter value $\mu_0$ for a transformed variable is not specified, then $\mu_0 = 0$ is used for that transformed variable.
NIMPUTE=number
specifies the number of imputations. The default is NIMPUTE=5. You can specify
NIMPUTE=0 to skip the imputation. In this case, only tables of model information,
missing data patterns, descriptive statistics (SIMPLE option), and MLE from the EM
algorithm (EM statement) are displayed.
NOPRINT
suppresses the display of all output. Note that this option temporarily disables the
Output Delivery System (ODS). For more information, refer to the chapter "Using
the Output Delivery System" in the SAS/STAT User's Guide, Version 8.
OUT=SAS-data-set
creates an output SAS data set containing imputation results. The data set includes
an index variable, _Imputation_, to identify the imputation number. For each imputation, the data set contains all variables in the input data set with missing values
replaced by the imputed values. See the "Output Data Sets" section on page 171 for
a description of this data set.
ROUND=numbers
specifies the units to round variables in the imputation. If only one number is specified, that number is used for all variables. If more than one number is specified, you
must use a VAR statement, and the specified numbers must correspond to variables
in the VAR statement. The default number is a missing value, which indicates no
rounding for imputed variables.
When specifying a roundoff unit for the first variable only, you must also specify a
missing value after the roundoff unit. Otherwise, the roundoff unit is used for all
variables. For example, the option ROUND= 10 . sets a roundoff unit of 10 for the
first analysis variable only and no rounding for the remaining variables. The option
ROUND= . 10 sets a roundoff unit of 10 for the second analysis variable only and
no rounding for other variables.
You can use the ROUND= option to set the precision of imputed values. For example, with a roundoff unit of 0.001, each value is rounded to the nearest multiple of
0.001. That is, each value has at most three digits after the decimal point. See
Example 9.3 for an illustration of this option.
SEED=number
specifies a positive integer that PROC MI uses to start
the pseudo-random number generator. The default is a value generated from reading
the time of day from the computer's clock. However, in order to duplicate the results
under identical situations, you must control the value of the seed explicitly rather than
rely on the clock reading.
The seed information is displayed in the "Model Information" table so that the results
can be reproduced by specifying this seed with the SEED= option.
SIMPLE
displays simple descriptive univariate statistics and pairwise correlations from available cases. For a detailed description of these statistics, see the "Descriptive
Statistics" section on page 152.
SINGULAR=p
specifies the criterion for determining the singularity of a covariance matrix, where
$0 < p < 1$. The default is SINGULAR=1E-8.
Suppose that $S$ is a covariance matrix and $v$ is the number of variables in $S$. Based on
the spectral decomposition $S = \Gamma \Lambda \Gamma'$, where $\Lambda$ is a diagonal matrix of eigenvalues
$\lambda_j$, $j = 1, \ldots, v$, with $\lambda_i \ge \lambda_j$ when $i < j$, and $\Gamma$ is a matrix with the corresponding
orthonormal eigenvectors of $S$ as columns, $S$ is considered singular when
an eigenvalue $\lambda_j$ is less than $p \bar{\lambda}$, where the average $\bar{\lambda} = \sum_{k=1}^{v} \lambda_k / v$.
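The criterion can be tried out directly in SAS/IML. The following is only a sketch of the rule as stated above, not the procedure's internal code; the matrix S is a made-up example whose first and third rows are identical, so it is exactly singular.

```sas
proc iml;
   /* Hypothetical covariance matrix (rows 1 and 3 are identical) */
   S = {4 2 4,
        2 3 2,
        4 2 4};
   p = 1e-8;                       /* same value as the SINGULAR= default   */
   call eigen(lambda, gamma, S);   /* eigenvalues and orthonormal eigenvectors */
   lambdaBar = lambda[:];          /* average eigenvalue                     */
   isSingular = any(lambda < p * lambdaBar);
   print lambda lambdaBar isSingular;
quit;
```

Scaling the threshold by the average eigenvalue makes the test insensitive to the overall scale of the variables.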
BY Statement
BY variables ;
You can specify a BY statement with PROC MI to obtain separate analyses on observations in groups defined by the BY variables. When a BY statement appears, the
procedure expects the input data set to be sorted in order of the BY variables.
If your input data set is not sorted in ascending order, use one of the following alternatives:
- Sort the data using the SORT procedure with a similar BY statement.
- Specify the BY statement option NOTSORTED or DESCENDING in the BY
  statement for the MI procedure. The NOTSORTED option does not mean that
  the data are unsorted but rather that the data are arranged in groups (according to values of the BY variables) and that these groups are not necessarily in
  alphabetical or increasing numeric order.
- Create an index on the BY variables using the DATASETS procedure.
For more information on the BY statement, refer to the discussion in SAS Language
Reference: Concepts, Version 8. For more information on the DATASETS procedure,
refer to the discussion in the SAS Procedures Guide, Version 8.
EM Statement
EM < options > ;
The expectation-maximization (EM) algorithm is a technique for maximum likelihood estimation in parametric models for incomplete data. The EM statement uses
the EM algorithm to compute the MLE for $(\mu, \Sigma)$, the means and covariance matrix, of a multivariate normal distribution from the input data set with missing values.
PROC MI uses the means and standard deviations from available cases as the initial
estimates for the EM algorithm. The correlations are set to zero.
You can also use the EM statement with the NIMPUTE=0 option in the PROC MI statement to compute the EM estimates without multiple imputation, as shown in Example 9.1 in the "Examples" section on page 177.
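A minimal sketch of that combination, reusing the FitMiss data set from the "Getting Started" section, might look as follows. The output data set name outem is a hypothetical choice.

```sas
/* NIMPUTE=0 skips the imputation; the EM statement still computes  */
/* the MLE of the mean vector and covariance matrix.                */
proc mi data=FitMiss nimpute=0;
   em itprint outem=outem;
   var Oxygen RunTime RunPulse;
run;
```

ITPRINT and OUTEM= are optional here; they are included only to show the iteration history and to save the estimates.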
You can specify the following options in the EM statement.
CONVERGE=p
sets the convergence criterion. The value must be between 0 and 1. The iterations are
considered to have converged when the maximum change in the parameter estimates
between iteration steps is less than the value specified. The change is a relative change
if the parameter is greater than 0.01 in absolute value; otherwise, it is an absolute
change. By default, CONVERGE=1E-4.
ITPRINT
prints the iteration history in the EM algorithm.
MAXITER=number
specifies the maximum number of iterations used in the EM algorithm. The default
is MAXITER=200.
OUTEM=SAS-data-set
creates an output SAS data set of TYPE=COV containing the MLE of the parameter
vector $(\mu, \Sigma)$. These estimates are computed with the EM algorithm. See the "Output
Data Sets" section on page 171 for a description of this output data set.
OUTITER < ( options ) > =SAS-data-set
creates an output SAS data set of TYPE=COV containing parameters for each iteration. The data set includes a variable named _Iteration_ to identify the iteration
number.
The parameters in the output data set depend on the options specified. You can specify
the MEAN and COV options to output the mean and covariance parameters. When
no options are specified, the output data set contains the mean parameters for each
iteration. See the "Output Data Sets" section on page 171 for a description of this
data set.
FREQ Statement
FREQ variable ;
If one variable in your input data set represents the frequency of occurrence for other
values in the observation, specify the variable name in a FREQ statement. PROC MI
then treats the data set as if each observation appears n times, where n is the value of
the FREQ variable for the observation. If the value of the FREQ variable is less than
one, the observation is not used in the analysis. Only the integer portion of the value
is used. The total number of observations is considered to be equal to the sum of the
FREQ variable when PROC MI calculates significance probabilities.
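The following sketch shows the FREQ statement on a small, entirely hypothetical data set FitCount, in which the made-up variable Count says how many times each row occurs. NIMPUTE=0 and SIMPLE are used so that only descriptive tables are produced.

```sas
/* Hypothetical data: each row stands for Count identical observations. */
data FitCount;
   input Oxygen RunTime RunPulse Count;
   datalines;
44.609 11.37 178 2
59.571   .     . 1
49.874  9.22   . 3
  .    11.95 176 1
39.442 13.08 174 2
;

proc mi data=FitCount nimpute=0 simple;
   freq Count;
   var Oxygen RunTime RunPulse;
run;
```

With these counts the procedure treats the input as nine observations rather than five.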
MCMC Statement
MCMC < options > ;
The MCMC statement specifies the details of the MCMC method for imputation. The
following table summarizes the options available for the MCMC statement.
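Table 9.2.
Tasks                                                          Options
Name the input data set of parameter estimates                 INEST=
Create the output data set of parameter estimates              OUTEST=
Create the output data set of parameters for each iteration    OUTITER=
Specify full-data or monotone-data imputation                  IMPUTE=
Specify a single chain or multiple chains                      CHAIN=
Specify the number of burn-in iterations                       NBITER=
Specify the number of iterations between imputations           NITER=
Specify the initial mean and covariance estimates              INITIAL=
Specify the prior information                                  PRIOR=
Specify the starting value or starting distribution            START=
Display time-series plots                                      TIMEPLOT=
Display autocorrelation function plots                         ACFPLOT=
Specify the graphics catalog                                   GOUT=
Display the worst linear function                              WLF
Display initial parameter values                               DISPLAYINIT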
The following are the options available for the MCMC statement (in alphabetical
order):
ACFPLOT < ( options < / display-options > ) >
COV < ( variables ) >
displays plots of variances for variables in the list and covariances for pairs
of variables in the list. When the option COV is specified without variables,
variances for all variables and covariances for all pairs of variables are used.
MEAN < ( variables ) >
displays plots of means for variables in the list. When the option MEAN is
specified without variables, all variables are used.
WLF
The default is
CFRAME=color
specifies the color for filling the area enclosed by the axes and the frame. By
default, this area is not filled.
CNEEDLES=color
specifies the color of the vertical line segments (needles) that connect autocorrelations to the reference line. The default is CNEEDLES=BLACK.
CREF=color
The default is
CSYMBOL=color
HSYMBOL=number
specifies the height for data points in percentage screen units. The default is
HSYMBOL=1.
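LCONF=linetype
specifies the line type for the displayed confidence limits. The default is
LCONF=1, a solid line.
LOG
requests that the logarithmic transformations of parameters be used. It is generally used for the variances of variables. When a parameter value is less than or
equal to zero, the value is not displayed in the corresponding plot.
LREF=linetype
specifies the line type for the displayed reference line. The default is LREF=3,
a dashed line.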
NLAG=number
specifies the maximum lag of the series. The default is NLAG=20. The autocorrelations at each lag are displayed in the graph.
SYMBOL=value
specifies the symbol for data points. The default is
SYMBOL=STAR.
TITLE=string
WCONF=number
specifies the width for the displayed confidence limits in percentage screen
units. If you specify the WCONF=0 option, the confidence limits are not displayed. The default is WCONF=1.
WNEEDLES=number
specifies the width for the displayed needles that connect autocorrelations to
the reference line in percentage screen units. If you specify the WNEEDLES=0
option, the needles are not displayed. The default is WNEEDLES=1.
WREF=number
specifies the width for the displayed reference line in percentage screen units.
If you specify the WREF=0 option, the reference line is not displayed. The
default is WREF=1.
For example, the statement
acfplot(mean(y1) cov(y1) / log);
requests autocorrelation function plots for the mean and variance of the variable y1. Logarithmic transformations of both the mean and the variance are used in the plots. For a detailed description of the autocorrelation
function plot, see the "Autocorrelation Function Plot" section on page 169; refer also to Schafer (1997, pp. 120-126) and the SAS/ETS User's Guide, Version 8.
CHAIN=SINGLE | MULTIPLE
specifies whether a single chain is used for all imputations or a separate chain is used
for each imputation. The default is CHAIN=SINGLE.
DISPLAYINIT
displays initial parameter values in the MCMC process for each imputation.
GOUT=graphics-catalog
specifies the graphics catalog for saving graphics output from PROC MI. The default is WORK.GSEG. For more information, refer to the chapter "The GREPLAY
Procedure" in SAS/GRAPH Software: Reference, Version 8.
IMPUTE=FULL | MONOTONE
specifies whether a full-data imputation is used for all missing values or a monotone-data imputation is used for a subset of missing values to make the imputed data sets
have a monotone missing pattern. The default is IMPUTE=FULL. When
IMPUTE=MONOTONE is specified, the order in the VAR statement is used to complete the monotone pattern.
INEST=SAS-data-set
names a SAS data set of TYPE=EST containing parameter estimates for imputations.
These estimates are used to impute values for observations in the DATA= data set.
A detailed description of the data set is provided in the "Input Data Sets" section on
page 170.
INITIAL=EM < ( options ) >
INITIAL=INPUT=SAS-data-set
specifies the initial mean and covariance estimates for the MCMC process. The default is INITIAL=EM.
You can specify INITIAL=INPUT=SAS-data-set to read the initial estimates of the
mean and covariance matrix for each imputation from a SAS data set. See the "Input
Data Sets" section on page 170 for a description of this data set.
With INITIAL=EM, PROC MI derives parameter estimates for a posterior mode,
the highest observed-data posterior density, from the EM algorithm. The MLE from
EM is used to start the EM algorithm for the posterior mode, and the resulting EM
estimates are used to begin the MCMC process.
The following four options are available with INITIAL=EM.
BOOTSTRAP < =number >
requests bootstrap resampling, which uses a simple random sample with replacement from the input data set for the initial estimate. You can explicitly
specify the number of observations in the random sample. Alternatively, you
can implicitly specify the number of observations in the random sample by
specifying the proportion p, 0 < p <= 1, to request [np] observations in the
random sample, where n is the number of observations in the data set and [np]
is the integer part of np. This produces an overdispersed initial estimate that
provides different starting values for the MCMC process. If you specify the
BOOTSTRAP option without the number, p=0.75 is used by default.
CONVERGE=p
sets the convergence criterion. The value must be between 0 and 1. The iterations are considered to have converged when the maximum change in the
parameter estimates between iteration steps is less than the value specified. The
change is a relative change if the parameter is greater than 0.01 in absolute
value; otherwise, it is an absolute change. By default, CONVERGE=1E-4.
ITPRINT
prints the iteration history in the EM algorithm for the posterior mode.
MAXITER=number
NBITER=number
specifies the number of burn-in iterations before the first imputation in each chain.
The default is NBITER=200.
NITER=number
specifies the number of iterations between imputations in a single chain. The default
is NITER=100.
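As an illustration of these chain controls, the sketch below lengthens the burn-in and the spacing between imputations for the FitMiss example. The values 500 and 200 are arbitrary choices for demonstration, and outmcmc is a hypothetical data set name.

```sas
/* Single chain with a longer burn-in and wider spacing than the defaults. */
proc mi data=FitMiss seed=37851 nimpute=5 out=outmcmc;
   mcmc chain=single nbiter=500 niter=200 displayinit;
   var Oxygen RunTime RunPulse;
run;
```

Whether longer runs are needed depends on how quickly the chain converges for the data at hand.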
OUTEST=SAS-data-set
creates an output SAS data set of TYPE=EST. The data set contains parameter
estimates used in each imputation. The data set also includes a variable named
_Imputation_ to identify the imputation number. See the "Output Data Sets" section
on page 171 for a description of this data set.
OUTITER < ( options ) > =SAS-data-set
creates an output SAS data set of TYPE=COV containing parameters used in the imputation step for each iteration. The data set includes variables named _Imputation_
and _Iteration_ to identify the imputation number and iteration number.
The parameters in the output data set depend on the options specified. You can specify options MEAN, STD, COV, LR, LR_POST, and WLF to output parameters of
means, standard deviations, covariances, -2 log LR statistic, -2 log LR statistic of the
posterior mode, and the worst linear function. When no options are specified, the
output data set contains the mean parameters used in the imputation step for each
iteration. See the "Output Data Sets" section on page 171 for a description of this
data set.
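A short sketch of the two output options follows, again on the FitMiss data. The data set names miest and miiter are hypothetical, and MEAN and STD are just two of the options listed above.

```sas
/* Save the per-imputation estimates and the per-iteration means */
/* and standard deviations from the MCMC run.                    */
proc mi data=FitMiss seed=37851 out=outmi;
   mcmc outest=miest outiter(mean std)=miiter;
   var Oxygen RunTime RunPulse;
run;

proc print data=miiter (obs=10);
run;
```

Inspecting the per-iteration data set is one way to look at how the parameters move across iterations.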
PRIOR=name
specifies the prior information for the means and covariances. Valid values for name
are as follows:
JEFFREYS
RIDGE=number
INPUT=SAS-data-set
For a detailed description of the prior information, see the "Bayesian Estimation of
the Mean Vector and Covariance Matrix" section on page 161 and the "Posterior
Step" section on page 162. If you do not specify the PRIOR= option, the default is
PRIOR=JEFFREYS.
The PRIOR=INPUT= option specifies a TYPE=COV data set from which the prior
information of the mean vector and the covariance matrix is read. See the "Input Data
Sets" section on page 170 for a description of this data set.
START=VALUE | DIST
specifies that the initial parameter estimates are used as either the starting value
(START=VALUE) or as the starting distribution (START=DIST) in the first imputation step of each chain. The default is START=VALUE.
TIMEPLOT < ( options < / display-options > ) >
COV < ( variables ) >
displays plots of variances for variables in the list and covariances for pairs
of variables in the list. When the option COV is specified without variables,
variances for all variables and covariances for all pairs of variables are used.
MEAN < ( variables ) >
displays plots of means for variables in the list. When the option MEAN is
specified without variables, all variables are used.
WLF
CFRAME=color
specifies the color for filling the area enclosed by the axes and the frame. By
default, this area is not filled.
CSYMBOL=color
specifies the color of the data points to be displayed in the time-series plots.
The default is CSYMBOL=BLACK.
HSYMBOL=number
specifies the height for data points in percentage screen units. The default is
HSYMBOL=1.
LOG
requests that the logarithmic transformations of parameters be used. It is generally used for the variances of variables. When a parameter value is less than or
equal to zero, the value is not displayed in the corresponding plot.
SYMBOL=value
specifies the symbol for data points. The default is
SYMBOL=PLUS.
TITLE=string
The default is
For a detailed description of the time-series plot, see the "Time-Series Plot" section
on page 168 and Schafer (1997, pp. 120-126).
WLF
displays the worst linear function of parameters. This scalar function of parameters
$\mu$ and $\Sigma$ is "worst" in the sense that its values from iterations converge most slowly
among parameters. For a detailed description of this statistic, see the "Worst Linear
Function of Parameters" section on page 168.
MONOTONE Statement
MONOTONE < options > ;
The MONOTONE statement specifies an imputation method for data sets with monotone missingness. You must also specify a VAR statement and the data set must have a
monotone missing pattern with variables ordered in the VAR list. When both MONOTONE and MCMC statements are specified, the MONOTONE statement is not used.
You can specify the following options in a MONOTONE statement.
METHOD=REG | REGRESSION
METHOD=PROPENSITY < = NGROUPS = number>
specifies the imputation method for a data set with a monotone missing pattern. You can specify either METHOD=REG, a parametric regression method, or
METHOD=PROPENSITY, a nonparametric method based on propensity scores.
The default is METHOD=REG.
When METHOD=PROPENSITY is specified, the MAXIMUM=, MINIMUM=, and
ROUND= options, which make the imputed values more consistent with the observed
variable values, are not applicable.
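As a minimal sketch of how these statements fit together, the following step imputes a data set with a monotone missing pattern by using the regression method. It assumes the FitMono data set from the Examples section of this chapter; the output data set name OutMono and the seed value are illustrative choices, not documented defaults.

```sas
/* Sketch: regression-method imputation for monotone missing data.   */
/* FitMono is the example data set from this chapter; OutMono and    */
/* the seed value are hypothetical choices made for illustration.    */
proc mi data=FitMono seed=37851 out=OutMono;
   monotone method=reg;             /* parametric regression method  */
   var Oxygen RunTime RunPulse;     /* order must match the monotone */
                                    /* missing pattern               */
run;
```

Replacing METHOD=REG with METHOD=PROPENSITY in this sketch requests the propensity score method instead.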
NGROUPS=number
TRANSFORM Statement
TRANSFORM transform ( variables < / options >)
< ... transform ( variables < / options >) > ;
The TRANSFORM statement lists the transformations and their associated variables
to be transformed. The options are transformation options that provide additional
information for the transformation.
The MI procedure assumes that the data are from a multivariate normal distribution
when either the regression method or the MCMC method is used. When some variables in a data set are clearly non-normal, it is useful to transform these variables to
conform to the multivariate normality assumption. With a TRANSFORM statement,
variables are transformed before the imputation process and these transformed variable values are displayed in all of the results. When you specify an OUT= option, the
variable values are reverse-transformed to create the imputed data set.
The following transformations can be used as the transform in the TRANSFORM
statement.
BOXCOX
The following options provide the constant c and $\lambda$ values in the transformations.
C=number
specifies the c value in the transformation. The default is c = 1 for logit transformation and c = 0 for other transformations.
LAMBDA=number
specifies the $\lambda$ value in the power and Box-Cox transformations. You must specify
the $\lambda$ value for these two transformations.
For example, the statement
transform log(y1) power(y2/c=1 lambda=.5);
requests that variables log(y1), a logarithmic transformation for the variable y1, and
$\sqrt{\mathrm{y2} + 1}$, a power transformation for the variable y2, be used in the imputation.
If the MU0= option is used to specify a parameter value $\mu_0$ for a transformed variable,
the same transformation for the variable is also applied to its corresponding MU0=
value in the t test. Otherwise, $\mu_0 = 0$ is used for the transformed variable. See
Example 9.7 for a usage of the TRANSFORM statement.
VAR Statement
VAR variables ;
The VAR statement lists the variables to be analyzed. The variables must be numeric. If you omit the VAR statement, all numeric variables not mentioned in other
statements are used. The VAR statement is required if you specify a MONOTONE
statement, an IMPUTE=MONOTONE option in the MCMC statement, or more than
one number in the MU0=, MAXIMUM=, MINIMUM=, or ROUND= option.
Details
Descriptive Statistics
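Suppose $Y$ is the $n \times p$ matrix of complete data, which may not be fully observed,
$n_0$ is the number of observations fully observed, and $n_j$ is the number of observations
with observed values for variable $Y_j$.
With complete cases, the sample mean vector is
$$\bar{y} = \frac{1}{n_0} \sum y_i$$
the sample covariance matrix is
$$S = \frac{1}{n_0 - 1} \sum (y_i - \bar{y})(y_i - \bar{y})'$$
and the sample correlation matrix is
$$R = D^{-1} S D^{-1}$$
where each summation is over the fully observed observations and $D$ is a diagonal matrix whose diagonal elements are the square roots of the
diagonal elements of $S$.
With available cases, the corrected sum of squares for variable $Y_j$ is
$$\sum (y_{ji} - \bar{y}_j)^2$$
where $\bar{y}_j = \frac{1}{n_j} \sum y_{ji}$ is the sample mean and each summation is over observations
with observed values for variable $Y_j$.
The variance is
$$s^2_{jj} = \frac{1}{n_j - 1} \sum (y_{ji} - \bar{y}_j)^2$$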
The correlations for available cases contain pairwise correlations for each pair of
variables. Each correlation is computed from all observations that have nonmissing
values for the corresponding pair of variables.
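The difference between complete-case and available-case statistics can be illustrated outside PROC MI. The following sketch uses PROC CORR, which by default computes each correlation from all observations with nonmissing values for that pair of variables and, with the NOMISS option, restricts the analysis to observations that have no missing values. The FitMiss data set name comes from this chapter's examples; the use of PROC CORR is an illustration only, not a description of how PROC MI computes these statistics internally.

```sas
/* Available cases: pairwise deletion is the PROC CORR default. */
proc corr data=FitMiss cov;
   var Oxygen RunTime RunPulse;
run;

/* Complete cases: NOMISS excludes any observation that has a   */
/* missing value for one or more of the analysis variables.     */
proc corr data=FitMiss cov nomiss;
   var Oxygen RunTime RunPulse;
run;
```

Comparing the two sets of output shows how much information is discarded when only fully observed cases are used.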
SAS OnlineDoc: Version 8
$$\ln L(\theta | Y_{obs}) = \sum_{g=1}^{G} \ln L_g(\theta | Y_{obs})$$
where $\ln L_g(\theta | Y_{obs})$ is the observed-data log likelihood from the $g$th group, and
$$\ln L_g(\theta | Y_{obs}) = -\frac{n_g}{2} \ln |\Sigma_g| - \frac{1}{2} \sum_{ig} (y_{ig} - \mu_g)' \Sigma_g^{-1} (y_{ig} - \mu_g)$$
where $n_g$ is the number of observations in the $g$th group, the summation is over
observations in the $g$th group, $y_{ig}$ is a vector of observed values corresponding to
observed variables, $\mu_g$ is the corresponding mean vector, and $\Sigma_g$ is the associated
covariance matrix.
Refer to Schafer (1997, pp. 163–181) for a detailed description of the EM algorithm
for multivariate normal data.
PROC MI uses the means and standard deviations from available cases as the initial
estimates for the EM algorithm. The correlations are set to zero. For a discussion of
suggested starting values for the algorithm, see Schafer (1997, p. 169).
You can specify the convergence criterion with the CONVERGE= option in the EM
statement. The iterations are considered to have converged when the maximum
change in the parameter estimates between iteration steps is less than the value specified. You can also specify the maximum number of iterations used in the EM algorithm with the MAXITER= option.
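The following sketch shows one way these EM options might be combined. It assumes the FitMiss data set from this chapter's examples and a hypothetical output data set name, OutEM. The NIMPUTE=0 option, which is not described in this section, is used here on the assumption that only the EM estimates are wanted and no imputations are needed; the CONVERGE= and MAXITER= values are illustrative.

```sas
/* Sketch: run the EM algorithm only and save the resulting MLE.  */
/* OutEM is a hypothetical data set name; CONVERGE= and MAXITER=  */
/* values are examples, not recommendations.                      */
proc mi data=FitMiss nimpute=0;
   em converge=1e-4 maxiter=200 itprint outem=OutEM;
   var Oxygen RunTime RunPulse;
run;
```

The ITPRINT option displays the iteration history, which makes it easy to see whether the MAXITER= limit was reached before the convergence criterion was met.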
Figure 9.6.

Group   Y1   Y2   Y3
  1     X    X    X
  2     X    X    .
  3     X    .    X
  4     X    .    .
  5     .    X    X
  6     .    X    .
  7     .    .    X
  8     .    .    .

Here, an "X" means that the variable is observed in the corresponding group and a
"." means that the variable is missing.
The variable order is used to derive the order of the groups from the data set, and thus
determines the order of missing values in the data to be imputed. If you specify a
different order of variables in the VAR statement, then the results are different even
if the other specifications remain the same.
A data set with variables Y1 , Y2 , ..., Yp (in that order) is said to have a monotone
missing pattern when the event that a variable Yj is missing for a particular individual implies that all subsequent variables Yk , k > j , are missing for that individual.
Alternatively, when a variable Yj is observed for a particular individual, it is assumed
that all previous variables Yk , k < j , are also observed for that individual.
For example, the following figure displays a data set of three variables with a monotone missing pattern. Note that this data set does not have any observations with
missing patterns such as in Groups 3, 5, 6, 7, or 8 in the previous example.
Figure 9.7. Monotone Missing Data Patterns

Group   Y1   Y2   Y3
  1     X    X    X
  2     X    X    .
  3     X    .    .
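The definition of a monotone missing pattern can be checked observation by observation. The following DATA step is a minimal sketch under stated assumptions: the data set Sample, its values, and the variable names Monotone and SeenMissing are all hypothetical and serve only to illustrate the rule that no observed value may follow a missing one in the variable order Y1, Y2, Y3.

```sas
/* Hypothetical three-variable data set; "." marks a missing value. */
data Sample;
   input Y1 Y2 Y3;
   datalines;
1.2 3.4 5.6
2.1 4.3 .
3.3 .   .
4.0 .   7.7
;

/* Flag each observation: Monotone=1 if no observed value follows a */
/* missing one in the order Y1, Y2, Y3; otherwise Monotone=0.       */
data PatternCheck;
   set Sample;
   array v{3} Y1 Y2 Y3;
   Monotone = 1;
   SeenMissing = 0;
   do j = 1 to dim(v);
      if missing(v{j}) then SeenMissing = 1;
      else if SeenMissing then Monotone = 0;
   end;
   drop j SeenMissing;
run;

proc print data=PatternCheck;
run;
```

In this sketch the first three observations follow a monotone pattern, and the fourth does not because Y3 is observed while Y2 is missing.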
Imputation Mechanisms
This section describes the three methods for multiple imputation that are available in
the MI procedure. The method of choice depends on the patterns of missingness in
the data.
For data sets with monotone missing patterns, either a parametric regression
method (Rubin 1987) that assumes multivariate normality or a nonparametric
method that uses propensity scores (Rubin 1987; Lavori, Dawson, and Shera
1995) is appropriate.
For data sets with arbitrary missing patterns, a Markov Chain Monte Carlo
(MCMC) method (Schafer 1997) that assumes multivariate normality is used
to impute either all missing values or just enough missing values to make the
imputed data sets have monotone missing patterns.
With a monotone missing data pattern, you have greater flexibility in your choice of
strategies. For example, in addition to the MCMC method, you can also implement
other methods, such as a regression method, that do not use Markov chains.
With an arbitrary missing data pattern, you can often use the MCMC method, which
creates multiple imputations by drawing simulations from a Bayesian predictive distribution for normal data. Another way to handle a data set with an arbitrary missing data pattern is to use the MCMC approach to impute enough values to make
the missing data pattern monotone. Then, you can use a more flexible imputation
method. This approach is described in the "Producing Monotone Missingness with
the MCMC Method" section on page 164.
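A two-step version of this approach might look like the following sketch. It assumes the FitMiss data set from this chapter's examples; the output data set names, the seed values, and the use of NIMPUTE=1 with a BY statement in the second step are illustrative assumptions rather than a documented recipe.

```sas
/* Step 1: use MCMC to impute just enough values so that each       */
/* imputed data set has a monotone missing pattern.                 */
/* OutMono1 is a hypothetical data set name.                        */
proc mi data=FitMiss seed=17655417 out=OutMono1;
   mcmc impute=monotone;
   var Oxygen RunTime RunPulse;
run;

/* Step 2: complete each monotone data set with one imputation from */
/* the regression method, processing each imputation separately.    */
proc mi data=OutMono1 nimpute=1 seed=51343672 out=OutFull;
   monotone method=reg;
   var Oxygen RunTime RunPulse;
   by _Imputation_;
run;
```

Two separate steps are used because, as noted for the MONOTONE statement, a MONOTONE statement is not used when an MCMC statement appears in the same step.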
Although the regression and MCMC methods assume multivariate normality, inferences based on multiple imputation can be robust to departures from the multivariate
normality if the amount of missing information is not large. It often makes sense
to use a normal model to create multiple imputations even when the observed data
are somewhat non-normal, as supported by simulation studies described in Schafer
(1997) and the original references therein.
You can also use a TRANSFORM statement to transform variables to conform to the
multivariate normality assumption. With a TRANSFORM statement, variables are
transformed before the imputation process and then are reverse-transformed to create
the imputed data set.
Li (1988) presented an argument for convergence of the MCMC method in the continuous case in theory and used it to create imputations for incomplete multivariate
continuous data. But in practice, it is not easy to check the convergence of a Markov
chain, especially for parameters from a large number of variables. PROC MI generates statistics and plots that you can use to check for convergence of the MCMC process. The details are described in the "Convergence in MCMC" section on page 167.
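$$Y_j = \beta_0 + \beta_1 Y_1 + \beta_2 Y_2 + \ldots + \beta_{j-1} Y_{j-1}$$
The variance is drawn as
$$\sigma_{*j}^2 = \hat{\sigma}_j^2 (n_j - j)/g$$
The regression coefficients are drawn as
$$\beta_* = \hat{\beta} + \sigma_{*j} V_{hj}' Z$$
where $V_{hj}$ is the upper triangular matrix in the Cholesky decomposition,
$V_j = V_{hj}' V_{hj}$, and $Z$ is a vector of $j$ independent random normal variates.
The missing values are then replaced by
$$\beta_{*0} + \beta_{*1} y_1 + \beta_{*2} y_2 + \ldots + \beta_{*(j-1)} y_{j-1} + z_i \sigma_{*j}$$
where $y_1, y_2, \ldots, y_{j-1}$ are the values of the first $j-1$ variables and $z_i$ is a
simulated normal deviate.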
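$$\mathrm{logit}(p_j) = \beta_0 + \beta_1 Y_1 + \beta_2 Y_2 + \ldots + \beta_{j-1} Y_{j-1}$$
where $p_j$ is the probability that $Y_j$ is missing given $Y_1, Y_2, \ldots, Y_{j-1}$,
and $\mathrm{logit}(p) = \log(p/(1-p))$.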
3. Create a propensity score for each observation to estimate the probability that it is
missing.
4. Divide the observations into a fixed number of groups (typically assumed to be
five) based on these propensity scores.
5. Apply an approximate Bayesian bootstrap imputation to each group. In group k,
suppose that Yobs denotes the n1 observations with nonmissing Yj values and Ymis
denotes the n0 observations with missing Yj. The approximate Bayesian bootstrap
imputation first draws n1 observations randomly with replacement from Yobs to create
a new data set $Y_{obs}^*$. This is a nonparametric analogue of drawing parameters from
the posterior predictive distribution of the parameters. The process then draws the n0
values for Ymis randomly with replacement from $Y_{obs}^*$.
Steps 1 through 5 are repeated sequentially for each variable with missing values.
Note that the propensity score method was originally designed for a randomized experiment with repeated measures on the response variables. The goal was to impute
the missing values on the response variables. The method uses only the covariate
information that is associated with whether the imputed variable values are missing.
It does not use correlations among variables. It is effective for inferences about the
distributions of individual imputed variables, such as a univariate analysis, but it is
not appropriate for analyses involving relationships among variables, such as a regression analysis. It can also produce badly biased estimates of regression coefficients
when data on predictor variables are missing (Allison 2000).
$$p(\theta | y) = \frac{p(y | \theta)\, p(\theta)}{\int p(y | \theta)\, p(\theta)\, d\theta}$$
MCMC has been applied as a method for exploring posterior distributions in Bayesian
inference. That is, through MCMC, one can simulate the entire joint posterior distribution of the unknown quantities and obtain simulation-based estimates of posterior
parameters that are of interest.
In many incomplete data problems, the observed-data posterior $p(\theta | Y_{obs})$ is intractable and cannot easily be simulated. However, when Yobs is augmented by
an estimated/simulated value of the missing data Ymis, the complete-data posterior
$p(\theta | Y_{obs}, Y_{mis})$ is much easier to simulate. Assuming that the data are from a multivariate normal distribution, data augmentation can be applied to Bayesian inference
with missing data by repeating the following steps:
1. The imputation I-step:
Given an estimated mean vector and covariance matrix, the I-step simulates the missing values for each observation independently. That is, if you denote the variables
with missing values for observation i by Yi(mis) and the variables with observed values by Yi(obs) , then the I-step draws values for Yi(mis) from a conditional distribution
for Yi(mis) given Yi(obs) .
2. The posterior P-step:
Given a complete sample, the P-step simulates the posterior population mean vector
and covariance matrix. These new estimates are then used in the next I-step. Without
prior information about the parameters, a noninformative prior is used. You can also
use other informative priors. For example, prior information about the covariance
matrix can help to stabilize the inference about the mean vector for a
near-singular covariance matrix.
Imputation Step
In each iteration, starting with a given mean vector $\mu$ and covariance matrix $\Sigma$, the
imputation step draws values for the missing data from the conditional distribution
Ymis given Yobs.
Suppose $\mu = [\mu_1', \mu_2']'$ is the partitioned mean vector of two sets of variables, Yobs
and Ymis, where $\mu_1$ is the mean vector for variables Yobs and $\mu_2$ is the mean vector
for variables Ymis.
Also suppose
$$\Sigma = \begin{bmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{12}' & \Sigma_{22} \end{bmatrix}$$
is the partitioned covariance matrix for these variables, where $\Sigma_{11}$ is the covariance
matrix for variables Yobs, $\Sigma_{22}$ is the covariance matrix for variables Ymis, and $\Sigma_{12}$ is
the covariance matrix between variables Yobs and variables Ymis.
By using the sweep operator (Goodnight 1979) on the pivots of the $\Sigma_{11}$ submatrix,
the matrix becomes
$$\begin{bmatrix} -\Sigma_{11}^{-1} & \Sigma_{11}^{-1} \Sigma_{12} \\ \Sigma_{12}' \Sigma_{11}^{-1} & \Sigma_{22.1} \end{bmatrix}$$
where $\Sigma_{22.1} = \Sigma_{22} - \Sigma_{12}' \Sigma_{11}^{-1} \Sigma_{12}$ can be used to compute the conditional covariance
matrix of Ymis after controlling for Yobs.
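For an observation with the preceding missing pattern, the conditional distribution of
Ymis given Yobs = y1 is a multivariate normal distribution with the mean vector
$$\mu_{2.1} = \mu_2 + \Sigma_{12}' \Sigma_{11}^{-1} (y_1 - \mu_1)$$
and the conditional covariance matrix
$$\Sigma_{22.1} = \Sigma_{22} - \Sigma_{12}' \Sigma_{11}^{-1} \Sigma_{12}$$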
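For intuition, consider the simplest case of one observed variable and one missing variable. This is an illustrative special case with made-up numbers, not part of the original text. The conditional mean and variance of the missing variable given the observed value $y_1$ reduce to scalar expressions:

```latex
\mu_{2.1} = \mu_2 + \frac{\sigma_{12}}{\sigma_{11}} (y_1 - \mu_1),
\qquad
\sigma_{22.1} = \sigma_{22} - \frac{\sigma_{12}^2}{\sigma_{11}}

% Hypothetical values: \mu_1 = 10, \mu_2 = 50, \sigma_{11} = 4,
% \sigma_{12} = 6, \sigma_{22} = 25, and an observed y_1 = 12.
\mu_{2.1} = 50 + \frac{6}{4}(12 - 10) = 53,
\qquad
\sigma_{22.1} = 25 - \frac{36}{4} = 16
```

With these assumed values, the imputation step would draw the missing value from a normal distribution with mean 53 and variance 16, so the observed variable both shifts the mean and reduces the variance relative to the marginal distribution.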
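The SSCP matrix
$$A = Y'Y = \sum_i y_i y_i'$$
has a Wishart distribution $W(n, \Sigma)$ when the $n$ rows $y_i$ of $Y$ are independent multivariate normal vectors with mean zero and covariance matrix $\Sigma$. When the rows have an unknown mean, the corrected SSCP matrix
$$A = \sum_i (y_i - \bar{y})(y_i - \bar{y})'$$
has a Wishart distribution $W(n-1, \Sigma)$.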
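If $A$ has a Wishart distribution $W(n, \Sigma)$, then $B = A^{-1}$ has an inverted Wishart
distribution $W^{-1}(n, \Psi)$, where $n$ is the degrees of freedom and $\Psi = \Sigma^{-1}$ is the
precision matrix (Anderson 1984).
Note that, instead of using the parameter $\Psi = \Sigma^{-1}$ for the inverted Wishart distribution, Schafer (1997) uses the parameter $\Lambda = \Sigma$.
Suppose that each observation in the data matrix $Y$ has a multivariate normal distribution with mean $\mu$ and covariance matrix $\Sigma$. Then with a prior inverted Wishart
distribution for $\Sigma$ and a prior normal distribution for $\mu$
$$\Sigma \sim W^{-1}(m, \Psi)$$
$$\mu | \Sigma \sim N\left(\mu_0, \frac{1}{\tau} \Sigma\right)$$
where $\tau > 0$ is a fixed number. The posterior distribution (Anderson 1984, p. 270;
Schafer 1997, p. 152) is
$$\Sigma | Y \sim W^{-1}\left(n + m,\; (n-1)S + \Psi + \frac{n\tau}{n+\tau} (\bar{y} - \mu_0)(\bar{y} - \mu_0)'\right)$$
$$\mu | (\Sigma, Y) \sim N\left(\frac{1}{n+\tau} (n\bar{y} + \tau\mu_0),\; \frac{1}{n+\tau} \Sigma\right)$$
where $(n-1)S$ is the CSSCP matrix.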
Posterior Step
In each iteration, the posterior step simulates the posterior population mean vector
$\mu$ and covariance matrix $\Sigma$ from prior information for $\mu$ and $\Sigma$, and the complete
sample estimates.
You can specify the prior parameter information using one of the following methods:
The next four subsections provide details of the posterior step for different prior distributions.
1. A Noninformative Prior
Without prior information about the mean and covariance estimates, a noninformative prior can be used by specifying the PRIOR=JEFFREYS option. The posterior
distributions (Schafer 1997, p. 154) are
$$\Sigma^{(t+1)} | Y \sim W^{-1}(n-1,\; (n-1)S)$$
$$\mu^{(t+1)} | (\Sigma^{(t+1)}, Y) \sim N\left(\bar{y},\; \frac{1}{n} \Sigma^{(t+1)}\right)$$
2. An Informative Prior for $\mu$ and $\Sigma$
$$\Sigma \sim W^{-1}(d^*,\; d^* S^*)$$
$$\mu | \Sigma \sim N\left(\mu_0,\; \frac{1}{n_0} \Sigma\right)$$
To obtain the prior distribution for $\Sigma$, PROC MI reads the matrix $S^*$ from observations in the data set with _TYPE_='COV', and it reads $n^* = d^* + 1$ from observations with _TYPE_='N'.
To obtain the prior distribution for $\mu$, PROC MI reads the mean vector $\mu_0$
from observations with _TYPE_='MEAN', and it reads $n_0$ from observations with _TYPE_='N_MEAN'. When there are no observations with
_TYPE_='N_MEAN', PROC MI reads $n_0$ from observations with _TYPE_='N'.
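The resulting posterior distribution is
$$\Sigma^{(t+1)} | Y \sim W^{-1}(n + d^*,\; (n-1)S + d^* S^* + S_m)$$
$$\mu^{(t+1)} | (\Sigma^{(t+1)}, Y) \sim N\left(\frac{1}{n + n_0} (n\bar{y} + n_0 \mu_0),\; \frac{1}{n + n_0} \Sigma^{(t+1)}\right)$$
where
$$S_m = \frac{n n_0}{n + n_0} (\bar{y} - \mu_0)(\bar{y} - \mu_0)'$$
3. An Informative Prior for $\Sigma$
$$\Sigma \sim W^{-1}(d^*,\; d^* S^*)$$
$$\Sigma^{(t+1)} | Y \sim W^{-1}((n-1) + d^*,\; (n-1)S + d^* S^*)$$
The expression
$$\frac{1}{(n-1) + d^*} ((n-1)S + d^* S^*)$$
is a weighted average of $S$ and $S^*$.
$$\Sigma^{(t+1)} | Y \sim W^{-1}((n-1) + d^*,\; (n-1)S + d^* S^*)$$
$$\mu^{(t+1)} | (\Sigma^{(t+1)}, Y) \sim N\left(\bar{y},\; \frac{1}{n} \Sigma^{(t+1)}\right)$$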
4. A Ridge Prior
A special case of the preceding adjustment is a ridge prior with $S^* = \mathrm{Diag}\, S$ (Schafer
1997, p. 156). That is, $S^*$ is a diagonal matrix with diagonal elements equal to the
corresponding elements in $S$.
You can request a ridge prior by using the PRIOR=RIDGE= option. You can explicitly specify the number $d^* \geq 1$ in the PRIOR=RIDGE=d option. Or you can implicitly specify the number by specifying the proportion $p$ in the PRIOR=RIDGE=p
option to request $d^* = (n-1)p$.
The posterior is then given by
$$\Sigma^{(t+1)} | Y \sim W^{-1}((n-1) + d^*,\; (n-1)S + d^* S^*)$$
$$\mu^{(t+1)} | (\Sigma^{(t+1)}, Y) \sim N\left(\bar{y},\; \frac{1}{n} \Sigma^{(t+1)}\right)$$
Imputation Step
The I-step is almost identical to the I-step described in the "MCMC Method for Arbitrary Missing Data" section on page 159 except that here only a subset of missing
values needs to be simulated. To state this precisely, denote the variables with observed values for observation i by Yi(obs) and the variables with missing values by
Yi(mis) = (Yi(m1), Yi(m2)), where Yi(m1) is a subset of the missing variables that
will result in monotone missingness when their values are imputed. Then the I-step
draws values for Yi(m1) from a conditional distribution for Yi(m1) given Yi(obs).
Posterior Step
The P-step is different from the P-step described in the "MCMC Method for Arbitrary
Missing Data" section on page 159. Instead of simulating the $\mu$ and $\Sigma$ parameters
from the full imputed data set, the P-step here simulates the $\mu$ and $\Sigma$ parameters
through simulated regression coefficients from regression models based on the imputed data set with a monotone pattern of missingness. The step is similar to the
process described in the "Regression Method for Monotone Missing Data" section
on page 157.
That is, for the variable $Y_j$, a model
$$Y_j = \beta_0 + \beta_1 Y_1 + \beta_2 Y_2 + \ldots + \beta_{j-1} Y_{j-1}$$
is fitted.
The fitted model consists of the regression parameter estimates
$\hat{\beta} = (\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_{j-1})$ and the associated covariance matrix $\hat{\sigma}_j^2 V_j$, where $V_j$ is the
usual $X'X$ inverse matrix from the intercept and variables $Y_1, Y_2, \ldots, Y_{j-1}$.
For each imputation, new parameters $\beta_* = (\beta_{*0}, \beta_{*1}, \ldots, \beta_{*(j-1)})$ and $\sigma_{*j}^2$ are drawn
from the posterior predictive distribution of the parameters. That is, they are simulated from $(\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_{j-1})$, $\hat{\sigma}_j^2$, and $V_j$. The variance is drawn as
$$\sigma_{*j}^2 = \hat{\sigma}_j^2 (n_j - j)/g$$
where $g$ is a $\chi^2_{n_j - p + j - 1}$ random variate and $n_j$ is the number of nonmissing observations for $Y_j$. The regression coefficients are drawn as
$$\beta_* = \hat{\beta} + \sigma_{*j} V_{hj}' Z$$
where $V_{hj}$ is the upper triangular matrix in the Cholesky decomposition
$V_j = V_{hj}' V_{hj}$ and $Z$ is a vector of $j$ independent random normal variates.
These simulated values of $\beta_*$ and $\sigma_{*j}^2$ are then used to re-create the parameters $\mu$
and $\Sigma$. For a detailed description of how to produce monotone missingness with the
MCMC method for multivariate normal data, refer to Schafer (1997, pp. 226–235).
INITIAL=EM Specifications
The EM algorithm is used to find the maximum likelihood estimates for incomplete
data in the EM statement. You can also use the EM algorithm to find a posterior
mode, the parameter estimates that maximize the observed-data posterior density.
The resulting posterior mode provides a good starting value for the MCMC process.
With INITIAL=EM, PROC MI uses the MLE of the parameter vector as the initial
estimates in the EM algorithm for the posterior mode. You can use the ITPRINT
option in INITIAL=EM to display the iteration history for the EM algorithm.
You can use the CONVERGE= option to specify the convergence criterion in deriving
the EM posterior mode. The iterations are considered to have converged when the
maximum change in the parameter estimates between iteration steps is less than the
value specified. By default, CONVERGE=1E-4.
You can also use the MAXITER= option to specify the maximum number of iterations in the EM algorithm. By default, MAXITER=200.
With the BOOTSTRAP option, you can use overdispersed starting values for the
MCMC process. In this case, PROC MI applies the EM algorithm to a bootstrap
sample, a simple random sample with replacement from the input data set, to derive
the initial estimates for each chain (Schafer 1997, p. 128).
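As a hedged illustration, the INITIAL=EM specifications might be written as follows. The sketch assumes that the suboptions are listed in parentheses after INITIAL=EM, and it assumes the FitMiss data set from this chapter's examples; the output data set name OutMI and the seed value are hypothetical. The CONVERGE= and MAXITER= values shown simply restate the defaults given above.

```sas
/* Sketch: start the MCMC process from the EM posterior mode.     */
/* OutMI and the seed value are hypothetical choices.             */
proc mi data=FitMiss seed=21355417 out=OutMI;
   mcmc initial=em(itprint converge=1e-4 maxiter=200);
   var Oxygen RunTime RunPulse;
run;
```

Listing the defaults explicitly makes it clear which values to change if the EM iteration history shows that the algorithm stopped at the iteration limit.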
Convergence in MCMC
The theoretical convergence of the MCMC process has been explored under various
conditions, as described in Schafer (1997, p. 70). However, in practice, verification
of convergence is not a simple matter and cannot be easily implemented in the MI
procedure.
The parameters used in the imputation step for each iteration can be saved in an output
data set with the OUTITER= option. These include the means, standard deviations,
covariances, the worst linear function, and observed-data LR statistics. You can then
monitor the convergence in a single chain by displaying time-series plots and autocorrelations for those parameter values (Schafer 1997, p. 120). The time-series and
autocorrelation function plots for parameters such as variable means, covariances,
and the worst linear function can be displayed by specifying the TIMEPLOT and
ACFPLOT options.
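The following sketch shows how these monitoring options might be requested together. It assumes the FitMiss data set from this chapter's examples; the output data set names OutMI and OutIter and the seed value are hypothetical, and the choice to monitor only the mean of Oxygen is for illustration.

```sas
/* Sketch: save per-iteration parameters and request convergence  */
/* plots for one parameter. OutMI and OutIter are hypothetical.   */
proc mi data=FitMiss seed=42037921 out=OutMI;
   mcmc timeplot(mean(Oxygen))
        acfplot(mean(Oxygen))
        outiter(mean cov)=OutIter;
   var Oxygen RunTime RunPulse;
run;
```

The OutIter data set can then be inspected or plotted separately if additional parameters need to be examined.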
You can apply EM to a bootstrap sample to obtain overdispersed starting values for
multiple chains (Gelman and Rubin 1992). This provides a conservative estimate of
the number of iterations needed before each imputation.
The next four subsections provide useful statistics and plots that can be used to check
the convergence of the MCMC process.
LR Statistics
You can save the observed-data likelihood ratio (LR) statistic in each iteration with
the LR option in the OUTITER= data set. The statistic is based on the observed-data likelihood with parameter values used in the iteration and the observed-data
maximum likelihood derived from the EM algorithm.
In each iteration, the LR statistic is given by
$$-2 \log \left( \frac{f(\hat{\theta}_i)}{f(\hat{\theta})} \right)$$
$$w(\theta) = v'(\theta - \hat{\theta})$$
Time-Series Plot
A time-series plot for a parameter $\theta$ is a scatter plot of successive parameter estimates
$\theta_i$ against the iteration number $i$. The plot provides a simple way to examine the
convergence behavior of the estimation algorithm for $\theta$. Long-term trends in the plot
indicate that successive iterations are highly correlated and that the series of iterations
has not converged.
You can display time-series plots for the worst linear function, the variable means,
variable variances, and covariances of variables. You can also request logarithmic
transformations for positive parameters in the plots with the LOG option. When a
parameter value is less than or equal to zero, the value is not displayed in the corresponding plot.
By default, the MI procedure uses the plus sign (+) as the plot symbol to display the
points with a height of one (percentage screen unit) in a time-series plot. You can use
the SYMBOL=, CSYMBOL=, and HSYMBOL= options to change the shape, color,
and height of the plot symbol.
By default, the plot title "Time-Series Plot" is displayed in a time-series plot. You
can request another title by using the TITLE= option in TIMEPLOT. When another
title is also specified in a TITLE statement, this title is displayed as the main title and
the plot title is displayed as a subtitle in the plot.
You can use options in the GOPTIONS statement to change the color and height of the
title. Refer to the chapter "The SAS/GRAPH Statements" in SAS/GRAPH Software:
Reference, Version 8 for a description of title options. See Example 9.6 for a usage
of the time-series plot.
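The autocorrelation of lag $k$ is
$$\rho_k = \frac{\mathrm{Cov}(\theta_i, \theta_{i+k})}{\mathrm{Var}(\theta_i)}$$
and the sample autocorrelation of lag $k$ is
$$r_k = \frac{\sum_{i=1}^{n-k} (\theta_i - \bar{\theta})(\theta_{i+k} - \bar{\theta})}{\sum_{i=1}^{n} (\theta_i - \bar{\theta})^2}$$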
You can display autocorrelation function plots for the worst linear function, the variable means, variable variances, and covariances of variables. You can also request
logarithmic transformations for parameters in the plots with the LOG option. When a
parameter has values less than or equal to zero, the corresponding plot is not created.
You specify the maximum number of lags of the series with the NLAG= option. The
autocorrelations at each lag less than or equal to the specified lag are displayed in the
graph. In addition, the plot also displays approximate 95% confidence limits for the
autocorrelations. At lag k , the confidence limits indicate a set of approximate 95%
critical values for testing the hypothesis $\rho_j = 0$, $j \geq k$.
By default, the MI procedure uses the star sign (*) as the plot symbol to display the
points with a height of one (percentage screen unit) in the plot, a solid line to display
the reference line of zero autocorrelation, vertical line segments to connect autocorrelations to the reference line, and a pair of dashed lines to display approximately 95%
confidence limits for the autocorrelations.
You can use the SYMBOL=, CSYMBOL=, and HSYMBOL= options to change the
shape, color, and height of the plot symbol, and the CNEEDLES= and WNEEDLES=
options to change the color and width of the needles. You can also use the LREF=,
CREF=, and WREF= options to change the line type, color, and width of the reference line. Similarly, you can use the LCONF=, CCONF=, and WCONF= options to
change the line type, color, and width of the confidence limits.
The input DATA= data set is an ordinary SAS data set containing multivariate data
with missing values.
INEST=SAS-data-set
The input INEST= data set is a TYPE=EST data set and contains a variable
_Imputation_ to identify the imputation number. For each imputation, PROC
MI reads the point estimate from the observations with _TYPE_='PARM' or
_TYPE_='PARMS' and the associated covariances from the observations with
_TYPE_='COV' or _TYPE_='COVB'. These estimates are used as the reference
distribution to impute values for observations in the DATA= data set. When the input INEST= data set also contains observations with _TYPE_='SEED', PROC MI
reads the seed information for the random number generator from these observations.
Otherwise, the SEED= option provides the seed information. See Example 9.8 for a
usage of this option.
INITIAL=INPUT=SAS-data-set
The input INITIAL=INPUT= data set is a TYPE=COV or CORR data set and provides initial parameter estimates for the MCMC process. The covariances derived
from the TYPE=COV/CORR data set are divided by the number of observations to
get the correct covariance matrix for the point estimate (sample mean).
If TYPE=COV, PROC MI reads the number of observations from the observations with _TYPE_='N', the point estimate from the observations
with _TYPE_='MEAN', and the covariances from the observations with
_TYPE_='COV'.
If TYPE=CORR, PROC MI reads the number of observations from the observations with _TYPE_='N', the point estimate from the observations with
_TYPE_='MEAN', the correlations from the observations with _TYPE_='CORR',
and the standard deviations from the observations with _TYPE_='STD'.
PRIOR=INPUT=SAS-data-set
The input PRIOR=INPUT= data set is a TYPE=COV data set that provides information for the prior distribution. You can use the data set to specify a prior distribution
for $\Sigma$ of the form
$$\Sigma \sim W^{-1}(d^*,\; d^* S^*)$$
where $d^* = n^* - 1$ is the degrees of freedom. PROC MI reads the matrix $S^*$ from
observations with _TYPE_='COV' and $n^*$ from observations with _TYPE_='N'.
You can also use this data set to specify a prior distribution for $\mu$ of the form
$$\mu \sim N\left(\mu_0,\; \frac{1}{n_0} \Sigma\right)$$
PROC MI reads the mean vector $\mu_0$ from observations with _TYPE_='MEAN'
and $n_0$ from observations with _TYPE_='N_MEAN'. When there are no observations with _TYPE_='N_MEAN', PROC MI reads $n_0$ from observations with
_TYPE_='N'.
The OUT= data set contains all the variables in the original data set and a new variable
named _Imputation_ that identifies the imputation. For each imputation, the data set
contains all variables in the input DATA= data set with missing values replaced by
imputed values.
OUTEM=SAS-data-set
The OUTEM= data set is a TYPE=COV data set and contains the MLE computed
with the EM algorithm. The observations with _TYPE_='MEAN' contain the estimated mean and the observations with _TYPE_='COV' contain the estimated covariances.
OUTEST=SAS-data-set
The OUTEST= data set is a TYPE=EST data set and contains parameter estimates
used in each imputation in the MCMC method. It also includes an index variable
named _Imputation_, which identifies the imputation.
The OUTITER= data set in an EM statement is a TYPE=COV data set and contains
parameters for each iteration. It also includes a variable _Iteration_ that provides
the iteration number.
The parameters in the output data set depend on the options specified. You can
specify the MEAN and COV options for OUTITER. With the MEAN option, the
output data set contains the mean parameters in observations with the variable
_TYPE_='MEAN'. Similarly, with the COV option, the output data set contains
the covariance parameters in observations with the variable _TYPE_='COV'. When
no options are specified, the output data set contains the mean parameters for each
iteration.
OUTITER < ( options ) > =SAS-data-set in a MCMC statement
The OUTITER= data set in a MCMC statement is a TYPE=COV data set and contains
parameters used in the imputation step for each iteration. It also includes variables
named _Imputation_ and _Iteration_, which provide the imputation number and
iteration number.
The parameters in the output data set depend on the options specified. The following
table summarizes the options available for OUTITER and the corresponding values
for the output variable _TYPE_.

Table 9.3.

Options    Output Parameters                            _TYPE_
MEAN       mean parameters                              MEAN
STD        standard deviations                          STD
COV        covariances                                  COV
LR         -2 log LR statistic                          LOG_LR
LR_POST    -2 log LR statistic of the posterior mode    LOG_POST
WLF        worst linear function                        WLF

When no options are specified, the output data set contains the mean parameters used
in the imputation step for each iteration. For a detailed description of the worst linear
function and LR statistics, see the Convergence in MCMC section on page 167.
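$$\bar{Q} = \frac{1}{m} \sum_{i=1}^{m} \hat{Q}_i$$
Suppose $\bar{U}$ is the within-imputation variance, which is the average of the $m$ complete-data estimates:
$$\bar{U} = \frac{1}{m} \sum_{i=1}^{m} \hat{U}_i$$
and $B$ is the between-imputation variance:
$$B = \frac{1}{m-1} \sum_{i=1}^{m} (\hat{Q}_i - \bar{Q})^2$$
Then the variance estimate associated with $\bar{Q}$ is the total variance (Rubin 1987)
$$T = \bar{U} + \left(1 + \frac{1}{m}\right) B$$
The statistic $(Q - \bar{Q}) T^{-1/2}$ is approximately distributed as $t$ with $v_m$ degrees of
freedom (Rubin 1987), where
$$v_m = (m-1) \left[ 1 + \frac{\bar{U}}{(1 + m^{-1}) B} \right]^2$$
When the complete-data degrees of freedom $v_0$ is small, and there is only a modest
proportion of missing data, the computed degrees of freedom, $v_m$, can be much larger
than $v_0$, which is inappropriate. Barnard and Rubin (1999) recommend the use of an
adjusted degrees of freedom
$$v_m^* = \left[ \frac{1}{v_m} + \frac{1}{\hat{v}_{obs}} \right]^{-1}$$
where $\hat{v}_{obs} = (1 - \gamma)\, v_0 (v_0 + 1)/(v_0 + 3)$ and $\gamma = (1 + m^{-1}) B / T$.
Note that the MI procedure uses the adjusted degrees of freedom, $v_m^*$, for inference.
The degrees of freedom $v_m$ depends on $m$ and the ratio
$$r = \frac{(1 + m^{-1}) B}{\bar{U}}$$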
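The fraction of missing information about $Q$ is estimated by
$$\hat{\lambda} = \frac{r + 2/(v_m + 3)}{r + 1}$$
Both statistics $r$ and $\lambda$ are helpful diagnostics for assessing how the missing data
contribute to the uncertainty about $Q$.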
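The combining rules can be traced by hand for a single parameter. The following sketch is an illustration under stated assumptions: the data set Est and its five pairs of point estimates and variances are made-up values, and the data set and variable names are hypothetical. In practice the MIANALYZE procedure carries out these computations; the DATA step below only mirrors the unadjusted formulas.

```sas
/* Hypothetical complete-data results from m=5 imputations:       */
/* Qhat is the point estimate, Uhat is its estimated variance.    */
data Est;
   input Qhat Uhat @@;
   datalines;
10.2 0.50  9.8 0.55  10.5 0.48  10.1 0.52  9.9 0.51
;

/* Qbar and Ubar are averages; B is the sample variance of Qhat   */
/* (divisor m-1); m is the number of imputations.                 */
proc means data=Est noprint;
   var Qhat Uhat;
   output out=Stats mean=Qbar Ubar var(Qhat)=B n(Qhat)=m;
run;

data Combined;
   set Stats;
   T      = Ubar + (1 + 1/m) * B;                    /* total variance      */
   r      = (1 + 1/m) * B / Ubar;                    /* relative increase   */
   vm     = (m - 1) * (1 + Ubar / ((1 + 1/m) * B))**2; /* degrees of freedom */
   Lambda = (r + 2 / (vm + 3)) / (r + 1);            /* fraction of missing */
                                                     /* information         */
   StdErr = sqrt(T);
   keep m Qbar Ubar B T r vm Lambda StdErr;
run;

proc print data=Combined noobs;
run;
```

The sketch computes the unadjusted degrees of freedom only; the adjusted value additionally requires the complete-data degrees of freedom, which depends on the analysis that produced the estimates.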
$$RE = \left(1 + \frac{\lambda}{m}\right)^{-1}$$
The following table shows relative efficiencies with different values of $m$ and $\lambda$.
For cases with little missing information, only a small number of imputations are
necessary.
Table 9.4. Relative Efficiency

                        lambda
  m      10%      20%      30%      50%      70%
  3    0.9677   0.9375   0.9091   0.8571   0.8108
  5    0.9804   0.9615   0.9434   0.9091   0.8772
 10    0.9901   0.9804   0.9709   0.9524   0.9346
 20    0.9950   0.9901   0.9852   0.9756   0.9662

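The entries of the relative efficiency table follow directly from the formula RE = 1/(1 + lambda/m). The following DATA step is a minimal sketch that recomputes them; the data set name RelEff and the variable names are hypothetical.

```sas
/* Recompute relative efficiency for each combination of the      */
/* number of imputations (m) and the fraction of missing          */
/* information (Lambda).                                          */
data RelEff;
   do m = 3, 5, 10, 20;
      do Lambda = 0.1, 0.2, 0.3, 0.5, 0.7;
         RE = 1 / (1 + Lambda / m);
         output;
      end;
   end;
   format RE 6.4;
run;

proc print data=RelEff noobs;
run;
```

Adding other values to either list shows how quickly the gain from additional imputations levels off for a given fraction of missing information.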
For example, consider the same trivariate data set with variables Y1 and Y2 fully
observed, and a variable Y3 with missing values. An imputer creates multiple imputations with the model Y3 = Y1 Y2 . However, the analyst can later use the simpler
model Y3 = Y1 . In this case, the analyst assumes more than the imputer. That is, the
analyst assumes there is no relationship between variables Y3 and Y2 .
The effect of the discrepancy between the models depends on whether the analyst's
additional assumption is true. If the assumption is true, the imputer's model still
applies. The inferences derived from multiple imputations will still be valid, although
they may be somewhat conservative because they reflect the additional uncertainty of
estimating the relationship between Y3 and Y2 .
On the other hand, suppose that the analyst models Y3 = Y1 , and there is a relationship between variables Y3 and Y2 . Then the model Y3 = Y1 will be biased and is
inappropriate. Appropriate results can be generated only from appropriate analyst's
models.
Another type of discrepancy occurs when the imputer assumes more than the analyst.
For example, suppose that an imputer creates multiple imputations with the model
Y3 = Y1 , but the analyst later fits a model Y3 = Y1 Y2 . When the assumption is true,
the imputer's model is a correct model and the inferences still hold.
On the other hand, suppose there is a relationship between Y3 and Y2 . Imputations
created under the incorrect assumption that there is no relationship between Y3 and
Y2 will make the analyst's estimate of the relationship biased toward zero. Multiple
imputations created under an incorrect model can lead to incorrect conclusions.
Thus, generally you should include as many variables as you can when doing multiple imputation. The precision you lose when you include unimportant predictors
is usually a relatively small price to pay for the general validity of analyses of the
resultant multiply imputed data set (Rubin 1996).
Note that it is good practice to include a description of the imputer's model with
the multiply imputed data set. That way, the analysts will have information about
the variables involved in the imputation and which relationships among the variables
have been implicitly set to zero.
ODS Table Name   Description                Option
                 Model information
                 Missing data patterns
                 Variable Transformations   TRANSFORM statement
Univariate                                  SIMPLE
Corr                                        SIMPLE
EMInitEst                                   EM statement
EMEst                                       EM statement
EMIter                                      ITPRINT in EM statement
EMPIter                                     ITPRINT in INITIAL=EM
EMPEst                                      INITIAL=EM
EMWlf                                       WLF
MCMCInitEst                                 DISPLAYINIT in MCMC
VarianceInfo
ParmEst
Example 9.1.
Examples
The following FitMono data set has a monotone missing data pattern and is used in
Example 9.2 with the propensity score method and in Example 9.3 with the regression
method. The FitMiss data set created in the Getting Started section is used in other
examples. Note that the original data set has been altered for these examples.
*----------------- Data on Physical Fitness -----------------*
| These measurements were made on men involved in a physical |
| fitness course at N.C. State University.                   |
| Only selected variables of                                 |
| Oxygen (oxygen intake, ml per kg body weight per minute),  |
| Runtime (time to run 1.5 miles in minutes), and            |
| RunPulse (heart rate while running) are used.              |
| Certain values were changed to missing for the analysis.   |
*------------------------------------------------------------*;
data FitMono;
   input Oxygen RunTime RunPulse @@;
   datalines;
44.609  11.37  178     45.313  10.07  185
54.297   8.65  156     59.571    .      .
49.874   9.22    .     44.811  11.63  176
45.681  11.95  176     49.091  10.85    .
39.442  13.08  174     60.055   8.63  170
50.541    .      .     37.388  14.03  186
44.754  11.12  176     47.273    .      .
51.855  10.33  166     49.156   8.95  180
40.836  10.95  168     46.672  10.00    .
46.774  10.25    .     50.388  10.08  168
39.407  12.63  174     46.080  11.17  156
45.441   9.63  164     54.625   8.92  146
45.118  11.08    .     39.203  12.88  168
45.790  10.47  186     50.545   9.93  148
48.673   9.40  186     47.920  11.50  170
47.467  10.50  170
;
Note that when you specify the NIMPUTE=0 option, the missing values are not imputed. The procedure generates the following output:
SAS OnlineDoc: Version 8
Model Information

                  The MI Procedure

                  Model Information

Data Set                            WORK.FITMISS
Method                              MCMC
Multiple Imputation Chain           Single Chain
Initial Estimates for MCMC          EM Posterior Mode
Start                               Starting Value
Prior                               Jeffreys
Number of Imputations               0
Number of Burn-in Iterations        200
Number of Iterations                100
Seed for random number generator    55417
The Model Information table describes the method and options used in the procedure.
Output 9.1.2.

                    Run      Run
 Group   Oxygen     Time     Pulse     Freq    Percent
     1   X          X        X           21      67.74
     2   X          X        .            4      12.90
     3   X          .        .            3       9.68
     4   .          X        X            1       3.23
     5   .          X        .            2       6.45

         -----------------Group Means----------------
 Group         Oxygen         RunTime        RunPulse
     1      46.353810       10.809524      171.666667
     2      47.109500       10.137500               .
     3      52.461667               .               .
     4              .       11.950000      176.000000
     5              .        9.885000               .
The Missing Data Patterns table lists distinct missing data patterns with corresponding frequencies and percents. Here, an "X" means that the variable is observed
in the corresponding group and a "." means that the variable is missing. The table also
displays group-specific variable means.
With the SIMPLE option, the procedure displays simple descriptive univariate statistics for available cases in the Univariate Statistics table and correlations from pairwise available cases in the Pairwise Correlations table.
Output 9.1.3.  Univariate Statistics

                           The MI Procedure

                         Univariate Statistics

 Variable      N          Mean       Std Dev       Minimum       Maximum
 Oxygen       28      47.11618       5.41305      37.38800      60.05500
 RunTime      28      10.68821       1.37988       8.63000      14.03000
 RunPulse     22     171.86364      10.14324     148.00000     186.00000
Output 9.1.4.  Pairwise Correlations

                       The MI Procedure

                     Pairwise Correlations

                    Oxygen          RunTime         RunPulse
 Oxygen        1.000000000     -0.849118562     -0.343961742
 RunTime      -0.849118562      1.000000000      0.247258191
 RunPulse     -0.343961742      0.247258191      1.000000000
With the EM statement, the procedure displays the initial parameter estimates for
EM.
Output 9.1.5.

 _TYPE_   _NAME_          Oxygen       RunTime       RunPulse
 MEAN                  47.116179     10.688214     171.863636
 COV      Oxygen       29.301078             0              0
 COV      RunTime              0      1.904067              0
 COV      RunPulse             0             0     102.885281
 _Iteration_      -2 Log L        Oxygen       RunTime       RunPulse
           0    289.544782     47.116179     10.688214     171.863636
           1    263.549489     47.116179     10.688214     171.863636
           2    255.851312     47.139089     10.603506     171.538203
           3    254.616428     47.122353     10.571685     171.426790
           4    254.494971     47.111080     10.560585     171.398296
           5    254.483973     47.106523     10.556768     171.389208
           6    254.482920     47.104899     10.555485     171.385257
           7    254.482813     47.104348     10.555062     171.383345
           8    254.482801     47.104165     10.554923     171.382424
           9    254.482800     47.104105     10.554878     171.381992
          10    254.482800     47.104086     10.554864     171.381796
The procedure then displays the EM (MLE) parameter estimates, the maximum likelihood estimates for $\mu$ and $\Sigma$ of a multivariate normal distribution from the data set
FitMiss.
Output 9.1.7.

 _TYPE_   _NAME_          Oxygen       RunTime       RunPulse
 MEAN                  47.104086     10.554864     171.381796
 COV      Oxygen       27.798014     -6.457929     -18.030790
 COV      RunTime      -6.457929      2.015491       3.516092
 COV      RunPulse    -18.030790      3.516092      97.766559
Example 9.2.
You can also output the EM (MLE) parameter estimates into an output data set with
the OUTEM= option. The following statements list the observations in the output
data set outem.
proc print data=outem;
   title 'EM Estimates';
run;
Output 9.1.8.  EM Estimates

                        EM Estimates

 Obs   _TYPE_   _NAME_        Oxygen    RunTime    RunPulse
   1   MEAN                  47.1041    10.5549     171.382
   2   COV      Oxygen       27.7980    -6.4579     -18.031
   3   COV      RunTime      -6.4579     2.0155       3.516
   4   COV      RunPulse    -18.0308     3.5161      97.767
The output data set outem is a TYPE=COV data set. The observation with
_TYPE_=MEAN contains the MLE for the parameter $\mu$, and the observations with
_TYPE_=COV contain the MLE for the parameter $\Sigma$ of a multivariate normal
distribution from the data set FitMiss.
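The statements that produce the tables in this example are not reproduced above. A minimal sketch that is consistent with the options named in the text (NIMPUTE=0, SIMPLE, and the EM statement with ITPRINT and OUTEM=) might look as follows. The seed is taken from the Model Information table, and the exact form of the statements is an assumption rather than the original program.

```sas
/* Sketch only: an invocation consistent with the options named in */
/* this example. NIMPUTE=0 suppresses imputation, SIMPLE requests  */
/* univariate statistics and pairwise correlations, and the EM     */
/* statement requests the MLEs, with ITPRINT for the iteration     */
/* history and OUTEM= for the output data set.                     */
proc mi data=FitMiss seed=55417 simple nimpute=0;
   em itprint outem=outem;
   var Oxygen RunTime RunPulse;
run;
```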
Note that the VAR statement is required and the data set must have a monotone missing pattern with variables as ordered in the VAR statement. The procedure generates
the following output:
Output 9.2.1.  Model Information

                  The MI Procedure

                  Model Information

Data Set                            WORK.FITMONO
Method                              Propensity
Number of Imputations               5
Number of Groups on Propensity      5
Seed for random number generator    55417
                    Run      Run
 Group   Oxygen     Time     Pulse     Freq    Percent
     1   X          X        X           23      74.19
     2   X          X        .            5      16.13
     3   X          .        .            3       9.68

 Group         Oxygen         RunTime        RunPulse
     1      46.684174       10.776957      170.739130
     2      47.505800       10.280000               .
     3      52.461667               .               .
The Missing Data Patterns table lists distinct missing data patterns with corresponding frequencies and percents. Here, an "X" means that the variable is observed
in the corresponding group and a "." means that the variable is missing. The table also
displays group-specific variable means.
Output 9.2.3.  Variance Information

                        The MI Procedure

             -----------------Variance-----------------
 Variable         Between         Within          Total         DF
 RunTime         0.001068       0.059100       0.060382     27.498
 RunPulse        1.147555       4.686646       6.063711     17.006

                  Relative       Fraction
                  Increase        Missing
 Variable      in Variance    Information
 RunTime          0.021688       0.021448
 RunPulse         0.293828       0.246288
After the completion of m imputations, the Multiple Imputation Variance Information table displays the between-imputation variance, within-imputation variance,
and total variance for combining complete-data inferences. It also displays the degrees of freedom for the total variance. The relative increase in variance due to missingness
and the fraction of missing information for each variable are also displayed.
A detailed description of these statistics is provided in the Combining Inferences
from Multiply Imputed Data Sets section on page 173.
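As a check on how these quantities relate, the following DATA step is a sketch that recomputes them from the between-imputation variance B and within-imputation variance W shown in Output 9.2.3. It assumes the standard combining rules for m imputations: total variance T = W + (1 + 1/m)B, relative increase r = (1 + 1/m)B / W, and fraction of missing information (r + 2/(v + 3)) / (r + 1), where v = (m - 1)(1 + 1/r)**2. The data set and variable names are illustrative, and the DF column displayed by the procedure is not recomputed here.

```sas
/* Sketch: recompute total variance, relative increase in variance, */
/* and fraction of missing information from B and W.                */
/* All names (VarInfo, RelInc, FracMiss, ...) are illustrative.     */
data VarInfo;
   length Variable $ 8;
   input Variable $ Between Within;
   m        = 5;                                   /* number of imputations     */
   Total    = Within + (1 + 1/m) * Between;        /* total variance T          */
   RelInc   = (1 + 1/m) * Between / Within;        /* relative increase r       */
   v        = (m - 1) * (1 + 1/RelInc)**2;         /* large-sample df           */
   FracMiss = (RelInc + 2/(v + 3)) / (RelInc + 1); /* fraction missing info     */
   StdErr   = sqrt(Total);                         /* standard error of mean    */
   datalines;
RunTime  0.001068 0.059100
RunPulse 1.147555 4.686646
;

proc print data=VarInfo noobs;
run;
```

The results should agree with the displayed table up to rounding of the printed inputs.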
The Multiple Imputation Parameter Estimates table displays the estimated mean
and standard error of the mean for each variable. The inferences are based on the
t-distributions. For each variable, the table also displays a 95% mean confidence
interval and a t-statistic with the associated p-value for the hypothesis that the population mean is equal to the value specified in the MU0= option, which is zero by
default.
Output 9.2.4.  Parameter Estimates

                          The MI Procedure

               Multiple Imputation Parameter Estimates

                                         95% Confidence
 Variable          Mean     Std Error     Limit (Upper)         DF
 RunTime      10.603677      0.245727           11.1074     27.498
 RunPulse    170.400000      2.462460          175.5952     17.006

                                                   t for H0:
 Variable       Minimum        Maximum      Mu0     Mean=Mu0    Pr > |t|
 RunTime      10.558065      10.648387        0        43.15      <.0001
 RunPulse    168.967742     171.838710        0        69.20      <.0001
The following statements list the first ten observations of the data set outpscore.
proc print data=outpscore(obs=10);
   title 'First 10 Observations of the Imputed Data Set';
run;
Output 9.2.5.

                                      Run      Run
 Obs   _Imputation_     Oxygen       Time    Pulse
   1              1     44.609      11.37      178
   2              1     45.313      10.07      185
   3              1     54.297       8.65      156
   4              1     59.571       8.63      146
   5              1     49.874       9.22      156
   6              1     44.811      11.63      176
   7              1     45.681      11.95      176
   8              1     49.091      10.85      156
   9              1     39.442      13.08      174
  10              1     60.055       8.63      170
mu0= 50 10 150
The ROUND= option is used to round the imputed values to the same precision as
observed values. The values specified with the ROUND= option are matched with the
variables Oxygen, RunTime, and RunPulse in the order listed with the VAR statement. The MU0= option requests t tests for the hypotheses that the population means
corresponding to the variables in the VAR statement are Oxygen=50, RunTime=10,
and RunPulse=150.
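Only a fragment of the statements for this example survives above. A sketch of an invocation that matches the description might look as follows. The ROUND= values are an assumption inferred from the precision of the observed data (three decimals for Oxygen, two for RunTime, integers for RunPulse), the MU0= values are those named in the text, and OUT=outreg is the data set printed later in this example. No SEED= value is given because none is shown for this example, so results will differ from run to run.

```sas
/* Sketch only. ROUND= values are inferred, not taken from the     */
/* original program. MONOTONE METHOD=REG requests the regression   */
/* method for the monotone missing pattern in FitMono.             */
proc mi data=FitMono round=.001 .01 1 mu0=50 10 150 out=outreg;
   monotone method=reg;
   var Oxygen RunTime RunPulse;
run;
```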
The Missing Data Patterns table lists distinct missing data patterns with corresponding frequencies and percents. It is identical to the table in the previous example.
After the completion of five imputations by default, the Multiple Imputation Variance Information table displays the between-imputation variance, within-imputation
variance, and total variance for combining complete-data inferences. The relative increase in variance due to missingness and the fraction of missing information for
each variable are also displayed. These statistics are described in the Combining
Inferences from Multiply Imputed Data Sets section on page 173.
Output 9.3.1.  Variance Information

                        The MI Procedure

             -----------------Variance-----------------
 Variable         Between         Within          Total         DF
 RunTime         0.004443       0.068684       0.074016     25.294
 RunPulse        1.790531       4.045134       6.193770     11.846

                  Relative       Fraction
                  Increase        Missing
 Variable      in Variance    Information
 RunTime          0.077629       0.074435
 RunPulse         0.531166       0.382947
The Multiple Imputation Parameter Estimates table displays a 95% mean confidence interval and a t-statistic with its associated p-value for each of the hypotheses
requested with the MU0= option.
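The t statistic for each hypothesis is (Mean - Mu0) / Std Error, referred to a t distribution with the displayed degrees of freedom. The following DATA step is a sketch that recomputes the test and the 95% confidence limits from the values shown in Output 9.3.2. The data set and variable names are illustrative; PROBT and TINV are the DATA step functions for the t distribution function and its quantiles.

```sas
/* Sketch: t test of H0: Mean=Mu0 and 95% confidence limits from   */
/* the combined mean, standard error, and degrees of freedom.      */
/* Names (TTest, t, p, Lower, Upper) are illustrative assumptions. */
data TTest;
   length Variable $ 8;
   input Variable $ Mean StdErr DF Mu0;
   t     = (Mean - Mu0) / StdErr;               /* t for H0: Mean=Mu0   */
   p     = 2 * (1 - probt(abs(t), DF));         /* two-sided p-value    */
   Lower = Mean - tinv(0.975, DF) * StdErr;     /* lower 95% limit      */
   Upper = Mean + tinv(0.975, DF) * StdErr;     /* upper 95% limit      */
   datalines;
RunTime   10.575871 0.272059 25.294  10
RunPulse 170.425806 2.488729 11.846 150
;

proc print data=TTest noobs;
run;
```

For RunTime, t = 0.575871 / 0.272059, or about 2.12; for RunPulse, t = 20.425806 / 2.488729, or about 8.21.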
Example 9.4. MCMC Method
Output 9.3.2.  Parameter Estimates

                          The MI Procedure

               Multiple Imputation Parameter Estimates

                                         95% Confidence
 Variable          Mean     Std Error     Limit (Upper)         DF
 RunTime      10.575871      0.272059           11.1359     25.294
 RunPulse    170.425806      2.488729          175.8561     11.846

                                                          t for H0:
 Variable       Minimum        Maximum            Mu0     Mean=Mu0    Pr > |t|
 RunTime      10.506452      10.680968      10.000000         2.12      0.0443
 RunPulse    169.290323     171.935484     150.000000         8.21      <.0001
The following statements list the first ten observations of the data set outreg. Note
that the imputed values are rounded to the same precision as the observed values.
proc print data=outreg(obs=10);
   title 'First 10 Observations of the Imputed Data Set';
run;
Output 9.3.3.

                                      Run      Run
 Obs   _Imputation_     Oxygen       Time    Pulse
   1              1     44.609      11.37      178
   2              1     45.313      10.07      185
   3              1     54.297       8.65      156
   4              1     59.571       7.18      156
   5              1     49.874       9.22      192
   6              1     44.811      11.63      176
   7              1     45.681      11.95      176
   8              1     49.091      10.85      174
   9              1     39.442      13.08      174
  10              1     60.055       8.63      170
Model Information

                  The MI Procedure

                  Model Information

Data Set                            WORK.FITMISS
Method                              MCMC
Multiple Imputation Chain           Multiple Chains
Initial Estimates for MCMC          EM Posterior Mode
Start                               Starting Value
Prior                               Jeffreys
Number of Imputations               3
Number of Burn-in Iterations        200
Seed for random number generator    55417
With CHAIN=MULTIPLE, the procedure uses multiple chains and completes the default 200 burn-in iterations before each imputation. The 200 burn-in iterations are
used to make the iterations converge to the stationary distribution before the imputation.
By default, the procedure uses a noninformative Jeffreys prior to derive the posterior
mode from the EM algorithm as the starting values for the MCMC process.
The following Missing Data Patterns table lists distinct missing data patterns with
corresponding statistics.
Output 9.4.2.

                    Run      Run
 Group   Oxygen     Time     Pulse     Freq    Percent
     1   X          X        X           21      67.74
     2   X          X        .            4      12.90
     3   X          .        .            3       9.68
     4   .          X        X            1       3.23
     5   .          X        .            2       6.45

         -----------------Group Means----------------
 Group         Oxygen         RunTime        RunPulse
     1      46.353810       10.809524      171.666667
     2      47.109500       10.137500               .
     3      52.461667               .               .
     4              .       11.950000      176.000000
     5              .        9.885000               .
With the ITPRINT option in INITIAL=EM, the procedure also displays the EM
(Posterior Mode) Iteration History table.
Output 9.4.3.

                                  -2 Log
 _Iteration_     -2 Log L      Posterior        Oxygen       RunTime       RunPulse
           0   254.482800     282.909590     47.104086     10.554864     171.381796
           1   255.081159     282.051588     47.104079     10.554859     171.381708
           2   255.271405     282.017488     47.104077     10.554858     171.381669
           3   255.318621     282.015372     47.104002     10.554524     171.381853
           4   255.330259     282.015232     47.103861     10.554388     171.382058
           5   255.333160     282.015222     47.103797     10.554341     171.382152
           6   255.333896     282.015222     47.103774     10.554325     171.382186
           7   255.334085     282.015222     47.103766     10.554320     171.382197
With the DISPLAYINIT option in the MCMC statement, the following Initial Parameter Estimates for MCMC table displays the starting mean and covariance estimates used in MCMC. The same starting estimates are used for the MCMC process
for multiple chains because the EM algorithm is applied to the same data set in each
chain. You can explicitly specify different initial estimates for different imputations,
or you can use the bootstrap to generate different parameter estimates from the EM
algorithm for the MCMC process.
Output 9.4.4.

 _TYPE_   _NAME_          Oxygen       RunTime       RunPulse
 MEAN                  47.103766     10.554320     171.382197
 COV      Oxygen       24.549968     -5.726112     -15.926034
 COV      RunTime      -5.726112      1.781407       3.124798
 COV      RunPulse    -15.926034      3.124798      83.164044
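The statements for this example are not reproduced above. A sketch of an invocation consistent with the options described (CHAIN=MULTIPLE, DISPLAYINIT, and ITPRINT in INITIAL=EM) might look as follows. The seed and the number of imputations are taken from the Model Information table, and the MU0= values from the Mu0 column of the parameter estimates table; the exact form of the statements is an assumption.

```sas
/* Sketch only: MCMC with a separate chain for each imputation.    */
/* INITIAL=EM(ITPRINT) displays the EM (posterior mode) iteration  */
/* history; DISPLAYINIT displays the starting estimates for MCMC.  */
proc mi data=FitMiss seed=55417 nimpute=3 mu0=50 10 180;
   mcmc chain=multiple displayinit initial=em(itprint);
   var Oxygen RunTime RunPulse;
run;
```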
The following two tables display variance information and parameter estimates from
the multiple imputation.
Variance Information

                        The MI Procedure

             -----------------Variance-----------------
 Variable         Between         Within          Total         DF
 Oxygen          0.009200       0.987880       1.000148     27.778
 RunTime         0.002255       0.069112       0.072119     26.388
 RunPulse        0.043126       3.650388       3.707889     27.653

                  Relative       Fraction
                  Increase        Missing
 Variable      in Variance    Information
 Oxygen           0.012418       0.012414
 RunTime          0.043503       0.043351
 RunPulse         0.015752       0.015744
Output 9.4.6.  Parameter Estimates

                          The MI Procedure

               Multiple Imputation Parameter Estimates

                                         95% Confidence
 Variable          Mean     Std Error     Limit (Upper)         DF
 Oxygen       47.198228      1.000074           49.2475     27.778
 RunTime      10.510911      0.268549           11.0625     26.388
 RunPulse    172.113649      1.925588          176.0603     27.653

                                                          t for H0:
 Variable       Minimum        Maximum            Mu0     Mean=Mu0    Pr > |t|
 Oxygen       47.132351      47.308274      50.000000        -2.80      0.0092
 RunTime      10.456079      10.538446      10.000000         1.90      0.0681
 RunPulse    171.943144     172.344920     180.000000        -4.10      0.0003
Example 9.5.
Output 9.5.1.  Model Information

                  The MI Procedure

                  Model Information

Data Set                            WORK.FITMISS
Method                              Monotone-data MCMC
Multiple Imputation Chain           Single Chain
Initial Estimates for MCMC          EM Posterior Mode
Start                               Starting Value
Prior                               Jeffreys
Number of Imputations               5
Number of Burn-in Iterations        200
Number of Iterations                100
Seed for random number generator    55417
The following Missing Data Patterns table lists distinct missing data patterns with
corresponding statistics. Here, an "X" means that the variable is observed in the
corresponding group, a "." means that the variable is missing and will be imputed to
achieve the monotone missingness for the imputed data set, and an "O" means that
the variable is missing and will not be imputed. The table also displays group-specific
variable means.
Output 9.5.2.

                    Run      Run
 Group   Oxygen     Time     Pulse     Freq    Percent
     1   X          X        X           21      67.74
     2   X          X        O            4      12.90
     3   X          O        O            3       9.68
     4   .          X        X            1       3.23
     5   .          X        O            2       6.45

         -----------------Group Means----------------
 Group         Oxygen         RunTime        RunPulse
     1      46.353810       10.809524      171.666667
     2      47.109500       10.137500               .
     3      52.461667               .               .
     4              .       11.950000      176.000000
     5              .        9.885000               .
Output 9.5.3.

                    Run      Run
 Group   Oxygen     Time     Pulse     Freq    Percent
     1   X          X        X           22      70.97
     2   X          X        .            6      19.35
     3   X          .        .            3       9.68

         -----------------Group Means----------------
 Group         Oxygen         RunTime        RunPulse
     1      46.307744       10.861364      171.863636
     2      46.372151       10.053333               .
     3      52.461667               .               .
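The statements that create the monotone missingness data set are not reproduced above. A sketch of an invocation consistent with the Model Information table (method "Monotone-data MCMC", five imputations, seed 55417) might look as follows. The output data set name outmono is the one used in the statements that follow; the exact form of the original statements is an assumption.

```sas
/* Sketch only: use MCMC to impute just enough missing values to   */
/* give each imputed data set a monotone missing pattern.          */
/* NIMPUTE= is omitted, so the default of five imputations applies.*/
proc mi data=FitMiss seed=55417 out=outmono;
   mcmc impute=monotone;
   var Oxygen RunTime RunPulse;
run;
```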
The following statements impute one value for each missing value in the monotone
missingness data set outmono. The variable _Imputation_ is renamed to Impute
so that it will not be overwritten by the new variable _Imputation_ being created
in the MI procedure.
proc mi data=outmono(rename=(_Imputation_=Impute))
        nimpute=1 seed=43672
        out=outds(rename=(Impute=_Imputation_) drop=_Imputation_);
   monotone method=reg;
   var Oxygen RunTime RunPulse;
   by Impute;
run;
Example 9.6.
The variable Impute is renamed to _Imputation_ in the output data set outds. This
makes the output data set have the same structure as output data sets generated from
other imputation methods. You can then analyze these data sets by using other SAS
procedures and combine these results by using the MIANALYZE procedure. Note
that the VAR statement is required with a MONOTONE statement to provide the
variable order for the monotone missing pattern.
By default, the MI procedure uses the plus sign (+) as the plot symbol to display
the points in the plot. The time-series plot shows no apparent trends for the variable
Oxygen.
By default, the MI procedure uses the star sign (*) as the plot symbol to display the
points in the plot, a solid line to display the reference line of zero autocorrelation,
and a pair of dashed lines to display approximately 95% confidence limits for the
autocorrelations. The autocorrelation function plot shows no significant positive or
negative autocorrelation.
The following statements use display options to modify the autocorrelation function
plot for Oxygen.
proc mi data=FitMiss seed=37921 noprint nimpute=2;
   mcmc acfplot(mean(Oxygen) / symbol=dot lref=2);
   var Oxygen RunTime RunPulse;
run;
Output 9.6.3.
You can also create plots for the worst linear function, the means of other variables,
the variances of variables, and covariances between variables. Alternatively, you can
use the OUTITER option to save statistics such as the means, standard deviations,
covariances, -2 log LR statistic, -2 log LR statistic of the posterior mode, and worst
linear function from each iteration in an output data set. Then you can do a more
in-depth time-series analysis of the iterations with other procedures, such as PROC
AUTOREG and PROC ARIMA in the SAS/ETS User's Guide, Version 8.
The following Missing Data Patterns table lists distinct missing data patterns with
corresponding statistics for the FitMiss data. Note that the values of Oxygen shown
in the tables are transformed values.
Output 9.7.1.

                    Run      Run
 Group   Oxygen     Time     Pulse     Freq    Percent
     1   X          X        X           21      67.74
     2   X          X        .            4      12.90
     3   X          .        .            3       9.68
     4   .          X        X            1       3.23
     5   .          X        .            2       6.45

         -----------------Group Means----------------
 Group         Oxygen         RunTime        RunPulse
     1       3.829760       10.809524      171.666667
     2       3.851813       10.137500               .
     3       3.955298               .               .
     4              .       11.950000      176.000000
     5              .        9.885000               .
Example 9.7.
Transformation to Normality
The following Variable Transformations table lists the variables that have been
transformed.
Output 9.7.2.

              _Transform_
 Oxygen       LOG
The following Initial Parameter Estimates for MCMC table displays the starting
mean and covariance estimates used in the MCMC process.
Output 9.7.3.

 _TYPE_   _NAME_          Oxygen       RunTime       RunPulse
 MEAN                   3.846122     10.557605     171.382949
 COV      Oxygen        0.010827     -0.120891      -0.328772
 COV      RunTime      -0.120891      1.744580       3.011179
 COV      RunPulse     -0.328772      3.011179      82.747608
Variance Information

                        The MI Procedure

              Multiple Imputation Variance Information

             -----------------Variance-----------------
 Variable         Between         Within          Total         DF
 * Oxygen     0.000004541       0.000398       0.000404     27.766
 RunTime         0.000814       0.063128       0.064105     27.708
 RunPulse        0.182700       3.498974       3.718214     25.923

 * Transformed Variables

              Multiple Imputation Variance Information

                  Relative       Fraction
                  Increase        Missing
 Variable      in Variance    Information
 * Oxygen         0.013685       0.013590
 RunTime          0.015478       0.015356
 RunPulse         0.062658       0.060595

 * Transformed Variables
The following table displays parameter estimates from the multiple imputation. Note
that the parameter value of Mu0 has also been transformed using the logarithmic
transformation.
Output 9.7.5.  Parameter Estimates

                          The MI Procedure

               Multiple Imputation Parameter Estimates

                                         95% Confidence
 Variable          Mean     Std Error     Limit (Upper)         DF
 * Oxygen      3.845991      0.020091            3.8872     27.766
 RunTime      10.586242      0.253190           11.1051     27.708
 RunPulse    170.849654      1.928267          174.8138     25.923

 * Transformed Variables

               Multiple Imputation Parameter Estimates

                                                          t for H0:
 Variable       Minimum        Maximum            Mu0     Mean=Mu0    Pr > |t|
 * Oxygen      3.843860       3.848775       3.912023        -3.29      0.0028
 RunTime      10.547440      10.616746      10.000000         2.32      0.0282
 RunPulse    170.315955     171.324638     180.000000        -4.75      <.0001

 * Transformed Variables
The following statements list the first ten observations of the data set outmi. Note
that the values for Oxygen are in the original scale.
proc print data=outmi(obs=10);
   title 'First 10 Observations of the Imputed Data Set';
run;
Output 9.7.6.

                                                   Run
 Obs   _Imputation_      Oxygen     RunTime      Pulse
   1              1     44.6090     11.3700    178.000
   2              1     45.3130     10.0700    185.000
   3              1     54.2970      8.6500    156.000
   4              1     59.5710      8.4840    155.503
   5              1     49.8740      9.2200    166.031
   6              1     44.8110     11.6300    176.000
   7              1     43.4130     11.9500    176.000
   8              1     44.6435     10.8500    173.761
   9              1     39.4420     13.0800    174.000
  10              1     60.0550      8.6300    170.000
The preceding results can also be produced from the following statements without
using a TRANSFORM statement.
data temp;
   set FitMiss;
   LogOxygen = log(Oxygen);
run;

proc mi data=temp seed=37921 mu0=3.91202 10 180 out=outtemp;
   mcmc chain=multiple displayinit;
   var LogOxygen RunTime RunPulse;
run;

data outmi;
   set outtemp;
   Oxygen = exp(LogOxygen);
run;
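For comparison, a sketch of the form that uses the TRANSFORM statement is shown below. The original statements for this example are not reproduced in this section, so the details are assumptions: the seed is copied from the statements above, and MU0= is given on the original scale as 50, which the procedure would then transform, since log(50) = 3.912023 matches the Mu0 value displayed for Oxygen in Output 9.7.5.

```sas
/* Sketch only: request a log transformation of Oxygen before      */
/* imputation. The imputed values of Oxygen in OUT=outmi are       */
/* returned in the original scale, as noted for Output 9.7.6.      */
proc mi data=FitMiss seed=37921 mu0=50 10 180 out=outmi;
   transform log(Oxygen);
   mcmc chain=multiple displayinit;
   var Oxygen RunTime RunPulse;
run;
```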
The following statements list the parameters used for the imputations. Note that the
data set includes observations with _TYPE_=SEED containing the seed to start the
next random number generator.
proc print data=miest;
   title 'Parameters for the Imputations';
run;
Output 9.8.1.

 _Imputation_   _TYPE_   _NAME_            Oxygen          RunTime         RunPulse
            1   SEED                2099769086.00    2099769086.00    2099769086.00
            1   PARM                        49.31            10.00           172.19
            1   COV      Oxygen             32.05            -7.47           -28.32
            1   COV      RunTime            -7.47             2.41             6.75
            1   COV      RunPulse          -28.32             6.75           128.61
            2   SEED                 419117425.00     419117425.00     419117425.00
            2   PARM                        47.49            10.43           171.58
            2   COV      Oxygen             41.02            -8.60           -34.29
            2   COV      RunTime            -8.60             2.25             7.61
            2   COV      RunPulse          -34.29             7.61           142.94
            3   SEED                 535522494.00     535522494.00     535522494.00
            3   PARM                        45.98            10.82           172.45
            3   COV      Oxygen             43.24            -9.90             8.14
            3   COV      RunTime            -9.90             2.75            -2.72
            3   COV      RunPulse            8.14            -2.72           218.32
The following statements invoke the MI procedure and use the INEST= option in the
MCMC statement.
proc mi data=FitMiss;
   mcmc inest=miest;
   var Oxygen RunTime RunPulse;
run;
Output 9.8.2.  Model Information

                  The MI Procedure

                  Model Information

Data Set                            WORK.FITMISS
Method                              MCMC
INEST Data Set                      WORK.MIEST
Number of Imputations               3
The remaining tables for the example are identical to the tables in Example 9.4.
References
Anderson, T.W. (1984), An Introduction to Multivariate Statistical Analysis, Second
Edition, New York: John Wiley & Sons, Inc.
Allison, P.D. (2000), "Multiple Imputation for Missing Data: A Cautionary Tale,"
Sociological Methods and Research, 28, 301-309.
Barnard, J. and Rubin, D.B. (1999), "Small-Sample Degrees of Freedom with Multiple Imputation," Biometrika, 86, 948-955.
Dempster, A.P., Laird, N.M., and Rubin, D.B. (1977), "Maximum Likelihood from
Incomplete Data via the EM Algorithm," Journal of the Royal Statistical Society,
Ser. B., 39, 1-38.
Gelman, A. and Rubin, D.B. (1992), "Inference from Iterative Simulation Using Multiple Sequences," Statistical Science, 7, 457-472.
Goodnight, J.H. (1979), "A Tutorial on the Sweep Operator," American Statistician,
33, 149-158.
Lavori, P.W., Dawson, R., and Shera, D. (1995), "A Multiple Imputation Strategy
for Clinical Trials with Truncation of Patient Data," Statistics in Medicine, 14,
1913-1925.
Li, K.H. (1988), "Imputation Using Markov Chains," Journal of Statistical Computation and Simulation, 30, 57-79.
Li, K.H., Raghunathan, T.E., and Rubin, D.B. (1991), "Large-Sample Significance
Levels from Multiply Imputed Data Using Moment-Based Statistics and an F
Reference Distribution," Journal of the American Statistical Association, 86,
1065-1073.
Little, R.J.A. and Rubin, D.B. (1987), Statistical Analysis with Missing Data, New
York: John Wiley & Sons, Inc.
Liu, C. (1993), "Bartlett's Decomposition of the Posterior Distribution of the Covariance for Normal Monotone Ignorable Missing Data," Journal of Multivariate
Analysis, 46, 198-206.
McLachlan, G.J. and Krishnan, T. (1997), The EM Algorithm and Extensions, New
York: John Wiley & Sons, Inc.