
Appendix: Ps Matching in R: (With Attached Dataset and Code)

The document describes using propensity score matching in R to estimate the effect of ever smoking on health outcomes using a large national health survey dataset. It discusses estimating the propensity score using both logistic regression and boosted classification and regression trees, attempting several different matching strategies with varying levels of balance achieved, and settling on a final matched sample using boosted CART estimates, exact matching on sex, discarding untreated and treated units beyond the region of common support, and imposing a caliper of 0.2 standard deviations on the distance measure.
APPENDIX: PS MATCHING IN R
(with attached dataset and code)

Brian Lee


Assistant Professor
Dept. of Epidemiology and Biostatistics
Drexel University School of Public Health

SPER workshop June 17, 2013


2

Example 1
• 1987 National Medical Expenditures Survey
• Persons 40+ with complete covariate data
• Exposure: ever smoking
• Control: never smoking
• Outcome: lung cancer, laryngeal cancer, or COPD
• N = 11,587

• What is the effect of ever smoking on odds of lung cancer / laryngeal cancer / or COPD, as compared with never smoking?
3

Variables
• eversmk (exposure): 1/0 ever smoker / never smoker
• lc5 (outcome): lung / laryngeal CA / COPD
• LASTAGE: age, in years
• MALE: sex, 1/0 male / female
• RACE3: race, other / African American / Caucasian
• beltuse: seatbelt use, rare / some / always
• educate: college grad / some college / HS grad / other
• marital: married / widowed / divorced / separated / never married
• SREGION: census region, NE / MW / S / W
• POVSTALB: poverty status, poor / near poor / low income / middle income / high income
4

Selected sample characteristics


                     ALL           Ever Smoke          Never Smoke
                     N = 11,587    N = 6,564 (56.7%)   N = 5,023 (43.3%)
Cancer/COPD          1.9%          3.0%                0.5%
Age                  60.2 years    59.1 years          61.7 years
Male                 43.4%         55.4%               27.7%
Caucasian            78.1%         79.8%               75.8%
Rare use seatbelt    24.0%         26.0%               21.5%
College grad         14.2%         13.2%               15.4%
From South           36.8%         35.2%               38.9%
Poor                 10.7%         10.2%               11.4%
5

1. Estimate the PS
• Goal: to achieve covariate balance on confounders so that they cannot bias results

• Take the observed values of treatment (1/0) and build a model that predicts treatment using covariates k as predictors
• Typically a parametric model, e.g., logistic regression, is used
• ps.model <- glm(treat ~ cov1 + cov2 + cov1*cov2 + cov3, data=dataset, family=binomial(link="logit"))
• dataset$PS <- predict(ps.model, dataset, type="response")

• Take as much care in building the PS model as you would an outcome model
• Misspecification of the PS model can result in bias (although not as
much as if you misspecify the outcome model)
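
As a rough illustration of the estimation step above, the following sketch fits a logistic PS model to simulated data. The data frame, the covariate names (cov1, cov2, cov3) and the coefficients are all made up for the example; only glm() and predict() mirror the calls on the slide.

```r
# Simulated data; every variable name and coefficient here is hypothetical
set.seed(1)
n <- 500
dataset <- data.frame(cov1 = rnorm(n), cov2 = rbinom(n, 1, 0.4), cov3 = runif(n))

# Assumed true treatment model, including a cov1 x cov2 interaction
lp <- -0.3 + 0.8 * dataset$cov1 + 0.5 * dataset$cov2 - 0.6 * dataset$cov1 * dataset$cov2
dataset$treat <- rbinom(n, 1, plogis(lp))

# cov1 * cov2 expands to both main effects plus their interaction
ps.model <- glm(treat ~ cov1 * cov2 + cov3, data = dataset,
                family = binomial(link = "logit"))

# type = "response" returns predicted probabilities of treatment, i.e. the PS
dataset$PS <- predict(ps.model, dataset, type = "response")

# Compare the PS distribution between treated and untreated units
ps.by.group <- tapply(dataset$PS, dataset$treat, summary)
print(ps.by.group)
```

Looking at the PS distribution by treatment group at this stage gives an early sense of how much overlap there is before any matching is attempted.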
6

1. Estimate the PS: Variable selection


• Begin with a DAG
• No post-treatment variables!
• Include variables that predict both treatment and outcome
• Little cost in including variables not related to the treatment but
related to the outcome
• Exclude variables that are strong predictors of treatment with
no obvious relation to the outcome
• Excluding a potentially important confounder can be costly in
terms of bias
• PS analyses have included from a few to over 100 covariates

• General recommendation: theory-driven approach for variable selection
7

1. Estimate the PS: Variable parameterization


• Much like an outcome model, improper parameterization
of a variable can result in residual confounding

• Use splines, polynomial terms, and interactions

• Still, misestimation of the propensity score is not a large problem as long as balance is obtained (e.g., exclusion of interactions or squares may be less severe for a PS model than for an outcome model) (Stuart 2010)
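
One way to try the parameterizations mentioned above is sketched below on simulated data, where treatment depends non-linearly on age by construction. The variable names and the data-generating model are assumptions made for the example, and AIC is used only as a rough guide; covariate balance after matching remains the real criterion.

```r
library(splines)  # ships with base R; provides ns() for natural cubic splines

set.seed(2)
n <- 1000
d <- data.frame(age = runif(n, 40, 90), male = rbinom(n, 1, 0.45))

# Hypothetical truth: probability of treatment peaks around age 60
d$treat <- rbinom(n, 1, plogis(0.5 - ((d$age - 60) / 15)^2 + 0.8 * d$male))

ps.linear <- glm(treat ~ age + male, data = d, family = binomial)
ps.poly   <- glm(treat ~ poly(age, 2) + male, data = d, family = binomial)
ps.spline <- glm(treat ~ ns(age, df = 4) * male, data = d, family = binomial)  # spline plus interaction

# Lower AIC suggests a better-fitting parameterization of age
aic <- AIC(ps.linear, ps.poly, ps.spline)
print(aic)
```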
8

Machine learning
• Can use machine learning methods to estimate the PS (Westreich 2010)
• E.g., neural nets, classification and regression trees (CART)

• Random forests and boosted CART work well for this (Lee et al. 2010)

• No need to specify non-linearities, interactions, etc.: these methods handle them automatically, just list the variables you want
9

Machine learning
• R: To implement boosted CART:
• library(twang)
• ps.model <- ps(treatment ~ LASTAGE + MALE + educate + POVSTALB, data=data)
• data$PS <- ps.model$ps[, 1]

• R: To implement random forests:
• library(randomForest)
• ps.model <- randomForest(factor(treatment) ~ LASTAGE + MALE + educate + POVSTALB, data=data)
• data$PS <- ps.model$votes[, 2]
• Note: the treatment must be a factor here so that a classification forest is grown and $votes is available
10

1. Estimate the PS: missing data

• Observations with missing PS covariate data will be excluded from the PS model, and therefore also further analyses (i.e., the outcome model)

• Multiple imputation can be used to fill in missing data to estimate propensity scores but this has not been well-evaluated (Hayes 2008)
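
The exclusion described above can be seen directly in a small simulated example. The sketch below (with made-up variables x1 and x2) shows that glm() silently drops rows with a missing covariate and that predict() returns NA for those rows, so they have no PS to match on.

```r
set.seed(3)
n <- 200
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
d$treat <- rbinom(n, 1, plogis(0.5 * d$x1 - 0.5 * d$x2))
d$x2[sample(n, 20)] <- NA   # make 20 covariate values missing

ps.model <- glm(treat ~ x1 + x2, data = d, family = binomial)

n.used     <- nobs(ps.model)                                    # rows actually used in the fit
n.complete <- sum(complete.cases(d[, c("treat", "x1", "x2")]))  # rows with no missing values

# predict() on the full data returns NA wherever a covariate is missing
d$PS <- predict(ps.model, d, type = "response")
n.missing.ps <- sum(is.na(d$PS))

print(c(used = n.used, complete = n.complete, missing.ps = n.missing.ps))
```

Checking these counts before matching makes it explicit how many observations are lost to missing covariate data.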
11

Estimating the PS
Model 1: Logistic regression model

ps.model1 <- glm(eversmk ~ LASTAGE + MALE + educate + beltuse +
POVSTALB + marital + SREGION, data=a,
family=binomial(link="logit"))
odds1 <- exp(predict(ps.model1, a))
a$PS <- odds1/(1+odds1)

Model 2: Boosted CART model

library(twang)
ps.model2 <- ps(eversmk ~ LASTAGE + MALE + educate + beltuse +
POVSTALB + marital + RACE3 + SREGION, data=a)
a$PS2 <- ps.model2$ps[, 1]
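
Model 1 converts predicted log-odds to probabilities by hand. The sketch below, on simulated data with a made-up covariate x, checks that this gives the same PS as asking predict() for probabilities directly or applying plogis() to the linear predictor.

```r
set.seed(4)
n <- 300
a.sim <- data.frame(x = rnorm(n))   # hypothetical stand-in for the real dataset
a.sim$treat <- rbinom(n, 1, plogis(0.7 * a.sim$x))
m <- glm(treat ~ x, data = a.sim, family = binomial(link = "logit"))

# Route 1: predict() defaults to the log-odds (link) scale; convert by hand
odds <- exp(predict(m, a.sim))
ps.manual <- odds / (1 + odds)

# Route 2: ask predict() for probabilities directly
ps.response <- predict(m, a.sim, type = "response")

# Route 3: inverse logit of the linear predictor
ps.plogis <- plogis(predict(m, a.sim))

max.diff <- max(abs(ps.manual - ps.response), abs(ps.manual - ps.plogis))
print(max.diff)
```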


12

Is age linearly associated with ever smoking?
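
One simple way to probe this question is to tabulate the proportion of ever smokers by age band and to test whether a quadratic age term improves on a linear one. The sketch below uses simulated data with an assumed non-linear age effect; it does not reproduce the survey data or the original figure.

```r
set.seed(5)
n <- 2000
age <- round(runif(n, 40, 90))
# Hypothetical non-linear truth on the logit scale
smk <- rbinom(n, 1, plogis(1 - ((age - 55) / 20)^2))

# Proportion of ever smokers within 10-year age bands
age.band <- cut(age, breaks = seq(40, 90, by = 10), include.lowest = TRUE)
prop.by.band <- tapply(smk, age.band, mean)
print(round(prop.by.band, 2))

# Likelihood ratio test: does a quadratic term improve on a linear age term?
fit.lin  <- glm(smk ~ age, family = binomial)
fit.quad <- glm(smk ~ age + I(age^2), family = binomial)
lrt <- anova(fit.lin, fit.quad, test = "Chisq")
p.value <- lrt[2, "Pr(>Chi)"]
print(p.value)
```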


13

Matching attempt #1
• For logistic regression estimated PS (Model 1)…

• Let’s try nearest neighbor 1:1 matching, greedy, no discarding of treated or controls

• library(MatchIt)
• nn1 <- matchit(eversmk ~ LASTAGE + MALE + educate + beltuse +
POVSTALB + marital + RACE3 + SREGION, data=a, distance=a$PS,
method="nearest")
• nn1.data <- match.data(nn1)
• summary(nn1, standardize=T)

Note on distance: the distance option is not always necessary; if the option is left out, MatchIt can automatically calculate the PS based on the linear model as shown. For more complex PS models, e.g., with nonlinearities and such, estimate the PS beforehand and specify the resulting PS as the distance measure, as shown here.

Note on summary: standardize in order to show standardized balance.
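
To make "greedy 1:1 nearest neighbor" concrete, here is a minimal base R sketch of the idea on six made-up units. It is not MatchIt's implementation; the function name greedy.match and the choice to process treated units from highest to lowest PS are assumptions made for the illustration.

```r
# Greedy 1:1 nearest-neighbor matching on the PS, without replacement.
# Assumption: treated units are processed from highest to lowest PS.
greedy.match <- function(ps, treat) {
  treated  <- which(treat == 1)
  controls <- which(treat == 0)
  treated  <- treated[order(ps[treated], decreasing = TRUE)]
  pairs <- matrix(NA_integer_, nrow = 0, ncol = 2,
                  dimnames = list(NULL, c("treated", "control")))
  for (i in treated) {
    if (length(controls) == 0) break                 # no controls left to match
    j <- controls[which.min(abs(ps[controls] - ps[i]))]
    pairs <- rbind(pairs, c(i, j))
    controls <- setdiff(controls, j)                 # each control is used at most once
  }
  pairs
}

# Toy example with hypothetical PS values
ps    <- c(0.90, 0.60, 0.30, 0.85, 0.55, 0.20)
treat <- c(1,    1,    1,    0,    0,    0)
pairs <- greedy.match(ps, treat)
print(pairs)
```

Because the procedure is greedy, an early treated unit can take a control that would have been a better match for a later one, which is one reason balance should always be checked after matching.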
14

Balance, matching attempt #1


• BEFORE matching: ASAM = 0.103
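
For reference, an ASAM-style summary can be computed by hand as in the sketch below. It assumes ASAM means the average, across covariates, of the absolute difference in group means divided by the treated-group standard deviation; the function name asam and the simulated covariates are made up for the example.

```r
# Average standardized absolute mean difference (ASAM) across covariates.
# Assumption: each difference in means is divided by the treated-group SD.
asam <- function(X, treat) {
  X <- as.matrix(X)
  diff.means <- colMeans(X[treat == 1, , drop = FALSE]) -
                colMeans(X[treat == 0, , drop = FALSE])
  sd.treated <- apply(X[treat == 1, , drop = FALSE], 2, sd)
  mean(abs(diff.means / sd.treated))
}

set.seed(6)
n <- 1000
treat <- rbinom(n, 1, 0.5)
X <- data.frame(
  age  = rnorm(n, mean = 60 + 3 * treat, sd = 10),  # imbalanced by construction
  male = rbinom(n, 1, 0.3 + 0.2 * treat)            # imbalanced by construction
)

# Factor covariates would first need expanding into dummies, e.g. with model.matrix()
asam.value <- asam(X, treat)
print(asam.value)
```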
15

Balance, matching attempt #1


• AFTER matching: ASAM = 0.163 (worse!)

N’s            C       T
All            5023    6564
Matched        5023    5023
Not matched    0       1541
Discarded      0       0
16

So let’s try something new


• Problems with balance in age so let’s try the boosted CART model that should more accurately model the age – ever smoking relationship

• Because balance on gender was really bad, let’s try exact matching on it

• And discard treated and controls beyond the range of overlap
17

Matching attempt #2
• For boosted CART estimated PS (Model 2)…

• Nearest neighbor 1:1 matching, greedy, discarding treated or controls beyond PS overlap, exact matching on sex

• library(MatchIt)
• nn2 <- matchit(eversmk ~ LASTAGE + MALE + educate + beltuse +
POVSTALB + marital + RACE3 + SREGION, data=a, distance=a$PS2,
method="nearest", exact="MALE", discard="both")
• nn2.data <- match.data(nn2)
• summary(nn2, standardize=T)
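
The two additions in this attempt can be sketched in base R as follows. The sketch assumes that the region of common support is the overlap between the PS ranges of the two groups, and that exact matching on sex amounts to matching separately within each sex stratum; the function name outside.support and the six toy units are made up.

```r
# Flag units whose PS lies outside the range shared by both groups.
# Assumption: common support = overlap of the treated and control PS ranges.
outside.support <- function(ps, treat) {
  lower <- max(min(ps[treat == 1]), min(ps[treat == 0]))
  upper <- min(max(ps[treat == 1]), max(ps[treat == 0]))
  ps < lower | ps > upper
}

# Toy example with hypothetical values
ps    <- c(0.95, 0.70, 0.40, 0.80, 0.50, 0.10)
treat <- c(1,    1,    1,    0,    0,    0)
male  <- c(1,    0,    1,    1,    0,    0)

discard <- outside.support(ps, treat)

# Exact matching on sex: matching then proceeds separately within each stratum
kept   <- data.frame(id = seq_along(ps), ps, treat, male)[!discard, ]
strata <- split(kept, kept$male)
print(strata)
```

In this toy example the treated unit with PS 0.95 and the control with PS 0.10 fall outside the shared range and are dropped before matching.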
18

Balance, matching attempt #2

• AFTER matching: ASAM = 0.086: OK, but let’s do better

N’s            C       T
All            5023    6564
Matched        4320    4320
Not matched    677     2213
Discarded      26      31
19

• There are many treated units with high PS that don’t seem to have good control matches that also have high PS

• Solution: try a caliper

20

Matching attempt #3
• For boosted CART estimated PS (Model 2)…

• Nearest neighbor 1:1 matching, greedy, discarding treated or controls beyond PS overlap, exact matching on sex, caliper of 0.2 SD of the distance measure

• library(MatchIt)
• nn3 <- matchit(eversmk ~ LASTAGE + MALE + educate + beltuse +
POVSTALB + marital + RACE3 + SREGION, data=a, distance=a$PS2,
method="nearest", exact="MALE", discard="both", caliper=0.2)
• nn3.data <- match.data(nn3)
• summary(nn3, standardize=T)
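
The effect of a caliper can be illustrated with a small base R sketch. It is not MatchIt's implementation: the function name caliper.match, the use of the SD of the PS over all units, and the highest-PS-first ordering are assumptions made for the example.

```r
# Greedy 1:1 matching with a caliper expressed in SD units of the distance measure.
# Assumptions: the SD is taken over all units' PS, and treated units are
# processed from highest to lowest PS.
caliper.match <- function(ps, treat, caliper = 0.2) {
  width    <- caliper * sd(ps)          # maximum allowed PS difference within a pair
  treated  <- which(treat == 1)
  treated  <- treated[order(ps[treated], decreasing = TRUE)]
  controls <- which(treat == 0)
  matched  <- integer(0)
  for (i in treated) {
    if (length(controls) == 0) break
    d <- abs(ps[controls] - ps[i])
    if (min(d) <= width) {              # otherwise the treated unit stays unmatched
      j <- controls[which.min(d)]
      matched  <- c(matched, i, j)
      controls <- setdiff(controls, j)
    }
  }
  matrix(matched, ncol = 2, byrow = TRUE,
         dimnames = list(NULL, c("treated", "control")))
}

# Toy example: the treated unit with PS 0.90 has no control within the caliper
ps    <- c(0.90, 0.60, 0.30, 0.58, 0.31, 0.20)
treat <- c(1,    1,    1,    0,    0,    0)
pairs <- caliper.match(ps, treat, caliper = 0.2)
print(pairs)
```

The caliper trades sample size for match quality: treated units without a close control are left unmatched rather than paired with a poor match.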
21

[Figure panels: Match attempt #2, Match attempt #3]

Going from #2 to #3: the caliper makes the controls look more similar to the treated according to the PS
22

Balance, matching attempt #3

• AFTER matching: ASAM = 0.027: excellent!

N’s            C       T
All            5023    6564
Matched        4075    4075
Not matched    922     2458
Discarded      26      31
23

Estimate treatment effect

• In PS matched dataset (from match attempt #3), fit the outcome model

• m1 <- glm(lc5 ~ eversmk, data=nn3.data, family=binomial(link="logit"))
• summary(m1)

• OR: 7.17 (95% CI: 4.27, 7.25)


• What does this estimate mean?

• To guard against residual confounding, may be a good idea to adjust for covariates in the outcome model
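
summary(m1) reports the exposure coefficient on the log-odds scale. The sketch below shows one way to turn such a coefficient into an odds ratio with a Wald 95% confidence interval; it uses a simulated stand-in for the matched dataset with made-up outcome risks, so the numbers it prints are unrelated to the results above.

```r
set.seed(10)
n <- 4000
# Hypothetical matched sample: equal numbers of treated and control units
matched <- data.frame(eversmk = rep(c(1, 0), each = n / 2))
matched$lc5 <- rbinom(n, 1, ifelse(matched$eversmk == 1, 0.06, 0.02))

m1.sim <- glm(lc5 ~ eversmk, data = matched, family = binomial(link = "logit"))

# Exponentiate the log-odds coefficient and its Wald interval to get the OR and 95% CI
or    <- exp(coef(m1.sim)["eversmk"])
or.ci <- exp(confint.default(m1.sim)["eversmk", ])
print(round(c(OR = unname(or), or.ci), 2))
```

Adjusting for covariates would only require adding them to the right-hand side of the formula; the exponentiation step stays the same.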
24

References
• Austin PC, Mamdani MM. A comparison of propensity score methods: a case-study estimating the effectiveness of
post-AMI statin use. Statistics in medicine. 2006;25(12):2084-2106.
• Brookhart MA, Schneeweiss S, Rothman KJ, Glynn RJ, Avorn J, Sturmer T. Variable selection for propensity score
models. Am J Epidemiol. 2006 Jun 15;163(12):1149-56.
• Cole SR, Hernan MA. Constructing inverse probability weights for marginal structural models. Am J Epidemiol. 2008
Sep 15;168(6):656-64.
• Hayes JR, Groner JI. Using multiple imputation and propensity scores to test the effect of car seats and seat belt
usage on injury severity from trauma registry data. J Pediatr Surg. 2008;43(5):924-927.
• Howe CJ, Cole SR, Westreich DJ, Greenland S, Napravnik S, Eron JJ, Jr. Splines for trend analysis and continuous
confounder control. Epidemiology. 2011;22(6):874-875.
• Imai K, van Dyk DA. Causal inference with general treatment regimes: generalizing the propensity score. Journal of
the American Statistical Association. 2004;99(467):854-866.
• Lee BK, Lessler J, Stuart EA. Improving propensity score weighting using machine learning. Statistics in medicine.
2010;29(3):337-346.
• Lee BK, Lessler J, Stuart EA. Weight trimming and propensity score weighting. PLoS One. 2011;6(3):e18174.
• Mansson R, Joffe MM, Sun W, Hennessy S. On the estimation and use of propensity scores in case-control and
case-cohort studies. Am J Epidemiol. 2007 Aug 1;166(3):332-9.
• Stuart EA. Matching methods for causal inference: A review and a look forward. Stat Sci. 2010 Feb 1;25(1):1-21.
• Westreich D, Lessler J, Funk MJ. Propensity score estimation: neural networks, support vector machines, decision
trees (CART), and meta-classifiers as alternatives to logistic regression. J Clin Epidemiol. 2010 Aug;63(8):826-33.
• Westreich D, Cole SR, Funk MJ, Brookhart MA, Sturmer T. The role of the c-statistic in variable selection for
propensity score models. Pharmacoepidemiol Drug Saf. 2010 Dec 9.
