
Losing control (group)?

The Machine Learning Control Method for counterfactual forecasting*

Augusto Cerqua†, Marco Letta‡, Fiammetta Menchetti§

First version: December 30, 2022


This version: July 27, 2023

Abstract Without a control group, the most widespread counterfactual methodologies for causal
panel analysis cannot be applied. We fill this gap with the Machine Learning Control Method
(MLCM), a new technique based on counterfactual forecasting via machine learning. The MLCM is
able to estimate several policy-relevant causal parameters in short- and long-panel settings without
control units. After formalizing the method within Rubin’s Potential Outcomes Model, we present
simulation evidence and an illustrative application on the impacts of the COVID-19 crisis on income
inequality in Italian local labor markets. We implement the proposed method in the companion R
package MachineControl.

Keywords: counterfactual forecasting, machine learning, no control group, short panels, panel cross-
validation, Rubin Causal Model

JEL-Codes: C13, C18, C53

* We are grateful to Guido Imbens and Fabrizia Mealli for thoughtful conversations and discussions. We also thank Andrea Albanese,
Alessio D’Ignazio, Christina Gatmann, Anna Gottard, Martin Huber, Michael Knaus, Alessandra Mattei, Giovanni Mellace, Andrea
Mercatanti, Jason Poulos, Donato Romano, Jacques-François Thisse, Luca Tiberti, Giuseppe Ragusa and Giuliano Resce for valuable
comments on earlier versions of this work. The paper has also benefited from many helpful comments and suggestions by audiences
at the 2023 LISER International Workshop on “Machine Learning in Program Evaluation, High-dimensionality and Visualization
Techniques”, the Oslo 2023 European Causal Inference Meeting, the AISRe, COMPIE, and SIE 2022 annual conferences and seminar
participants at the Florence Center for Data Science and Sapienza University of Rome. We thank Gabriele Pinto and Federico Rucci
for helpful comments on the implementation.
† Department of Social Sciences and Economics, Sapienza University of Rome, Italy. Email: [email protected]
‡ Department of Social Sciences and Economics, Sapienza University of Rome, Italy. Email: [email protected]
§ Department of Statistics, Informatics, Applications, University of Florence, Italy. Email: [email protected]

Electronic copy available at: https://ssrn.com/abstract=4315389


“In history there are no control groups. There is no one to tell us what might have been.”
Cormac McCarthy, All the Pretty Horses (1992)

1. Introduction

Nowadays, the econometric toolbox for causal panel analysis features many alternative counterfactual
approaches, such as difference-in-differences, synthetic control, and matrix completion methods.
However, all these popular methodologies depend on a key requirement: the availability of a control
group, without which they cannot be applied. This poses a relevant econometric challenge in
observational studies, as there are at least three relevant cases in which a control group does not exist:

i) The treatment simultaneously affects all units. Think, for instance, of a large-scale shock
such as the Great Recession or the COVID-19 pandemic, or a nationwide program for which
there is no counterfactual (Duflo, 2017).

ii) The treatment simultaneously affects most units and the remaining ones are too different
to constitute a valid control group. For example, evaluating the impact of European Union
(EU) common policies is hardly feasible with standard methods, since the (original or
synthetic) control group would necessarily include non-EU countries that are very different
from EU member states.

iii) Only a subgroup of units gets treated, but the set of untreated units cannot form a valid
control group due to violations of the no-interference assumption (Cox, 1958). Spillovers are
very common in aggregate data settings, where the observations are correlated in time, space,
or both, and can lead to very misleading inferences (Sobel, 2006). For instance, if one
estimates the economic effects of Brexit on the UK by relying on a control group made up of
EU countries (e.g., Ireland), the resulting estimates of the causal effect will be biased, as the
Brexit shock engendered substantial spillovers and general equilibrium effects across the EU.

Under these challenging and not uncommon circumstances, standard causal panel data methods
cannot be applied, leaving a methodological gap in the toolbox of empirical economists.1 We fill this
gap by introducing the Machine Learning Control Method (MLCM), a new estimator based on
flexible counterfactual forecasting via machine learning (ML). The MLCM leverages a pre-treatment

1 The only exception is the regression discontinuity in time, an adaptation of the regression discontinuity framework to settings where time is the running variable (Anderson, 2014). However, this approach requires the availability of at least tens of time periods before and after treatment and relies on much stronger assumptions than the classical regression discontinuity design (see Hausman and Rapson, 2018).


information set to forecast unit-level counterfactuals and, in turn, estimate several policy-relevant
causal parameters—including individual, average, and conditional average treatment effects
(CATEs)—in evaluation settings with short (e.g., T ≤ 10) and long panels and no control group.2

The time series literature features some forecasting approaches that researchers have borrowed for
counterfactual building in no-control scenarios. Intuitive approaches include the ‘mean’ and ‘naïve’
methods (Hyndman & Athanasopoulos, 2021), which are straightforward to implement. The mean method estimates, for each unit i, the counterfactual ‘no-treatment scenario’ of the dependent variable 𝑌𝑖 as the average of its pre-treatment values; under the naïve method, the counterfactual is estimated as the last pre-treatment value of 𝑌𝑖. These are very simple ways of estimating counterfactuals even when
the control group is not available, but their accuracy is usually modest. A more complex forecasting
approach is the interrupted time series (ITS) analysis, first formalized by Box and Tiao (1975). ITS
analysis generally fits regression models or autoregressive integrated moving average (ARIMA)
models to the entire time series of observed data; this is done after postulating a structure on the
intervention effect (e.g., constant level shift). By construction, ITS can only deliver an average effect
across the whole post-intervention period. More recent studies (e.g., Brodersen et al., 2015; Menchetti et al., 2022) have formalized the implementation of time series forecasting methods such as ARIMA within the causal framework of the Rubin Causal Model (RCM). However, such methods are not designed for causal panel analysis, often impose restrictive functional forms, and only work when at least tens of pre-treatment periods are available—a requirement that is difficult to meet, especially in light of the multiple structural breaks that have occurred in the last two decades.3
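As a concrete reference point, the ‘mean’ and ‘naïve’ baselines described above can be sketched in a few lines. This is a Python illustration on toy data (the paper's companion software is the R package MachineControl, not reproduced here):

```python
import numpy as np

def mean_counterfactual(y_pre):
    """'Mean' method: counterfactual = average of pre-treatment outcomes."""
    return np.mean(y_pre, axis=1)

def naive_counterfactual(y_pre):
    """'Naive' method: counterfactual = last pre-treatment outcome."""
    return y_pre[:, -1]

# Toy panel (assumed data): 3 units (rows), 4 pre-treatment periods (columns).
y_pre = np.array([[1.0, 2.0, 3.0, 4.0],
                  [5.0, 5.0, 5.0, 5.0],
                  [2.0, 4.0, 6.0, 8.0]])

print(mean_counterfactual(y_pre).tolist())   # [2.5, 5.0, 5.0]
print(naive_counterfactual(y_pre).tolist())  # [4.0, 5.0, 8.0]
```

As the third unit's trending series shows, both baselines ignore trends entirely, which is why their accuracy is usually modest.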

In recent years, a new literature at the intersection between causal inference and ML has evolved.
Most causal ML techniques, such as causal trees and forests (Athey & Imbens, 2016; Wager & Athey, 2018) and artificial control methods (Carvalho et al., 2018; Masini & Medeiros, 2021; Viviano & Bradic, 2022), harness ML in various ways to build a counterfactual scenario, but they still leverage a set of units assumed to be completely unaffected by the treatment. Yet since
counterfactual building is ultimately a predictive task, ML tools allow estimation of causal effects
even when control units are not available (Varian, 2016). A handful of studies have leveraged this
intuition to estimate causal impacts in no-control settings (Abrell et al., 2022; Cerqua et al., 2021;
Cerqua & Letta, 2022); however, these works do not formally propose a new methodology, but rather

2 The MLCM accommodates but does not require the short-panel setup, and its performance tends to improve when more pre-treatment information is available.
3 In addition, the Bayesian Structural Time Series approach of Brodersen et al. (2015) requires the availability of a control time series unaffected by the intervention.


make a purely applied and intuitive use of ML-based counterfactual forecasting.4

Our approach builds upon and bridges three different methodological currents: causal ML, time series
forecasting, and causal panel data methods. The MLCM can leverage any available supervised ML
technique and is especially suited for aggregate data and complex econometric settings with short
panels and no suitable controls. After formally embedding the method within the RCM, we propose
a model selection strategy based on panel cross-validation (CV). The MLCM comes with a full set of
diagnostic, performance, and placebo tests, and is characterized by a high level of generality and
flexibility: it delivers individual treatment effects that can either be the direct object of interest or can
later be aggregated into several policy-relevant causal parameters. Since our approach allows for
arbitrary and unrestricted treatment effect heterogeneity, we also propose an easy-to-interpret data-
driven search for heterogeneity based on a regression tree. To showcase the applicability potential of
the MLCM, we present an extensive simulation study and an empirical application in which we study
the effects of the COVID-19 pandemic on income inequality in Italian local labor markets (LLMs).
We find that the pandemic led to a sudden increase in inequality, which is particularly marked in LLMs that are specialized in tourism, more isolated, and with lower levels of education.

2. The causal framework

In this section, we present the causal framework for the workhorse setup with an observational short-
panel study where the intervention is an extensive policy or shock which affects the entire population,
or the vast majority of it, simultaneously (cases i) and ii) defined above). In such a setting, the
estimation of causal impacts has to rely on estimators expressly developed for a no-control group
scenario.

2.1 The assumptions

Denote with 𝑌𝑖𝑡 the outcome of unit 𝑖 at time 𝑡 and let 𝑊𝑖𝑡 ∈ {0,1} be a random variable describing
the treatment assignment of unit 𝑖 = 1, . . . , 𝑁 at time 𝑡 = 1, . . . , 𝑡0 , 𝑡0 + 1, . . . , 𝑇, where 1 indicates the
treatment, 0 indicates control and 𝑡0 denotes the intervention date.5 As we are focusing on a single
intervention affecting all units simultaneously, we can then write 𝑊𝑖𝑡 = 0 for all 𝑡 ≤ 𝑡0 and 𝑊𝑖𝑡 =
1 for all 𝑡 > 𝑡0 . Under the RCM, the outcome depends on which treatment is received at time 𝑡.
These are called “potential outcomes” and are usually indicated as 𝑌𝑖𝑡 (𝑤𝑖𝑡 ), where the lower case 𝑤𝑖𝑡

4 Furthermore, Abrell et al. (2022) apply an identification strategy that is exclusively based on a linear model (LASSO) and high-frequency data.
5 The word “treatment” is commonly used in the context of randomized controlled trials. As we are dealing with an observational study, we use here, interchangeably, the words “treatment”, “intervention”, “policy”, and “shock”.


denotes a realization of 𝑊𝑖𝑡 .6 In such a general framework, we need some identifying assumptions to
estimate the relevant causal estimands.

Assumption 1. Let 𝑋𝑖𝑡 be a vector of covariates that are predictive of the outcome of unit 𝑖 at time 𝑡 ≤ 𝑡0;
both the covariates and the potential outcomes are unaffected by the policy in the pre-intervention
period, i.e., for all 𝑡 ≤ 𝑡0, 𝑌𝑖𝑡 (1) = 𝑌𝑖𝑡 (0) and 𝑋𝑖𝑡 (1) = 𝑋𝑖𝑡 (0). The covariates are also unaffected
by the policy in the post-intervention period, i.e., for all 𝑡 > 𝑡0 , 𝑋𝑖𝑡 (1) = 𝑋𝑖𝑡 (0).

This assumption implies that in the pre-intervention period the observed outcome corresponds to the
potential outcome absent the policy, i.e., 𝑌𝑖𝑡 = 𝑌𝑖𝑡 (0). Assumption 1 also requires no anticipatory
behavior by economic agents—i.e., no possibility to alter or manipulate their pre-treatment outcome
values. In our application (see Section 4), for instance, it could be violated if COVID-19 affected
inequality in Italy before its outbreak in 2020, which is not the case. This assumption is easily testable
by verifying whether pre-treatment effects are on average zero. Finally, the second part of Assumption 1 is only needed if post-treatment covariate values are used to improve the prediction of the counterfactual outcome absent the policy. Although motivating the choice of covariates is often sufficient, this part can also be verified by testing for the presence of treatment effects on each covariate: those that are significantly impacted by the intervention must be removed from the model.
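The zero-average-placebo-effect check described above can be sketched as follows. This is a simplified Python illustration using a crude one-sample t-statistic; the function name and the 1.96 critical value are illustrative assumptions, not the paper's exact diagnostics:

```python
import numpy as np

def placebo_zero_mean_check(placebo_effects, t_crit=1.96):
    """Check that estimated pre-treatment (placebo) effects average zero:
    a |t| statistic below the critical value is consistent with Assumption 1.
    (Crude illustrative check; the paper's diagnostics are richer.)"""
    e = np.asarray(placebo_effects, dtype=float)
    t = e.mean() / (e.std(ddof=1) / np.sqrt(e.size))
    return bool(abs(t) < t_crit)

balanced = np.tile([-1.0, 1.0], 100)   # placebo effects centered on zero
shifted = balanced + 1.0               # a clear nonzero pre-treatment "effect"
print(placebo_zero_mean_check(balanced))  # True
print(placebo_zero_mean_check(shifted))   # False
```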

We then make a trend predictability assumption:

Assumption 2. Let 𝐼0 be the information set up to time 𝑡 = 0 and let 𝑓𝑌𝑖,𝑡+1 (0)|𝐼𝑡 (𝑦𝑖,𝑡+1 |𝐼𝑡 ) be the
conditional density in the absence of the policy, where 𝐼𝑡 = (𝐼0 , 𝑋𝑖1 , . . . , 𝑋𝑖𝑡 , 𝑌𝑖1 (0), . . . , 𝑌𝑖𝑡 (0) ). Then,

𝑓𝑌𝑖,𝑡+1 (0)|𝐼𝑡 (𝑦𝑖,𝑡+1 |𝐼𝑡 ) = 𝑓𝑌𝑖,1(0)|𝐼0 (𝑦𝑖,𝑡+1 |𝐼𝑡 )

Put simply, the distribution of the potential outcome had the policy not been initiated is invariant to time translations, conditioning on the past information set. Therefore, if we know it in the pre-intervention period, we also know it in the post-intervention period. Notice that, while it
is not possible to explicitly test for trend predictability, we can assess how well the model fits the data
in the pre-intervention periods via a rigorous panel CV procedure (described in full detail in Section
3).7 Assumption 2 implies the absence of other simultaneously occurring events or co-interventions

6 Notice that this notation already assumes that present potential outcomes are unaffected by future treatments (Assumption 1), as well as the absence of interference (Assumption 3), which will be discussed next.
7 This assumption is closely related to Assumption 1. To see this, let us assume that the intervention produced an effect on the outcome before its implementation: this would inevitably alter the conditional distribution of the potential outcome and, as a result, Assumption 2 would no longer be valid. Therefore, another way to strengthen our faith in this assumption is to check the predictive performance of the assumed model under different specifications of the intervention date.


that might affect the outcome of interest. The plausibility of the absence of other concomitant events
affecting post-treatment outcomes requires a careful case-by-case evaluation by the researcher.8

Lastly, we rely on a weaker version of the Stable Unit Treatment Value Assumption (SUTVA)
(Rubin, 1980). Specifically, we substantially relax SUTVA and maintain only its second part: the
treatment is the same for all the units. This is an a priori assumption that the value of 𝑌𝑖𝑡 when exposed
to treatment will be the same no matter what mechanism is used to assign treatment:

Assumption 3. There are no hidden forms of treatment leading to different potential outcomes.

While we maintain the no-multiple-versions-of-treatment assumption, we drop the first part of SUTVA, namely, the no-interference assumption (Cox, 1958).9 We view this as one of the main
advantages of our approach, because, while it has long been known that violations and failures of the
no-interference assumption can lead to misleading inferences in many social science settings (Sobel,
2006), this strong assumption is rarely questioned or tested in practice (Chiu et al., 2023). This is a
key departure from the vast majority of evaluation methods and, therefore, it is important to delve
into its implications. SUTVA-related interference among units can be of two types: 1) interference
from the treated to the control units; 2) interference among treated units.10 First, we completely
circumvent pitfalls regarding Type-1 interference, since our counterfactual scenario is generated using exclusively pre-treatment information. Under Assumptions 1-3, our counterfactual estimates cannot be contaminated by the treatment. Second, we do not postulate the lack of interference among treated units and allow for any possible Type-2 interference. We avoid relying on the no-interference-among-treated assumption because we are aware of the likely presence of interference across both the temporal and cross-section dimensions in social science applications (Xu, 2022). Yet, under
Assumptions 1-3, the estimate of the counterfactual scenario, i.e., 𝑌𝑖𝑡 (0) for all 𝑡 > 𝑡0 , remains

8 Compared to a popular technique applied to a traditional setting, namely difference-in-differences, this assumption might, at face value, appear stronger than the parallel trends assumption, as the latter can control for common trends and shocks affecting both the treatment and control group, which, if supported by careful pre-trend tests, may apparently reassure about the absence of time-varying omitted variable bias. But controlling for common trends only avoids omitted variable bias due to large shocks, such as recessions, whose occurrence is easy to detect in our context as well with the support of institutional and subject-matter knowledge.
9 Following Imbens and Rubin (2015), we consider the case with general equilibrium effects as a scenario in which there are widespread violations of the no-interference assumption. Hence, general equilibrium effects are simply an (extreme) example of interference across units, rather than a separate category.
10 Even if the treatment is the same for all units, there may be residual spillover effects due to units’ individual characteristics. For example, in a study investigating excess mortality from COVID-19, even though the shock affects all municipalities, those with fewer intensive care beds may send patients to neighboring hospitals, prompting a rise in mortality rates there as well. Spillovers due to individuals’ characteristics within the same treatment group are discussed in Ogburn & VanderWeele (2014).


unbiased even in the presence of interference among treated units, since we are not interested in disentangling the direct and indirect effects of the treatment. Indeed, we focus on estimating the total effect of the treatment, i.e., the direct effect on a unit due to the treatment received, coupled with the indirect effects on the same unit originating from spillover and general equilibrium effects. We claim that in scenarios with likely spillover and general equilibrium effects, focusing on the total effect of the treatment is a sensible choice, as it delivers the actual impact on each unit under the realized treatment assignment mechanism.11 This is why there is no need to invoke any restriction on interference.12

2.2 Causal estimands

In panel settings, the number of possible causal quantities increases substantially. In this section, we
define estimands that could be of general interest for panel impact studies in the absence of controls.
We also frame them under a finite-sample perspective, since in the application we observe the entire
population of units under study.13

We can first define the Average Treatment Effect (ATE) across the units at a given point in time.

Definition 1. Let 𝜏𝑖𝑡 = 𝑌𝑖𝑡 (1) − 𝑌𝑖𝑡 (0) denote the unit-level effect of the policy at time 𝑡 > 𝑡0 . The
ATE at time 𝑡 > 𝑡0 is defined as

𝜏_t = (1/N) ∑_{i=1}^{N} 𝜏_{it} = (1/N) ∑_{i=1}^{N} (Y_{it}(1) − Y_{it}(0))    (1)

Notice that in our case all units are subject to the treatment; thus, 𝜏𝑡 corresponds to the Average
Treatment Effect on the Treated (ATT).

The next estimand measures the average effect in a subpopulation of units with the same values of
selected covariates. This is commonly known as the CATE and it is of particular interest when there
is reason to believe that the intervention has produced heterogeneous effects on different

11 An alternative evaluation strategy might attempt to disentangle the direct and the indirect impact of the shock/policy, but this would require the extremely strong and untestable assumption of knowing exactly how spillover and general equilibrium effects propagate over time and space.
12 In the rare cases in which there is no Type-2 interference, the MLCM will straightforwardly retrieve the direct effect of the treatment on each unit, which will simply coincide with the unit-specific total effect.
13 To our knowledge, only Rambachan and Roth (2022) addressed a similar problem and used a finite-sample perspective for quasi-experimental designs where the sampling-based view is unnatural. However, in their setting there are still available control units. We add to that literature by proposing novel estimators of causal effects reflecting this ‘mixed’ situation: according to the finite-sample perspective, inference is drawn only on the units belonging to the sample; however, only one potential outcome is considered fixed and we assume a structure on the counterfactual formalizing our uncertainty (since we are not able to observe a single unit under control).


subpopulations of units as defined by their characteristics.

The typical definition of the CATE under a superpopulation perspective is something along the lines
of 𝜏(𝑥) = 𝐸[𝑌𝑖𝑡 (1) − 𝑌𝑖𝑡 (0)|𝑋𝑖 = 𝑥] (see, e.g., Wager and Athey, 2018). However, this definition
is meaningful only for discrete covariates. Indeed, in the presence of a large number of continuous covariates, finding two or more units with the exact same covariate values is very difficult, if not impossible. Therefore, following the existing literature on the CATE in high-dimensional settings with a mix
of discrete and continuous covariates (Fan et al., 2022; Chernozhukov et al., 2018; Knaus et al., 2021),
we focus on a low-dimensional summary of CATE called “group average treatment effect”.

Definition 2. Let 𝐺𝑖𝑡 ⊂ 𝑋𝑖𝑡 denote a smaller set of individual characteristics and indicate with 𝑁𝑔
the number of units in the population having 𝐺𝑖𝑡 = 𝑔. The group CATE at time 𝑡 > 𝑡0 is defined
as,

𝜏_t(g) = (1/N_g) ∑_{i: G_{it}=g} 𝜏_{it}    (2)

We remark that in a general panel setting with more than one post-treatment period, the above definitions imply the existence of vectors of estimated average effects, i.e., 𝜏 = (𝜏_{t_0+1}, ..., 𝜏_T) and 𝜏(g) = (𝜏_{t_0+1}(g), ..., 𝜏_T(g)). Sometimes, however, researchers are more concerned with temporal aggregations of such effects.

Definition 3. The temporal average ATE and temporal average group CATE at time 𝑡 > 𝑡0 are
defined, respectively, as,

𝜏 = (1/(T − t_0)) ∑_{t=t_0+1}^{T} 𝜏_t    (3)

𝜏(g) = (1/(T − t_0)) ∑_{t=t_0+1}^{T} 𝜏_t(g)    (4)

Lastly, we emphasize that researchers can first estimate individual treatment effects and then decide on the most policy-relevant level of aggregation of the estimated effects, even one different from those presented above.
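The aggregations in Definitions 1-3 can be illustrated with a toy matrix of unit-level effects. This is a minimal Python sketch; the `tau` matrix and the grouping variable are hypothetical:

```python
import numpy as np

# tau[i, k]: effect for unit i in post-treatment period t0 + k (assumed values).
tau = np.array([[1.0, 3.0],
                [2.0, 4.0],
                [3.0, 5.0],
                [4.0, 6.0]])
group = np.array(["a", "a", "b", "b"])  # hypothetical grouping variable G_i

ate_t = tau.mean(axis=0)                 # ATE per period, Eq. (1)
cate_a = tau[group == "a"].mean(axis=0)  # group CATE for g = "a", Eq. (2)
temporal_ate = ate_t.mean()              # temporal average ATE, Eq. (3)
temporal_cate_a = cate_a.mean()          # temporal average group CATE, Eq. (4)

print(ate_t.tolist())    # [2.5, 4.5]
print(cate_a.tolist())   # [1.5, 3.5]
print(temporal_ate)      # 3.5
print(temporal_cate_a)   # 2.5
```

Each aggregation is just a different averaging of the same unit-by-period effect matrix, which is why estimating unit-level effects first leaves the choice of aggregation open.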

2.3. Estimation via the MLCM

To estimate these causal quantities, we propose the MLCM. While the details of the algorithm are
discussed in Section 3, here we introduce our estimators based on the MLCM. In short, the algorithm
starts by running several ML methods on pre-treatment data and then selects the one producing the


most accurate forecasts. Since it is impossible to know in advance which ML method will be selected, in this section we keep the notation as general as possible.14

Denote with 𝒳_{i,t−h} = {X_{it}, X_{i,t−1}, ..., X_{i,t−h}, Y_{i,t−1}, ..., Y_{i,t−h}} an extended set of covariates including the past h lags of both X_{it} and Y_{it}, with 0 ≤ h ≤ t − 1. We assume that the potential outcome absent the policy follows

Y_{it}(0) = f(𝒳_{i,t−h}) + 𝜖_{it},   𝜖_{it} ∼ i.i.d.(0, σ_𝜖²)    (5)

where 𝑓(⋅) is some flexible function of the extended set of covariates and 𝜖𝑖𝑡 is the error term. Notice
that model (5) assumes poolability, i.e., once we control for a large set of covariates and account for
temporal dynamics by adding lagged outcomes, what remains is random noise. We believe this
assumption can be met in practice by using a large set of covariates, so that all possible sources of
heterogeneity between the units are accounted for.
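The extended covariate set with lagged covariates and outcomes can be constructed mechanically. This is an assumed Python sketch of the feature-building step (the function name, array layout, and stacking order are illustrative choices, not the paper's specification):

```python
import numpy as np

def build_lagged_features(X, Y, h):
    """Build an extended covariate set: for each unit i and period t >= h,
    stack X_it with the past h lags of X and Y.
    X: (N, T, p) covariates, Y: (N, T) outcomes.
    Returns an array of shape (N, T - h, p + h*p + h)."""
    N, T, p = X.shape
    rows = []
    for t in range(h, T):
        blocks = [X[:, t]]                   # contemporaneous covariates
        for lag in range(1, h + 1):
            blocks.append(X[:, t - lag])     # lagged covariates
            blocks.append(Y[:, t - lag, None])  # lagged outcomes
        rows.append(np.concatenate(blocks, axis=1))
    return np.stack(rows, axis=1)

# Toy data: 2 units, 4 periods, 3 covariates.
X = np.arange(2 * 4 * 3, dtype=float).reshape(2, 4, 3)
Y = np.arange(2 * 4, dtype=float).reshape(2, 4)
feats = build_lagged_features(X, Y, h=1)
print(feats.shape)  # (2, 3, 7): 3 usable periods, 3 + 3 + 1 features
```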

We also assume that the intervention produces an additive effect on the potential outcomes,15

𝑌𝑖𝑡 (1) = 𝑌𝑖𝑡 (0) + 𝜏𝑖𝑡


Therefore, indicating with I_{t_0} the information set up to time t_0 and denoting with E_{t_0}[⋅] the expectation conditioning on the information set, for any positive integer k the estimator of the counterfactual outcome in the absence of the policy at time t_0 + k is

Ŷ_{i,t_0+k}(0) = E_{t_0}[f(𝒳_{i,t_0+k−h})] = f̂(𝒳_{i,t_0+k−h})    (6)

In the above expression, f̂(⋅) denotes the predicted value of f(⋅). Recall that under Assumption 1, at time t > t_0 we observe Y_{it} = Y_{it}(1), which, as we have the entire population, is a fixed quantity. Then, the estimator for the unit-level causal effect at time t_0 + k is

𝜏̂_{i,t_0+k} = Y_{i,t_0+k} − f̂(𝒳_{i,t_0+k−h})    (7)

Notice that if 𝑘 − ℎ > 0, we would also be using post-treatment unit-specific covariates to adjust the
prediction of the counterfactual series. However, this could be done only after plausibly motivating

14 We believe that this is an advantage over most existing approaches in the causal ML literature: while Carvalho et al. (2018) and Masini and Medeiros (2021) focus on LASSO and Wager and Athey (2018) adopt tree-based methods, the MLCM is flexible and can easily adapt to different data structures. In this respect, it is more similar to the synthetic learner proposed by Viviano and Bradic (2022).
15 This is not a restrictive assumption, since it holds after a suitable transformation of the data. For example, we can start from a multiplicative effect and then apply a logarithmic transformation to recover an additive structure.


or testing that such covariates are unaffected by the intervention (see Assumption 1). As in our
application the post-intervention period coincides with 2020, the year of the COVID-19 pandemic,
we choose to include only lagged covariates, i.e., pre-treatment information up to 2019; in this way,
we avoid relying on the second part of Assumption 1, which in our specific case would result in the
elimination of a large set of predictors, as COVID-19 generated a widespread impact on many
variables that could be important predictors of our outcome (e.g., population, income). For this
reason, throughout the paper, we only refer to the pre-treatment information set. Building on
Equations (6) and (7), the next definition summarizes the causal effect estimators under the MLCM
approach.

Definition 4. For any positive integer k, Y_{i,t_0+k} = Y_{i,t_0+k}(1) is the observed outcome in the post-intervention period under Assumption 1. A finite-sample estimator for the ATE at time t_0 + k under model (5) is

𝜏̂_{t_0+k} = (1/N) ∑_{i=1}^{N} 𝜏̂_{i,t_0+k} = (1/N) ∑_{i=1}^{N} (Y_{i,t_0+k} − f̂(𝒳_{i,t_0+k−h}))    (8)

A finite-sample estimator for the group CATE at time t_0 + k is

𝜏̂_{t_0+k}(g) = (1/N_g) ∑_{i: G_{i,t_0+k}=g} 𝜏̂_{i,t_0+k}    (9)

Finally, estimators for the temporal average ATE and temporal average group CATE are,
respectively,

𝜏̂ = (1/(T − t_0)) ∑_{t=t_0+1}^{T} 𝜏̂_t    (10)

𝜏̂(g) = (1/(T − t_0)) ∑_{t=t_0+1}^{T} 𝜏̂_t(g)    (11)

Inference on the estimated causal effects defined above is performed by bootstrap. In particular, we implement two different bootstrap methods: i) the classic percentile bootstrap and ii) the block bootstrap. See Appendix A for a detailed description of both methods and the bootstrap algorithms used to derive confidence intervals for the ATE and CATEs.
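As an illustration of the first method, a percentile bootstrap for the ATE resamples units with replacement and takes empirical quantiles of the resampled means. This is a minimal Python sketch with simulated effects; the exact algorithms, including the block bootstrap for temporal dependence, are those of Appendix A:

```python
import numpy as np

def percentile_bootstrap_ci(tau_hat, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the ATE: resample unit-level effects
    with replacement and take empirical quantiles of the resampled means.
    (Sketch of the classic percentile bootstrap only.)"""
    rng = np.random.default_rng(seed)
    n = len(tau_hat)
    boots = np.array([rng.choice(tau_hat, size=n, replace=True).mean()
                      for _ in range(n_boot)])
    return np.quantile(boots, [alpha / 2, 1 - alpha / 2])

rng = np.random.default_rng(1)
tau_hat = rng.normal(2.0, 1.0, size=500)   # simulated unit-level effects
lo, hi = percentile_bootstrap_ci(tau_hat)
print(lo < tau_hat.mean() < hi)            # interval brackets the point estimate
```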

Finally, Appendix B reports an overview of a selection of ML models which can be implemented within the MLCM estimation framework.
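To fix ideas, the estimation logic of Equations (6)-(8) can be sketched end-to-end: fit a flexible learner on pre-treatment data only, forecast post-treatment counterfactuals, and take differences. This is a Python toy example in which ordinary least squares stands in for the ML learner selected by panel CV, and the simulated data and true effect of 2 are assumptions:

```python
import numpy as np

def fit_linear(X, y):
    """Stand-in for f-hat: OLS via least squares. In the MLCM, any supervised
    ML learner selected via panel CV would play this role."""
    Xb = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(Xb, y, rcond=None)
    return beta

def predict_linear(beta, X):
    Xb = np.column_stack([np.ones(len(X)), X])
    return Xb @ beta

rng = np.random.default_rng(42)
X_pre = rng.normal(size=(100, 3))                       # pre-treatment covariates
y_pre = X_pre @ np.array([1.0, -2.0, 0.5]) + rng.normal(0.0, 0.1, 100)

X_post = rng.normal(size=(20, 3))        # post-period predictors (cf. Assumption 1)
y0_post = X_post @ np.array([1.0, -2.0, 0.5])           # unobserved counterfactual
y_obs = y0_post + 2.0                                   # observed; true effect = 2

beta = fit_linear(X_pre, y_pre)                  # Design stage: pre-data only
tau_hat = y_obs - predict_linear(beta, X_post)   # Eq. (7): observed - forecast
print(tau_hat.mean())                            # close to the true effect of 2
```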



3. The Machine Learning Control Method

3.1 Departures from the standard machine learning approach

To forecast the counterfactual scenario defined above, we employ supervised ML techniques, whose focus is on minimizing the out-of-sample prediction error, i.e., generalizing well to future unseen data. The degree of flexibility is the result of a trade-off: allowing for more flexibility improves in-sample fit at the cost of reducing out-of-sample fit due to overfitting. ML algorithms tackle this trade-off by relying on empirical tuning to choose the optimal level of complexity.

The standard ML approach is to randomly split the sample into two sets, containing, for instance, 2/3
and 1/3 of observations. One then uses the first set to train ML algorithms (training set), and the
second to test them (testing set). This introduces a ‘firewall’ principle: none of the data involved in
generating the prediction function is used to evaluate it (Mullainathan and Spiess, 2017). The out-of-
sample performance of the model on the unseen (held-out) data of the testing set can be considered a
reliable measure of the ‘true’ performance on future data. To address the bias-variance trade-off, one can rely on automatic tuning, using tools such as random k-fold CV on the training sample to select the best-performing values of the tuning parameters in terms of an a priori defined metric, such as the MSE.

We depart from this standard ML routine and reorient it towards the counterfactual forecasting goal.
First, we do not randomly split the data, but we train, tune, and evaluate the models only on the pre-
treatment data (Design Stage); then, we use the final selected model to forecast counterfactual post-
treatment outcomes. Second, we do not carry out hyperparameter tuning and model selection with
random k-fold CV, as this would not account for the temporal fabric of the data; instead, we propose
a resampling technique suited for forecasting tasks on longitudinal data—panel CV—which is fully
described below.

Finally, a key concern in ML regards the trade-off between accuracy and interpretability. This trade-
off is relevant whenever ML is used for tasks that also require transparency. In the case of the MLCM,
we argue that preserving transparency is important, because higher interpretability of the estimated
counterfactuals helps increase the credibility of the proposed approach (Abadie, 2021). Since our
method can be used with any supervised ML routine, the ultimate decision should be made on the
basis of a comparative performance assessment across a mix of models characterized by different
levels of complexity, carefully balancing any performance gains from complex models against the
loss of interpretability that comes with their use.

Electronic copy available at: https://ssrn.com/abstract=4315389


3.2 Implementation of the MLCM

The implementation of the MLCM requires ten empirical steps, divided into the Design stage and the
Analysis stage. The full process is summarized in Box 1 and described in detail below.

BOX 1: MLCM implementation process

Preliminary: data splitting. Split the full sample on the treatment date: use exclusively pre-
treatment data in the Design Stage.

A. DESIGN STAGE

1) Algorithm selection. Select one or more supervised ML algorithms based on the trade-off
between accuracy and interpretability.

2) Principled input selection. Build a large initial dataset on the basis of domain knowledge. To
maximize forecasting performance, use feature engineering, feature selection, and heuristic
rules from applied predictive modelling to pre-process the data and select a subsample of the
most relevant predictors. Proceed with this smaller set.

3) Panel cross-validation. For each selected algorithm, tune hyperparameters via panel CV (see
Figure 1 for the case of one-step ahead forecasts).

4) Performance assessment. Assess average performance metrics (e.g., MSE) for all the selected
algorithms and identify the best-performing version of the MLCM.

5) Diagnostic and placebo tests. Implement a battery of diagnostic and placebo tests to show the
accuracy and credibility of the research design.

B. ANALYSIS STAGE

6) Final model selection. On the basis of the comparative performance assessment in the Design
stage, pick the best-performing model and use only that in the Analysis stage. Start by
re-training the model on the full pre-treatment sample using the hyperparameter(s) selected
in the Design stage.

7) Counterfactual forecasting. For each unit i, forecast the post-treatment counterfactual outcome
𝑌̂𝑖,𝑡0+𝑘.


8) Estimation of treatment effects. For each unit i, estimate the individual treatment effect in
Equation (8) by taking the difference between the observed post-treatment outcome 𝑌𝑖,𝑡0+𝑘
and the ML-generated potential outcome 𝑌̂𝑖,𝑡0+𝑘.

9) Treatment effect heterogeneity. To uncover heterogeneity of causal effects, data-driven
CATEs can be estimated as in Equation (9) via a simple regression tree analysis with the
individual treatment effects as the outcome variable and a host of exogenous predictors
potentially associated with treatment effect heterogeneity. The resulting tree will be built
with data-driven sample splits using only features that are strong predictors of the estimated
treatment effects.
10) Inference. Compute standard errors via block or classic bootstrapping for the ATE and CATEs.

In the Design stage, the first step involves the selection of the supervised ML algorithms that will
compete in the 'horse-race' of performance testing on the pre-treatment data.16 In the second step,
we recommend deploying some tweaks involving feature pre-selection and engineering, drawing on
consolidated practices in applied predictive modelling (Kuhn & Johnson, 2013). In particular, a key pre-processing
task involves the procedure governing the selection of predictors. There are two main schools of
thought: those who advocate for purely data-driven selection argue that one should build a dataset as
large as possible, and then let the algorithm autonomously decide which variables matter. Others
stress the importance of subject matter knowledge: the researcher should select ex-ante the relevant
predictors, and then feed only those to the algorithm. The rationale is that subject matter knowledge
can separate meaningful information from irrelevant information, eliminating detrimental noise and
enhancing the underlying signal (Kuhn and Johnson, 2013). We propose a hybrid approach: build a
large initial dataset on the basis of domain knowledge, then adopt preliminary and data-driven
variable selection criteria to drop non-informative predictors. As outlined in the causal framework,
counterfactual forecasting is carried out by using an exogenous pre-treatment information set made
up of lagged values of the outcomes and covariates17, but how many lags to include is an empirical
question. We recommend including two or more lagged values (depending on panel length) of both
the outcomes and the covariates, and then using a data-driven approach to select a subsample of the most
relevant features. The latter step reduces the risk of overfitting and degradation of forecasting

16 It is of course possible to stack several different ML algorithms and form complex ensemble learners, but we refrain
from their use because, as stated above, we care about retaining some degree of transparency in counterfactual building.
17 Including lags of the outcome variable is standard practice in time series forecasting (Hyndman & Athanasopoulos,
2021).

performance, as well as increasing interpretability. Finally, note that feature engineering and data
pre-processing matter too, because how the predictors enter the model is also important (Kuhn &
Johnson, 2013).

To carry out model selection and validation on the pre-treatment sample, we propose panel CV. In
our setting, using an alternative CV procedure is necessary since ML methods do not natively handle
panel data and are designed for prediction rather than forecasting. Our panel CV approach adapts
time-series validation based on expanding training windows to a panel setting. The intuition ‒ for the
case of one-step-ahead forecasts, but the routine is easily adapted to multi-step-ahead forecasts ‒ is
provided in Figure 1 and constitutes an adaptation of Hyndman & Athanasopoulos (2021). In short,
we establish cutoff points in the temporal dimension as {𝑡0 − 𝑠, 𝑡0 − 𝑠 + 1, ⋯ , 𝑡0 − 1}, with 𝑠
denoting a positive integer such that 𝑡0 − 𝑠 ≥ 1. At the first CV step, these cutoffs delineate a training
set 𝑇1 = {𝑌𝑖𝑡 : 𝑡 = 1, ⋯ , 𝑡0 − 𝑠} and a validation set 𝑉1 = {𝑌𝑖𝑡 : 𝑡 = 𝑡0 − 𝑠 + 1}. At the second CV
step, the training set becomes 𝑇2 = {𝑌𝑖𝑡 : 𝑡 = 1, ⋯ , 𝑡0 − 𝑠 + 1} and the validation set 𝑉2 = {𝑌𝑖𝑡 : 𝑡 =
𝑡0 − 𝑠 + 2}, and so on, in an expanding-window manner, until the final CV step, where the validation
set includes the observations in the last time period before the intervention. This sequential procedure
ensures that, for each unit, there are no 'future' observations in the training set and no 'past'
observations in the validation set. At each step, the parameter values yielding the best predictive
performance on the validation set (e.g., minimizing the MSE) are stored and then averaged across all
CV steps. This also provides summary measures of average model performance on the pre-treatment
data, which can be screened to select the winner of the horse-race.
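The panel CV routine can be sketched as follows (our own minimal implementation for the one-step-ahead case; the simulated panel and the ridge regression stand in for the researcher's data and chosen ML algorithm):

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(1)
N, T0 = 50, 8                                  # units, pre-treatment periods
t = np.repeat(np.arange(1, T0 + 1), N)         # time index, stacked by period
x = rng.normal(size=N * T0)
y = 0.5 * t + 2.0 * x + rng.normal(size=N * T0)
X = np.column_stack([t, x])

def panel_cv_mse(alpha, s=3):
    """Expanding-window CV: train on t <= cutoff, validate on t = cutoff + 1."""
    errors = []
    for cutoff in range(T0 - s, T0):           # cutoffs t0-s, ..., t0-1
        tr, va = t <= cutoff, t == cutoff + 1
        model = Ridge(alpha=alpha).fit(X[tr], y[tr])
        errors.append(np.mean((model.predict(X[va]) - y[va]) ** 2))
    return np.mean(errors)                     # average across all CV steps

grid = [0.01, 0.1, 1.0, 10.0]
best_alpha = min(grid, key=panel_cv_mse)       # winner of the horse-race
```

By construction, each validation set contains only observations dated strictly after its training window, so no 'future' information leaks into training.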

Figure 1: Panel cross-validation (one-step ahead forecasting)

Note: This procedure is repeated for each unit i in the sample and carried out using exclusively pre-
treatment data. Light green observations form the training sets; dark green ones constitute the test
sets.

To provide evidence about the internal validity of the design, we suggest running diagnostic checks
(e.g., showing that the distribution of the pre-treatment errors is centered around zero) as well as
placebo tests. Following Liu et al. (2022), panel placebo tests can be implemented by hiding one or
more periods of observations right before the onset of the treatment and using a model trained on the
rest of the pre-treatment periods to predict the untreated outcomes of the held-out period(s). If the
identifying assumptions are valid, the differences between the observed and forecast outcomes in
those periods should be close to zero. Importantly, given that we estimate unit-level treatment effects,
we can go beyond showing that the average differences are close to zero (the standard practice in most
event-study designs) and test whether most unit-level placebo differences are close to zero as well.
The Design stage thus ends with a battery of performance, diagnostic, and placebo tests.
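The logic of the panel placebo test can be sketched as follows (a stylized example on a simulated, fully untreated panel; the simple autoregression is a stand-in for the selected ML model):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)
N, T0 = 200, 6
Y = np.zeros((N, T0))                          # untreated AR(1) panel
for t in range(1, T0):
    Y[:, t] = 0.8 * Y[:, t - 1] + rng.normal(scale=0.5, size=N)

# Hide the last pre-treatment period; train only on the remaining periods.
lags = Y[:, 0:T0 - 2].ravel().reshape(-1, 1)   # Y_{t-1} for t = 1, ..., T0-2
targets = Y[:, 1:T0 - 1].ravel()
model = LinearRegression().fit(lags, targets)

# Forecast the held-out period and compare with the observed untreated values.
placebo_gap = Y[:, T0 - 1] - model.predict(Y[:, T0 - 2].reshape(-1, 1))
share_near_zero = np.mean(np.abs(placebo_gap) < 2 * 0.5)  # within 2 noise SDs
```

With valid assumptions, `placebo_gap` should be centered on zero and most of the unit-level gaps should be small, which is exactly the unit-level check described above.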

The Analysis stage starts with final model selection and training: on the basis of the comparative
performance assessment in the Design stage, pick the best-performing model, re-train it on the full
pre-treatment sample (using the hyperparameter values obtained in the Design stage), and use it to
forecast counterfactual outcomes in the post-intervention period. Treatment effects for each unit are
then given by the difference between the post-treatment observed data and the corresponding
ML-generated counterfactual forecasts. The average treatment effect is simply the sample average of
the individual treatment effects.
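Steps 6-8 of the Analysis stage can be sketched as follows (a stylized simulation in which the true effect is known; the autoregression stands in for the cross-validated ML model):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(3)
N, T0, true_effect = 300, 5, 1.5
Y = np.zeros((N, T0 + 1))                      # T0 pre periods + 1 post period
for t in range(1, T0 + 1):
    Y[:, t] = 0.7 * Y[:, t - 1] + rng.normal(size=N)
Y[:, T0] += true_effect                        # treatment hits all units at t0+1

# Step 6: re-train the selected model on the full pre-treatment sample.
lags = Y[:, 0:T0 - 1].ravel().reshape(-1, 1)
targets = Y[:, 1:T0].ravel()
model = LinearRegression().fit(lags, targets)

# Step 7: forecast the counterfactual (untreated) post-treatment outcomes.
y_hat = model.predict(Y[:, T0 - 1].reshape(-1, 1))

# Step 8: unit-level effects and their sample average, the ATE.
tau_i = Y[:, T0] - y_hat
ate = tau_i.mean()
```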

Next, data-driven CATEs are computed via a regression tree analysis on the full sample of estimated
treatment effects. Specifically, this approach uses the estimated treatment effects as the outcome
variable, regresses them on many potentially relevant pre-treatment predictors, and lets the algorithm
pick the main predictors and their critical thresholds. The resulting tree reports the average treatment
effect for all units in each terminal node. As with causal trees and forests (Athey & Imbens, 2016;
Wager & Athey, 2018), this data-driven search for heterogeneity of causal effects removes a major
degree of discretion, because the researcher can only select the set of covariates that can be used by
the tree to build the subgroups. However, the two approaches differ regarding both purpose and
implementation: our data-driven technique is a post-estimation approach aimed at automatically
recovering and visualizing the relevant heterogeneity dimensions, while causal trees and forests are
counterfactual methods for the direct estimation of heterogeneous treatment effects, which they
achieve by leveraging control units. Finally, we are not interested in the out-of-sample performance
of the regression tree, but in retrieving CATEs within the sample and for the entire population of
interest. To this end, there is no need either to split the sample into training and testing sets or to prune
the tree by adjusting the complexity parameter.
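This post-estimation step can be sketched as follows (simulated unit-level effects and made-up moderators; `tourism` and `graduates` are hypothetical variable names introduced only for illustration):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(4)
N = 600
tourism = rng.integers(0, 2, size=N)           # binary moderator (made up)
graduates = rng.uniform(0, 40, size=N)         # irrelevant moderator (made up)
# Simulated unit-level treatment effects, heterogeneous in 'tourism'.
tau_i = 0.5 + 1.0 * tourism + rng.normal(scale=0.3, size=N)

X = np.column_stack([tourism, graduates])
# Full-sample fit, no train/test split and no pruning;
# minimum node size set to 5% of the sample for interpretability.
tree = DecisionTreeRegressor(min_samples_leaf=int(0.05 * N), max_depth=3,
                             random_state=0).fit(X, tau_i)
cates = tree.predict(X)    # CATE = average effect within each terminal node
```

The tree's first split recovers the relevant moderator automatically, while the irrelevant one is left out of the leading splits.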

Finally, standard errors and confidence intervals for the ATE and CATEs can be estimated through
the block or classic bootstrapping approaches described in Appendix A.
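A minimal sketch of the resampling idea for the ATE (here each unit, with its estimated effect, constitutes one bootstrap block, which is our simplifying assumption; Appendix A describes the actual procedures):

```python
import numpy as np

rng = np.random.default_rng(5)
tau_i = rng.normal(loc=0.4, scale=1.0, size=500)  # estimated unit-level effects

def bootstrap_ate(effects, n_boot=1000, seed=0):
    """Resample whole units (blocks) with replacement and recompute the ATE."""
    rng = np.random.default_rng(seed)
    n = len(effects)
    ates = np.array([effects[rng.integers(0, n, size=n)].mean()
                     for _ in range(n_boot)])
    return ates.std(), np.percentile(ates, [2.5, 97.5])

se, (lo, hi) = bootstrap_ate(tau_i)    # standard error and 95% interval
```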



3.3 Simulation study

To investigate the performance of our proposed approach in detecting average treatment effects in
panel datasets, we performed an extensive simulation study using different data generating processes
(linear and non-linear) and different lengths of the pre-intervention period. In more detail, we
generated 500 panel datasets of 100 units each, with T = 5, 10, 20 pre-intervention time periods,
according to the following two models:


𝑌𝑖𝑡 = 𝜙𝑌𝑖𝑡−1(0) + 𝑋𝑖𝑡−1𝛽 + 𝜀𝑖𝑡          (Linear)

𝑌𝑖𝑡 = sin{𝜙𝑌𝑖𝑡−1(0) + 𝑋𝑖𝑡−1𝛽} + 𝜀𝑖𝑡      (Non-linear)

where 𝜀𝑖𝑡 ∼ 𝑁(0, 2), 𝜙 = 0.8, and 𝑋𝑖𝑡−1 = (𝑋𝑖𝑡−1^(1), … , 𝑋𝑖𝑡−1^(11)) is a set of 11 covariates measured
at time 𝑡 − 1, both continuous and categorical, also containing interaction terms and correlated
regressors. Moreover, we allowed the covariates to vary across the units in the dataset by adding a
random term 𝑢𝑖 ∼ 𝑁(0, 0.1). In particular, the covariates are generated as:

𝑋𝑖𝑡^(1) = 0.2 𝑡 + 𝑢𝑖 + 𝜈𝑡^(1), where 𝜈𝑡^(1) ∼ 𝑁(0, 1)
𝑋𝑖𝑡^(2) = 0.2 𝑡 + 𝑢𝑖 + 𝜈𝑡^(2), where 𝜈𝑡^(2) ∼ 𝑁(0, 0.2)
𝑋𝑖𝑡^(3,4,5) = 𝑢𝑖 + 𝜈𝑡^(3,4,5), where 𝜈𝑡^(3,4,5) ∼ 𝑀𝑉𝑁(0, 𝛴) and 𝛴 = [1 0.5 0.7; 0.5 1 0.3; 0.7 0.3 1] is a
variance-covariance matrix
𝑋𝑖𝑡^(6) = 𝑢𝑖 − 𝜈𝑡^(6), where 𝜈𝑡^(6) ∼ 𝑁(0, 1)
𝑋𝑖𝑡^(7) = (0.2 𝑡 + 𝜈𝑡^(1))^2 + 𝑢𝑖 + 𝜈𝑡^(7), where 𝜈𝑡^(7) ∼ 𝑁(0, 0.2)
𝑋𝑖𝑡^(8) ∈ {0, 1}
𝑋𝑖𝑡^(9) ∈ {1, 2, 3}
𝑋𝑖𝑡^(10) = 𝑋𝑖𝑡^(3) ⋅ 𝑋𝑖𝑡^(9)
𝑋𝑖𝑡^(11) = 𝑋𝑖𝑡^(2) ⋅ 𝑋𝑖𝑡^(8)
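For illustration, the linear DGP can be sketched in code as follows (a condensed version using only covariates X(2) and X(6); we read N(0, 2), N(0, 0.2), and N(0, 0.1) as variances, which is our assumption):

```python
import numpy as np

rng = np.random.default_rng(6)
N, T = 100, 5
phi, beta2, beta6 = 0.8, 2.0, 2.0                 # coefficients from the text
u = rng.normal(scale=np.sqrt(0.1), size=N)        # unit-specific component u_i

Y = np.zeros((N, T + 1))
for t in range(1, T + 1):
    nu2 = rng.normal(scale=np.sqrt(0.2))          # common time shock for X(2)
    nu6 = rng.normal()                            # common time shock for X(6)
    x2 = 0.2 * t + u + nu2                        # X(2): linear trend + unit shift
    x6 = u - nu6                                  # X(6)
    eps = rng.normal(scale=np.sqrt(2), size=N)    # epsilon_it ~ N(0, 2) variance
    Y[:, t] = phi * Y[:, t - 1] + beta2 * x2 + beta6 * x6 + eps
```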

This choice of the covariate set is motivated by the need to keep the simulation study as relevant and
close as possible to the real-world empirical settings where interested researchers might apply the
methodology. Indeed, to resemble a typical real-world situation where only a subset of covariates is
relevant, we also set 𝛽^(1) = 𝛽^(8) = 𝛽^(9) = 0. The remaining coefficients are generated as follows:
𝛽^(2) = 𝛽^(6) = 𝛽^(10) = 2, 𝛽^(3) = 𝛽^(7) = 1, 𝛽^(4) = 2.5, 𝛽^(5) = 0.1, and 𝛽^(11) = 1.5. Notice
that the covariance between 𝑋𝑖𝑡^(3) and 𝑋𝑖𝑡^(5) is 𝜎35 = 0.7, so the two variables are highly correlated but



𝑋𝑖𝑡^(3) is ten times more important than 𝑋𝑖𝑡^(5). Table 1 below provides an overview of the generated
datasets.

Table 1: First 7 observations from one of the datasets generated during the simulation study

Time  ID    Y      X1    X2    X3    X4    X5    X6    X7    X8   X9   X10   X11   Y_Lag1
1     1    -0.48   1.60  1.15  2.91  3.66  5.39  0.29  0.56  1    2    5.81  1.15  -2.44
2     1     1.66   1.80  1.59  1.44  3.57  3.68  1.56  1.15  1    3    4.32  1.59  -0.48
3     1     1.48   3.34  1.43  1.88  4.17  4.23  0.34  3.99  0    2    3.76  0.00   1.66
4     1    -0.17   2.05  2.09  1.02  3.15  4.55  0.06  2.04  1    1    1.02  2.09   1.48
5     1     5.45   3.03  1.82  1.06  3.13  2.38  0.85  4.96  0    1    1.06  0.00  -0.17
1     2     1.94   1.35  1.20  2.87  3.61  5.40  0.73  1.04  0    1    2.87  0.00  -1.95
2     2    -1.51   1.20  1.56  1.09  3.44  3.41  1.82  0.94  1    3    3.26  1.56   1.94

At the last time point (e.g., Time = 5 in Table 1) we included a fictional intervention that increases
the outcome of each unit by 2 standard deviations. We expressed the effect in standard deviation units
because adding a unit-specific component to the covariates generates heterogeneity; therefore, the
scale of each 𝑌𝑖 varies across the 𝑖's. We measure the performance of the MLCM both in terms of the
bias of the estimated effect relative to the true impact and in terms of interval coverage.
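These two measures can be computed as in the following toy sketch (simulated unit-level effects replace the full MLCM pipeline, and a plug-in standard error stands in for the bootstrap):

```python
import numpy as np

rng = np.random.default_rng(7)
true_ate, n_sims, n_units = 2.0, 500, 100
bias_terms, covered = [], []
for _ in range(n_sims):
    # Stand-in for one simulated dataset: estimated unit-level effects.
    tau_hat = true_ate + rng.normal(scale=0.5, size=n_units)
    ate_hat = tau_hat.mean()
    se = tau_hat.std(ddof=1) / np.sqrt(n_units)   # stand-in for bootstrap SE
    bias_terms.append(ate_hat - true_ate)
    covered.append(abs(ate_hat - true_ate) <= 1.96 * se)

bias = float(np.mean(bias_terms))       # average deviation from the true impact
coverage = float(np.mean(covered))      # share of 95% intervals covering truth
```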

The results are summarized in Table 2 below and show that the MLCM achieves very low bias and
interval coverage very close to the nominal 95% level under both the linear and non-linear model
specifications. We can also observe that the bias tends to decrease when the number of pre-
intervention time periods increases, as more information is present in the data; nevertheless, the bias
at the shortest time period is still very low, which reinforces our belief that the MLCM can be
effectively used for short panels. The bias under the linear model specification is ten times lower than
under the non-linear specification, as expected, since non-linearities are typically more challenging
to detect; however, the bias under the non-linear model specification is still low (the maximum
relative bias is 7.5% at T = 5). Finally, the interval coverage was estimated based on 1,000 bootstrap
iterations, and there are no relevant differences between the classic and the block bootstrap, although
the block bootstrap seems to perform slightly better overall.



Table 2: Simulation results for the linear and non-linear model specifications

                  Linear                                  Non-linear
T   Bootstrap  True ATE  Bias  Rel. Bias  Coverage    True ATE  Bias  Rel. Bias  Coverage
5   Classic    63.95     0.41  0.006      0.95        4.00      0.31  0.077      0.92
10  Classic    83.57     0.24  0.003      0.95        4.11      0.22  0.054      0.95
20  Classic    90.99     0.21  0.002      0.93        4.18      0.20  0.047      0.94
5   Block      63.95     0.41  0.006      0.95        4.00      0.31  0.077      0.93
10  Block      83.57     0.24  0.003      0.94        4.11      0.22  0.054      0.94
20  Block      90.99     0.21  0.002      0.95        4.18      0.20  0.047      0.95

Lastly, Table 3 shows the share of the 500 simulated datasets in which each algorithm is selected as
best-performing under the linear and non-linear model specifications. As we included correlated
covariates, the best-performing algorithm under the linear model specification is PLS, whereas RF
and GBM are never selected. Conversely, under the non-linear model specification, GBM dominates
when T = 5, followed by RF; for the longer pre-intervention periods (T = 10 and T = 20), the
best-performing algorithms are LASSO and PLS.

Table 3: Best-performing algorithms under each simulation scenario
(linear vs. non-linear model specification, varying T)

        Linear                          Non-linear
T    LASSO   PLS    RF    GBM       LASSO   PLS    RF     GBM
5    0.290   0.710  0     0         0.308   0.018  0.274  0.400
10   0       1      0     0         0.546   0.224  0.050  0.180
20   0       1      0     0         0.474   0.254  0.038  0.234

Notes: The table reports the share of the 500 simulation runs in which each algorithm is selected, e.g., at T = 5 under the
linear specification, 0.290 for the LASSO means that the LASSO is selected in 29.0% of the simulation runs.

4. Empirical application

4.1 Background

Given the unprecedented disruptions brought about by COVID-19, the impact of the pandemic on
income inequality is an issue of great policy relevance. The available evidence does not provide clear-
cut conclusions. While many micro studies find severe labor income inequalities triggered by the
pandemic recession (Adams-Prassl et al., 2020; Blundell et al., 2020; Galasso, 2020), others document
a decrease in inequality driven by government compensation policies (Clark et al., 2021). The
literature on the inequality effects of past pandemics suggests that they increased income inequality,
with the sole exception of the Black Death (Alfani, 2022; Furceri et al., 2022). Regarding Italy,
Galletta and Giommoni (2022) provide granular evidence that the 1918 Spanish flu increased income
inequality in the Italian municipalities more affected by the pandemic.

To our knowledge, there is no granular evidence about the income inequality effects brought about
by the pandemic and their heterogeneity across territories. This is mainly explained by econometric
challenges: the absence of suitable control groups caused by the rapid, almost simultaneous spread of
the pandemic across the world. Italy is an important case study, as it ranks among the hardest-hit
countries and was the first Western country to impose a strict lockdown. Given that the lockdown
was nationwide, the treatment ‒ the COVID-19 shock ‒ simultaneously affected all units.18 Therefore,
we leverage the MLCM.

4.2 Data and implementation

We employ yearly data on local labor markets (LLMs) covering the period from 2013 to 2020. We
cover all Italian LLMs except for the 26 LLMs in Trentino-South Tyrol, for a total of 584 LLMs.19
The dependent variable is the year-to-year change in the log of the Gini index,20 while the initial
pre-treatment information set includes over 100 variables (see Table C.1 in Appendix C for a detailed
description of the variables). This information set includes the first two lags of all the predictors and
two lags of the outcome variable. This implies that we collapse the original 2013-2020 dataset

18 We call the treatment variable the COVID-19 "shock", meaning both the pandemic and the lockdown, because we aim
to capture the total effect of the COVID-19 pandemic, i.e., its direct and indirect effects. In this respect, the national
lockdown merely represents a consequence of the COVID-19 pandemic, not a separate treatment. The fact that the
lockdown was national as well as simultaneous also reassures us that there are no hidden versions of the treatment
(Assumption 3).
19 Right before the COVID-19 outbreak, the provincial governments of Trentino and South Tyrol (a 'special status' region
endowed with more legislative autonomy than ordinary Italian regions) increased the regional surcharge only for richer
individuals declaring an annual income higher than €55,000 for Trento (see
https://www.consiglio.provincia.tn.it/leggi-e-archivi/codice-provinciale/Pages/legge.aspx?uid=34300) and €75,000 for
Bolzano. As these policy changes directly affect our outcome of interest (by effectively leading to income redistribution)
and are essentially concomitant with our treatment of interest, they constitute a direct violation of Assumption 2.
Therefore, we exclude Trentino and South Tyrol from the analysis, as it would be impossible to disentangle the impact of
the COVID-19 shock from the effect of these policy changes. This can be seen as an example of the importance of
institutional and domain knowledge when assessing the credibility of the MLCM identifying assumptions.
20 We take the log of the Gini index following Beck et al. (2010) and Galletta and Giommoni (2021).


into a dataset covering the period 2016-2020. We then use a data-driven approach to restrict the pre-
treatment information set. More specifically, we follow Athey and Wager (2019) and Basu et al.
(2018) and apply a pilot random forest21 to the pre-treatment data to pick a subset of the most
important predictors according to the importance ranking produced by the forest. To select the precise
number of most important predictors in a data-driven manner, we include this number as an additional
parameter in the subsequent panel CV routine.
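The pilot-forest screening can be sketched as follows (an illustrative example with simulated data; scikit-learn's `RandomForestRegressor` with default settings stands in for the pilot forest):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(8)
n, p = 500, 30
X = rng.normal(size=(n, p))
# Only the first three of the 30 candidate predictors actually matter.
y = 3 * X[:, 0] + 2 * X[:, 1] - X[:, 2] + rng.normal(scale=0.5, size=n)

# Pilot forest with default hyperparameters, fit on pre-treatment data only.
pilot = RandomForestRegressor(random_state=0).fit(X, y)

# Keep the k most important predictors; k itself can be tuned via panel CV.
k = 10
top_k = np.argsort(pilot.feature_importances_)[::-1][:k]
```

The truly informative predictors dominate the importance ranking, so the screened set retains the signal while discarding most of the noise variables.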

The implementation process for this application is reported in Box 2 below, while the outcomes of
the empirical analysis are reported in Section 4.3.

BOX 2: Estimating the granular impact of the COVID-19 shock on income inequality

Preliminary: data splitting. We split the full 2016-2020 dataset into two subsets according to the
treatment date (the 2020 onset of the COVID-19 pandemic): 2016-2019 to be used in the Design
Stage; 2020 to be used in the Analysis Stage.

A. DESIGN STAGE

1) Algorithm selection. We select four supervised ML algorithms: 1) stochastic gradient boosting;
2) random forest; 3) LASSO; 4) Partial Least Squares. As we are agnostic about the
functional form of the underlying data-generating process, we opt for a mix of two fully
nonlinear techniques and two linear models.

2) Principled input selection. We build an initial LLM dataset with over 100 predictors on the
basis of literature insights and subject matter knowledge. From this original dataset, we then
keep only the most important predictors according to a preliminary random forest (Athey
and Wager, 2019; Basu et al., 2018) run on the pre-treatment data.

3) Panel cross-validation. For each algorithm, we tune hyperparameters via panel CV, involving
iterative estimation on three different training-testing pairs of pre-COVID datasets.22 See
Figure 2 below for a graphic summary of the panel CV approach.

21 For this preliminary operation, we use default hyperparameter settings, since default parameter choices typically
perform well with random forests (Athey & Wager, 2019).
22 For boosting, we tune the number of trees, the maximum depth of each tree, the minimum number of observations in
terminal nodes, and the learning rate; for random forest, we tune the number of variables randomly sampled as candidates
at each split, and use a fixed number of 1,000 trees; for LASSO, we tune the penalty parameter; for Partial Least Squares,
we tune the number of components. All candidate hyperparameter values are tested repeatedly on each different testing
set.


4) Performance assessment. We assess average performance metrics for the four algorithms by
comparing average forecasted vs. actual outcomes on the 2017-2019 held-out test data. We
then compare the performance of the different MLCM versions with one another and with
the mean and naïve time series methods.

5) Diagnostic and placebo tests. We first check the distribution of the errors of the best-
performing model on the 2017-2019 testing sets and then, again for the best-performing
model, show the map of the unit-level placebo temporal average treatment effects in the
pre-COVID period.

B. ANALYSIS STAGE

6) Final model selection. On the basis of the comparative performance assessment, we pick the
best-performing model ‒ in our case, boosting ‒ and re-train it on the 2017-2019 sample
using the hyperparameters cross-validated in the Design stage.

7) Counterfactual forecasting. We apply the model estimated in Step 6 to the post-pandemic data
and forecast, for each LLM i, the 2020 counterfactual outcome 𝑌̂𝑖,𝑇0+1.

8) Estimation of treatment effects. For each LLM i, we estimate the individual treatment effect
by taking the difference between the observed post-COVID outcome 𝑌𝑖,𝑇0+1 and the ML-
generated potential outcome 𝑌̂𝑖,𝑇0+1.

9) Treatment effect heterogeneity. We estimate data-driven CATEs via a regression tree analysis
with the individual treatment effects as the outcome and a host of potentially relevant
predictors associated with the heterogeneity of the inequality impacts.23
10) Inference. We estimate standard errors for the ATE and CATEs via block-bootstrapping by
performing 1000 bootstrap replications of Steps 6 to 9.

23 To preserve interpretability, we impose that the minimum number of observations in each terminal node must be at
least equal to 5% of the sample. We use the default value of the complexity parameter, equal to 0.01.


Figure 2: Panel cross-validation in the empirical application

Note: This procedure is repeated for all LLMs in the sample and carried out using exclusively
pre-pandemic data. Light green observations form the training sets; dark green ones
constitute the test sets.

4.3 Results

4.3.1 Design stage

Table 4 below reports the average performance ‒ in terms of Mean Squared Error (MSE) ‒ of the four
selected ML algorithms and of two simple time series forecasting techniques, namely the mean and
naïve methods, in forecasting the change in the log of the Gini index across all the available
pre-treatment test sets (2017-2019).

The MLCM vastly outperforms the simpler time series methods, and the best-performing MLCM
version is the one using stochastic gradient boosting. The fully nonlinear random forest and boosting
fare considerably better than the linear ML methods, suggesting significant nonlinearity in the data-
generating process. It is also worth noticing that all ML methods perform better when more
pre-treatment time periods are added. This means that the MSEs reported in Table 4 should be
interpreted as conservative, upper-bound measures of the ML counterfactual forecasting error. It turns
out that the optimal number of variables for boosting is relatively small, as it corresponds to the 11
variables with the highest importance scores attributed by the preliminary forest (see Table C.2 in
Appendix C for a detailed description of these 11 variables).

Importantly, panel CV on the pre-treatment period also implicitly allows carrying out an in-time
placebo analysis along the lines of Bertrand et al. (2004) and Liu et al. (2022). In-time placebos are
performed on the same pool of treated units: we "fake" that the treatment occurred at time 𝑡0 − 1 and
use only information up to 𝑡0 − 1 to forecast the counterfactuals at time 𝑡0. As we know the real
values at time 𝑡0, we can use the difference between the actual and forecasted values to evaluate
forecasting accuracy and assess the unbiasedness of our estimator, as well as the distribution
and spatial autocorrelation of the residuals. The placebo map in Figure 3 depicts the temporal average
(2017-2019) individual 'treatment' effects estimated with the best-performing algorithm, stochastic
gradient boosting, and shows, for virtually all LLMs, no trace of significant differences between the
forecasted and observed Gini index changes in the three years before the pandemic. Finally, a
diagnostic test for the best-performing routine ‒ boosting ‒ is reported in Figure 4, which illustrates
that the distribution of the prediction errors is approximately normal and centered around zero.
Moreover, there is no spatial autocorrelation in the estimation error (Moran's I index of -0.031).
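Moran's I can be computed directly from the unit-level errors and a spatial weight matrix; a minimal sketch (the 4-unit adjacency structure and error values below are made up for illustration):

```python
import numpy as np

def morans_i(x, W):
    """Moran's I of values x under a spatial weight matrix W (zero diagonal)."""
    z = x - x.mean()
    return len(x) / W.sum() * (z @ W @ z) / (z @ z)

# Hypothetical 2x2 grid in which horizontally/vertically adjacent cells
# are neighbours (rook contiguity).
W = np.array([[0, 1, 1, 0],
              [1, 0, 0, 1],
              [1, 0, 0, 1],
              [0, 1, 1, 0]], dtype=float)
errors = np.array([0.02, -0.01, -0.03, 0.01])   # placebo residuals (made up)
I = morans_i(errors, W)   # values near 0 indicate no spatial autocorrelation
```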

Figure 3: Unit-level panel placebo test - Temporal average ‘treatment’ effects (2017-2019)

Notes: The standard deviation refers to the 2020 treatment effects. Estimation via the MLCM with boosting.



Table 4: Performance

Method                                             MSE
MLCM using:
  LASSO                                            0.0001159
  Partial Least Squares                            0.0001122
  Boosting                                         0.0000910
  Random Forest                                    0.0001001
Time series techniques:
  Naïve method (lagged Gini index change)          0.0002513
  Mean method (average lagged Gini index change)   0.0001589

Notes: The panel CV procedure has selected the 11 most predictive variables for boosting and
LASSO, and the 13 most predictive variables for random forest and Partial Least Squares. In
addition, for boosting, it has selected 1,000 trees, 1 as the maximum depth of each tree, 8 as the
minimum number of observations in terminal nodes, and 0.012 for the learning rate. The other
hyperparameters selected are: 2 for the number of variables randomly sampled as candidates at each
split (random forest), 0.9 as the penalty parameter (LASSO), and 3 as the number of components
(Partial Least Squares).

Figure 4: Distribution of the forecasting error

Notes: Average forecasting error of the MLCM estimated with stochastic gradient boosting.



4.3.2 Analysis stage

Figure 5 shows the effects of the pandemic on inequality across Italian LLMs and also reports the
ATE. These estimates come from the best-performing technique of the Design stage, the MLCM
using boosting.24 Three main insights stand out:

1) The COVID-19 shock significantly increased average income inequality in Italy. The ATE
is +0.435% and this effect is strongly statistically significant (standard error: 0.0115). This is
not a trivial impact, especially considering that it occurred during the first year of the
pandemic and that the Gini index is a slow-moving variable (Furceri et al., 2022).

2) The average impact masks substantial heterogeneity across the Italian territory. Some
LLMs experienced substantial increases in inequality levels, whereas others even experienced
a reduction in the Gini coefficient. In particular, the MLCM detects positive clusters of
inequality change (i.e., larger than 1 standard deviation, which corresponds to 1.32%) in
Northern and South-Eastern Sardinia, Central and Southern Tuscany, North-Western
Calabria, parts of Sicily, Lombardy, Aosta Valley, and Veneto. Sizable reductions in the Gini
index, instead, are fewer and scattered across the country.

3) The geography of the inequality effects is inconsistent with the spatial distribution of the
epidemiological impacts of the pandemic. Excess deaths during the first wave of COVID-19,
in fact, were overwhelmingly concentrated in Northern Italy and especially in Lombardy
(Cerqua et al., 2021). In contrast, Galletta and Giommoni (2022) found that the Italian
municipalities most afflicted by the health impacts of the 1918 influenza pandemic were those
that later experienced higher income inequality.
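The estimation logic behind these figures (individual effects computed as observed minus counterfactual outcomes, then averaged into the ATE) can be sketched as follows; all numbers are simulated placeholders, not the paper's data:

```python
import numpy as np

# Hypothetical inputs: observed 2020 outcomes and the ML counterfactual
# forecasts Y_hat(0) for N local labour markets (values are made up).
rng = np.random.default_rng(1)
n = 610
y_obs = rng.normal(0.4, 1.3, size=n)          # observed change in log Gini (%)
y_cf = y_obs - rng.normal(0.4, 1.3, size=n)   # forecasted no-pandemic change

tau = y_obs - y_cf   # unit-level (individual) treatment effects
ate = tau.mean()     # average treatment effect across the treated units
```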

Figure 6 reports the data-driven CATEs estimated with the regression tree analysis.25 The tree reveals
that the largest increases in inequality occurred in LLMs characterized by tourism specialization
(+1.49%). Among the areas that also suffered large impacts (+1.28%), the tree identifies non-
touristic LLMs with a share of population in peripheral areas higher than 74% and a share of
graduates lower than 20%. The areas experiencing the largest increase in income inequality are
therefore local economies predominantly dependent on tourism activities, more isolated, and with a
lower level of education. These impacts are statistically significant.

24
Figure C.1 in Appendix C reports the variable importance ranking generated for the 2019 boosting algorithm.
25
For this analysis, we selected, on the basis of domain knowledge, a set of LLM-level variables potentially associated
with the estimated treatment effects, including, among others, pre-pandemic variables such as the 2019 levels of per capita
income, employment rates, the shares of income accruing to the low- and high-income earners, the share of graduates in
2015, populist voting at the 2019 European elections, and per capita gambling expenditure (used as a proxy for social
distress). We also include excess deaths registered during the pandemic (using the estimates of Cerqua et al., 2021). The
full list of predictors included in the regression tree CATE model can be found in Table C.3 in Appendix C.

Figure 5: The granular impact of the COVID-19 pandemic on income inequality (2020)

The largest reductions in the Gini coefficient, instead, are concentrated in two groups: non-touristic
areas with a lower share of population in peripheral areas and lower levels of pre-pandemic social
distress, as proxied by per capita expenditure on gambling (although this reduction is not statistically
significant); and non-touristic, non-isolated areas characterized by higher social distress but lower
shares of income accruing to low-income earners and lower long-run growth rates.



Figure 6: Data-driven CATEs

Notes: The values within the terminal nodes report the CATEs – measured in percentage change – for all LLMs
grouped within that node. Standard errors are in parentheses.
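The regression-tree CATE step described above can be sketched as follows: a shallow tree is grown on the estimated individual effects, and each terminal node defines a group whose average effect is the CATE. The data below are simulated placeholders, not the paper's covariates:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(2)
n = 610
# Hypothetical LLM-level covariates, echoing the splits discussed in the text
X = np.column_stack([
    rng.integers(0, 2, n),        # tourism-LLM dummy
    rng.uniform(0, 1, n),         # share of population in peripheral areas
    rng.uniform(0.1, 0.4, n),     # share of graduates
])
tau = rng.normal(0.4, 1.3, n)     # estimated unit-level effects (made up)

# Grow a shallow tree on the individual effects; terminal nodes define groups
tree = DecisionTreeRegressor(max_depth=3, min_samples_leaf=30).fit(X, tau)
leaf = tree.apply(X)              # terminal-node id of each LLM
cates = {m: tau[leaf == m].mean() for m in np.unique(leaf)}
```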

5. Conclusions

We propose a new causal panel data method based on flexible counterfactual building via machine
learning. The MLCM can employ any available off-the-shelf ML technique, allows for arbitrary and
unrestricted treatment effect heterogeneity, and is suitable for the estimation of a wide variety of
policy-relevant causal parameters in evaluation settings with short (and long) panels and without a
control group. The method is rigorously embedded within Rubin's potential outcomes framework,
relies on panel CV for model selection and validation, and comes with a full set of diagnostic,
performance, and placebo tests. To showcase the MLCM, we presented simulation evidence and an
empirical analysis of the inequality impacts of the COVID-19 crisis in Italy, which revealed a
pandemic-led increase in inequality in 2020 and sharp heterogeneity of this effect across the Italian
territory. The companion R package MachineControl provides an easy-to-use implementation of the
proposed approach.26 The applicability domain of the MLCM is vast: large-scale shocks, international
economic policies, nationwide policy changes, and regional programs or local interventions engendering
widespread spatial spillovers are all potential cases in which, for various reasons, a control group
may not exist. In such cases, researchers can now leverage our methodology, which complements the
existing econometric toolbox for causal inference and program evaluation.

26
The R package is available on GitHub at this link.

More generally, there is much room for broadening the scope of counterfactual panel forecasting with
ML. Our future agenda includes estimation of causal effects of non-binary treatments, a more explicit
incorporation of spatial and temporal dependencies, and extensions to the traditional setting with a
control group.



References

Abadie, A. (2021). Using synthetic controls: Feasibility, data requirements, and methodological
aspects. Journal of Economic Literature, 59(2), 391-425.

Abrell, J., Kosch, M., & Rausch, S. (2022). How effective is carbon pricing?—A machine learning
approach to policy evaluation. Journal of Environmental Economics and Management, 112, 102589.

Adams-Prassl, A., Boneva, T., Golin, M., & Rauh, C. (2020). Inequality in the impact of the
coronavirus shock: Evidence from real time surveys. Journal of Public Economics, 189, 104245.

Alfani, G. (2022). Epidemics, Inequality, and Poverty in Preindustrial and Early Industrial Times.
Journal of Economic Literature, 60(1), 3-40.

Anderson, M. L. (2014). Subways, strikes, and slowdowns: The impacts of public transit on traffic
congestion. American Economic Review, 104(9), 2763-96.

Angrist, J. D., & Pischke, J. S. (2010). The credibility revolution in empirical economics: How better
research design is taking the con out of econometrics. Journal of Economic Perspectives, 24(2), 3-30.

Arkhangelsky, D., Athey, S., Hirshberg, D. A., Imbens, G. W., & Wager, S. (2021). Synthetic
difference-in-differences. American Economic Review, 111(12), 4088-4118.

Athey, S., & Imbens, G. W. (2016). Recursive partitioning for heterogeneous causal effects.
Proceedings of the National Academy of Sciences, 113(27), 7353-7360.

Athey, S., & Wager, S. (2019). Estimating treatment effects with causal forests: An
application. Observational Studies, 5(2), 37-51.

Basu, S., Kumbier, K., Brown, J. B., & Yu, B. (2018). Iterative random forests to discover predictive
and stable high-order interactions. Proceedings of the National Academy of Sciences, 115(8), 1943-
1948.

Beck, T., Levine, R., & Levkov, A. (2010). Big bad banks? The winners and losers from bank
deregulation in the United States. The Journal of Finance, 65(5), 1637-1667.

Bertrand, M., Duflo, E., & Mullainathan, S. (2004). How much should we trust differences-in-
differences estimates? The Quarterly Journal of Economics, 119(1), 249-275.

Blundell, R., Costa Dias, M., Joyce, R., & Xu, X. (2020). COVID-19 and inequalities. Fiscal Studies,
41(2), 291-319.



Borusyak, K., Jaravel, X., & Spiess, J. (2022). Revisiting Event Study Designs: Robust and Efficient
Estimation. Available at SSRN 2826228.

Brodersen, K. H., Gallusser, F., Koehler, J., Remy, N., & Scott, S. L. (2015). Inferring causal impact
using Bayesian structural time-series models. The Annals of Applied Statistics, 9(1), 247-274.

Carvalho, C., Masini, R., & Medeiros, M. C. (2018). ArCo: An artificial counterfactual approach for
high-dimensional panel time-series data. Journal of Econometrics, 207(2), 352-380.

Cerqua, A., Di Stefano, R., Letta, M., & Miccoli, S. (2021). Local mortality estimates during the
COVID-19 pandemic in Italy. Journal of Population Economics, 34, 1189-1217.

Cerqua, A., & Letta, M. (2022). Local inequalities of the COVID-19 crisis. Regional Science and
Urban Economics, 92, 103752.

Chernozhukov, V., Demirer, M., Duflo, E., & Fernandez-Val, I. (2018). Generic machine learning
inference on heterogeneous treatment effects in randomized experiments, with an application to
immunization in India. National Bureau of Economic Research Working Paper, No. w24678.

Chiu, A., Lan, X., Liu, Z., & Xu, Y. (2023). What to do (and not to do) with causal panel
analysis under parallel trends: Lessons from a large reanalysis study. Available as SSRN working
paper.

Clark, A. E., d’Ambrosio, C., & Lepinteur, A. (2021). The fall in income inequality during COVID-
19 in four European countries. The Journal of Economic Inequality, 19(3), 489-507.

Cox, D. R. (1958). Planning of Experiments. New York, NY: Wiley.

Duflo, E. (2017). The economist as plumber. American Economic Review, 107(5), 1-26.

Efron, B. (1981). Nonparametric estimates of standard error: the jackknife, the bootstrap and other
methods. Biometrika, 68(3), 589-599.

Efron, B. (1982). The jackknife, the bootstrap and other resampling plans. In CBMS-NSF Regional
Conference Series in Applied Mathematics.

Efron, B. (1985). Bootstrap confidence intervals for a class of parametric problems. Biometrika,
72(1), 45-58.

Fan, Q., Hsu, Y. C., Lieli, R. P., & Zhang, Y. (2022). Estimation of conditional average treatment
effects with high-dimensional data. Journal of Business & Economic Statistics, 40(1), 313-327.


Furceri, D., Loungani, P., Ostry, J. D., & Pizzuto, P. (2022). Will COVID-19 have long-lasting effects
on inequality? Evidence from past pandemics. The Journal of Economic Inequality, 1-29.

Galasso, V. (2020). COVID: Not a great equalizer. CESifo Economic Studies, 66(4), 376-393.

Galletta, S., & Giommoni, T. (2022). The effect of the 1918 influenza pandemic on income inequality:
Evidence from Italy. Review of Economics and Statistics, 104(1), 187-203.

Hausman, C., & Rapson, D. S. (2018). Regression discontinuity in time: Considerations for empirical
applications. Annual Review of Resource Economics, 10, 533-552.

Hyndman, R. J., & Athanasopoulos, G. (2021). Forecasting: principles and practice, 3rd edition,
OTexts: Melbourne, Australia. OTexts.com/fpp3.

Imbens, G. W., & Rubin, D. B. (2015). Causal inference in statistics, social, and biomedical sciences.
Cambridge University Press.

Knaus, M. C., Lechner, M., & Strittmatter, A. (2021). Machine learning estimation of heterogeneous
causal effects: Empirical monte carlo evidence. The Econometrics Journal, 24(1), 134-161.

Kuhn, M., & Johnson, K. (2013). Applied Predictive Modeling (Vol. 26, p. 13). New York: Springer.

Laffers, L., & Mellace, G. (2020). Identification of the average treatment effect when SUTVA is
violated. Discussion Papers on Business and Economics, University of Southern Denmark, 3.

Liu, L., Wang, Y., & Xu, Y. (2022). A practical guide to counterfactual estimators for causal inference
with time-series cross-sectional data. American Journal of Political Science, forthcoming.

Masini, R., & Medeiros, M. C. (2021). Counterfactual Analysis With Artificial Controls: Inference,
High Dimensions, and Nonstationarity. Journal of the American Statistical Association, 116(536),
1773-1788.

Menchetti, F., Cipollini, F., & Mealli, F. (2022). Combining counterfactual outcomes and ARIMA
models for policy evaluation. The Econometrics Journal, utac024.

Mullainathan, S., & Spiess, J. (2017). Machine learning: an applied econometric approach. Journal
of Economic Perspectives, 31(2), 87-106.

Norris, P., & Inglehart, R. (2019). Cultural backlash: Trump, Brexit, and authoritarian populism.
Cambridge University Press.



Ogburn, E.L., & VanderWeele, T.J. (2014). Causal diagrams for interference. Statistical Science,
29(4), 559-578.

Rambachan, A., & Roth, J. (2022). An honest approach to parallel trends. The Review of Economic
Studies, forthcoming.

Rubin, D. B. (1974). Estimating causal effects of treatments in randomized and nonrandomized
studies. Journal of Educational Psychology, 66, 688-701.

Sela, R. J., & Simonoff, J. S. (2012). RE-EM trees: A data mining approach for longitudinal and
clustered data. Machine Learning, 86(2), 169-207.

Sobel, M. E. (2006). What do randomized studies of housing mobility demonstrate? Causal inference
in the face of interference. Journal of the American Statistical Association, 101(476), 1398-1407.

Varian, H. R. (2016). Causal inference in economics and marketing. Proceedings of the National
Academy of Sciences, 113(27), 7310-7315.

Viviano, D., & Bradic, J. (2022). Synthetic learner: model-free inference on treatments over
time. Journal of Econometrics, forthcoming.

Wager, S., & Athey, S. (2018). Estimation and inference of heterogeneous treatment effects using
random forests. Journal of the American Statistical Association, 113(523), 1228-1242.

Xu, Y. (2022). Causal Inference with Time-Series Cross-Sectional Data: A Reflection. The Oxford
Handbook for Methodological Pluralism, forthcoming.



Appendix A – Inference

A.1. Bootstrap inference for ATE confidence intervals

Inference on causal effects estimated with the MLCM is performed via the bootstrap. In particular, we
implement two different bootstrap methods: i) the classic percentile bootstrap and ii) the block bootstrap.
The former was introduced by Efron in a series of articles (Efron, 1981, 1982, 1985) and consists
in resampling the unit-time pairs with replacement and re-estimating the parameter of interest for
each bootstrap sample. This generates a bootstrap distribution for the parameter, from which we can
define a $(1 - \alpha)$ confidence interval by simply taking the $\alpha/2$ and $1 - \alpha/2$ quantiles of that
distribution. The classic bootstrap algorithm for the derivation of ATE confidence intervals is
described in Table A.1 below.

Table A.1: Classic bootstrap algorithm for ATE

For $b = 1, \dots, B$, where $B$ is the total number of bootstrap iterations, do

1. Sample the unit-time tuples with replacement, obtaining the bootstrap sample $(Y_{it}^{*}, X_{i,t-h}^{*})^{(b)}$

2. Compute the bootstrap replication of the ATE, i.e., $\hat{\tau}_t^{*(b)}$

End for.

Let $\hat{G}(c) = \#\{\hat{\tau}_t^{*(b)} \leq c\}/B$ be the empirical cumulative distribution function of the $B$
bootstrap replications. Compute a $(1 - \alpha)$ percentile interval as $[\hat{G}^{-1}(\alpha/2), \hat{G}^{-1}(1 - \alpha/2)]$
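A minimal sketch of the algorithm in Table A.1, assuming the unit-time effects have already been estimated (all numbers are simulated, not the paper's data):

```python
import numpy as np

rng = np.random.default_rng(3)
tau = rng.normal(0.4, 1.3, size=610)   # estimated unit-time effects (made up)

B, alpha = 999, 0.05
boot = np.empty(B)
for b in range(B):
    idx = rng.integers(0, tau.size, tau.size)   # resample unit-time pairs
    boot[b] = tau[idx].mean()                   # bootstrap replication of ATE
# Percentile interval from the empirical bootstrap distribution
lo, hi = np.quantile(boot, [alpha / 2, 1 - alpha / 2])
```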

However, since we are dealing with a panel dataset with an inherent temporal structure, resampling
the unit-time pairs independently might not be ideal: although we assume poolability, if the
time series are non-stationary, possible model misspecifications may fail to fully remove the
autocorrelation from the model's residuals (e.g., if there are second-order dependencies and only one lag
of the outcome is included). Therefore, it may be worth sampling the units instead of the unit-
time pairs, so that whenever a unit is sampled, its entire evolution over time is sampled as well. This makes
the inference more robust to misspecification issues. This method is in the spirit of the original block
bootstrap introduced by Hall (1985), Carlstein (1986), Künsch (1989), and Liu and Singh (1992), which
consists in dividing a time series into blocks (overlapping or non-overlapping) and then sampling the
blocks with replacement. Note that block-bootstrapping with panel data has also been implemented
in other recent causal machine learning estimators (Viviano & Bradic, 2022). Since we have a panel
dataset, we can consider each unit to form one block. The full algorithm to derive block-bootstrap
confidence intervals in our setting is detailed in Table A.2 below.


Table A.2: Block bootstrap algorithm for ATE

For $b = 1, \dots, B$, where $B$ is the total number of bootstrap iterations, do

1. Sample $N$ units with replacement

2. Only for the resampled units, retain the tuple $(Y_{it}^{*}, X_{i,t-h}^{*})^{(b)}$ for all $t = 1, \dots, T$

3. Compute the bootstrap replication of the ATE, i.e., $\hat{\tau}_t^{*(b)}$

End for.

Let $\hat{G}(c) = \#\{\hat{\tau}_t^{*(b)} \leq c\}/B$ be the empirical cumulative distribution function of the $B$
bootstrap replications. Compute a $(1 - \alpha)$ percentile interval as $[\hat{G}^{-1}(\alpha/2), \hat{G}^{-1}(1 - \alpha/2)]$
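The only change relative to the classic algorithm is the resampling unit: entire unit trajectories (blocks) are drawn instead of unit-time pairs. A sketch on simulated placeholder effects:

```python
import numpy as np

rng = np.random.default_rng(4)
N, T = 610, 7
tau = rng.normal(0.4, 1.3, size=(N, T))   # unit-by-time effects (made up)

B = 999
boot = np.empty(B)
for b in range(B):
    units = rng.integers(0, N, N)   # resample units (blocks), keeping each
    boot[b] = tau[units].mean()     # unit's full time path together
ci = np.quantile(boot, [0.025, 0.975])
```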

A.2. Bootstrap inference for CATE confidence intervals

Deriving confidence intervals for the group CATEs is more challenging than for the ATE. The reason is
that the CATEs are estimated with a regression tree whose outcome is a collection of individual causal
effects. Therefore, if we resampled the unit-time pairs (or the units only) at the very beginning and
then estimated B different CATEs, we would end up with very different trees: both the splits and the
observations within each terminal node would likely differ. For this reason, we propose to resample
directly the observations within each terminal node of the tree. Note that the observations are the
estimated individual causal effects (computed by comparing the observed data with the ML
predictions). Our estimand of interest is the average of the individual effects, so at each bootstrap
iteration we resample the units in each terminal node and average the individual effects, obtaining
a bootstrap distribution for the CATE. The full algorithm to derive the confidence intervals is
described in Table A.3 below.

Table A.3: Block bootstrap algorithm for CATE

For $b = 1, \dots, B$, where $B$ is the total number of bootstrap iterations, and for $m = 1, \dots, M$,
where $M$ is the total number of terminal nodes of the tree, do

1. Sample with replacement the $N_m$ units in terminal node $m$

2. Retain the tuple $(\hat{\tau}_{it}^{*}, X_{i,t-h}^{*})_m^{(b)}$ for all $t = 1, \dots, T$

3. Compute the bootstrap replication of the ATE in terminal node $m$, i.e., $\hat{\tau}_t^{*}(m)$

4. By definition (9), $\hat{\tau}_t^{*}(m)$ is an estimator of the CATE, since we are averaging the
individual effects among the units in group (terminal node) $m$, i.e., we are
conditioning on covariates

End for.

Let $\hat{G}(c) = \#\{\hat{\tau}_t^{*}(m) \leq c\}/B$ be the empirical cumulative distribution function of the
$m$-group CATE in the $B$ bootstrap replications. Compute a $(1 - \alpha)$ percentile interval as
$[\hat{G}^{-1}(\alpha/2), \hat{G}^{-1}(1 - \alpha/2)]$
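The within-node resampling in Table A.3 can be sketched as follows; the terminal nodes and their effects are simulated placeholders, not the paper's tree:

```python
import numpy as np

rng = np.random.default_rng(5)
# Hypothetical terminal nodes: node id -> estimated individual effects
nodes = {m: rng.normal(mu, 0.5, size=80)
         for m, mu in enumerate([1.5, 0.3, -0.4])}

B = 999
cate_ci = {}
for m, effects in nodes.items():
    boot = np.empty(B)
    for b in range(B):
        # Resample only within terminal node m, then average
        draw = effects[rng.integers(0, effects.size, effects.size)]
        boot[b] = draw.mean()   # bootstrap replication of the m-group CATE
    cate_ci[m] = np.quantile(boot, [0.025, 0.975])
```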



Appendix B – Models used in the empirical application

Some comments on model specification (5) in the context of our empirical application are in order:

• Modeling potential outcomes in the absence of the policy reflects our uncertainty about $Y_{it}(0)$,
rather than an assumption on the distribution of potential outcomes in a superpopulation, as
we already observe the entire population under study;

• Although model (5) is stated in a general form, in our specific empirical study we model the first
difference of the outcome variable. In this way, we are able to eliminate the first-order
autocorrelation. Furthermore, in our specific case of a single post-intervention period, this
does not affect the definition of the causal effect estimators.

To get further insights on the latter comment, let $\Delta Y_{it} = Y_{i,t} - Y_{i,t-1}$ denote the first difference of the
observed outcome. Then, the predicted counterfactual in the absence of the policy at time $t_0 + 1$ under
model (5) is

$$\Delta\hat{Y}_{i,t_0+1}(0) = \hat{Y}_{i,t_0+1}(0) - \hat{Y}_{i,t_0}(0) = \widehat{\Delta f}(X_{i,t_0+1-h})$$

We can show that the unit-level effect on the differenced outcome at time $t_0 + 1$ is exactly equal to
the original effect, i.e.,

$$\Delta Y_{i,t_0+1} - \Delta\hat{Y}_{i,t_0+1}(0) = Y_{i,t_0+1} - Y_{i,t_0} - \left(\hat{Y}_{i,t_0+1}(0) - \hat{Y}_{i,t_0}(0)\right) = Y_{i,t_0+1} - \hat{Y}_{i,t_0+1}(0) = \hat{\tau}_{i,t_0+1}$$

where the last equality follows from Assumption 1. Therefore, by training the MLCM on the first
difference of the outcome, we eliminate first-order autocorrelation and directly obtain the estimated
effect on the original variable. Nevertheless, we remark that the proposed MLCM approach can easily
be implemented also in settings where the prediction is made for more than one time period and
without differencing the outcome variable.
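The identity above can be verified with simple arithmetic (illustrative numbers):

```python
# Tiny numeric check of the identity: with a single post-period and the
# pre-period fit Y_hat_{t0}(0) equal to the observed value, the effect on the
# differenced outcome equals the effect on the level.
y_t0, y_t1 = 10.0, 12.5   # observed outcome at t0 and t0 + 1
y_t1_cf = 11.0            # ML counterfactual forecast of Y_{t0+1}(0)
y_t0_cf = y_t0            # pre-period counterfactual equals the observed value

effect_on_diff = (y_t1 - y_t0) - (y_t1_cf - y_t0_cf)
effect_on_level = y_t1 - y_t1_cf
assert abs(effect_on_diff - effect_on_level) < 1e-12   # both equal 1.5
```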

As discussed above, to learn the function $f(\cdot)$ we test several ML methods. We now provide
some examples of what the causal effect estimators look like for some of the supervised ML methods
that can be employed by the MLCM.

● LASSO

$$E_{t_0}[f(X_{i,t_0+k-h})] = \hat{\beta}' X_{i,t_0+k-h}$$

where $\hat{\beta}$ is estimated as in Tibshirani (1996) by minimizing a penalized version of the residual
sum of squares.

● Regression Tree

$$E_{t_0}[f(X_{i,t_0+k-h})] = T(x_{it}; R_Y, \mu) = \sum_{m=1}^{M} \mu_m \, \mathbb{I}\{x_{it} \in R_m\}$$

where $T(\cdot)$ denotes a tree, $R_Y$ is a binary recursive partition of the covariate space, $M$ is
the number of leaf nodes, and $\mu_m$ are the leaf-specific parameters.

● Random Forests

$$E_{t_0}[f(X_{i,t_0+k-h})] = \frac{1}{B} \sum_{b=1}^{B} T^{(b)}(x_{it}; R_Y, \mu)$$

where $B$ is the number of trees forming the forest.

● Gradient Boosting

$$E_{t_0}[f(X_{i,t_0+k-h})] = E_{t_0}\left[\sum_{m=1}^{M} \beta_m h_m(x_{i,t_0+k-h})\right]$$

where, following Friedman (2001) and Friedman (2002), the $h_m(x)$ are the so-called "base" learners,
chosen to be simple functions of the covariates, combined over $M$ boosting iterations.
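To make the mapping from these estimators to code concrete, the following sketch trains a gradient boosting learner on (hypothetical, simulated) pre-treatment panel data and produces the counterfactual forecasts; scikit-learn is used here for illustration in place of the authors' R implementation:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(6)
N, T, p = 200, 7, 5
X = rng.normal(size=(N * T, p))                       # lagged predictors X_{i,t-h}
y = X @ rng.normal(size=p) + rng.normal(size=N * T)   # pre-period outcomes

# Train on pre-treatment data only, then forecast the no-treatment outcome
# for the post-treatment period from its lagged covariates.
model = GradientBoostingRegressor(random_state=0).fit(X, y)
X_post = rng.normal(size=(N, p))
y_cf = model.predict(X_post)   # counterfactual forecasts Y_hat_{i,t0+1}(0)
```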



Appendix C – Additional application material

Table C.1 – Definition of the variables included in the initial dataset

Variable name | Definition | Time period | Source

Dependent variable
Change in the log of the Gini index | Year-to-year change in the log of the Gini index | 2013-2020 | Ministry of Economy and Finance (MEF)

Time-invariant variables
Economic classification dummies | Without specialization, non-manufacturing (touristic), non-manufacturing (non-touristic), made in Italy, other manufacturing | 2011 | Italian National Institute of Statistics (Istat)
Dummy district | LLM with at least one industrial district | 2011 | Istat
Population dummies | ≤10,000; (10,000; 50,000]; (50,000; 100,000]; (100,000; 500,000]; >500,000 | 2011 | Istat
Postal offices | Number of postal offices | 2012 | Poste Italiane
Area | Area of the LLM in square km | 2011 | Istat
Share of individuals with a university degree | Share of individuals aged 30-34 with a university degree | 2014 | Istat
Share of individuals with a high-school degree | Share of individuals aged 30-34 with a high-school degree | 2014 | Istat
Share of the area at risk of landslides | Share of the area of the LLM at high risk of landslides | 2012 | Italian National Institute for Environmental Protection and Research (Ispra)
Share of urban area | Urban surface / total surface | 2012 | Istat
Seismic area | Categorical variable on the risk of an earthquake | 2003 | National Institute of Geophysics and Volcanology (INGV)
Share of population in the periphery | Share of population located in municipalities considered as peripheral or ultra-peripheral according to the SNAI classification | 2011 | Istat
Altitude | Altitude of the highest city centre among the LLM's municipalities | | Istat
EUAP | Dummy for the presence of protected natural areas | 2010 | Istat
NAT00 | Dummy for the presence of nature protection areas belonging to Natura 2000 | 2000 | Istat
Volcanic area | Dummy volcanic area | 2012 | INGV
Seaside | Dummy seaside | | Istat

Time-varying variables
Share of foreign population | Foreigners / population | 2013-2019 | Istat
Unemployment rate | Resident population aged 15+ not in employment but currently available for work | 2013-2019 | Istat
Activity rate | The number of people employed and those unemployed as a % of the total population | 2013-2019 | Istat
Share of graduate mayors | Share of municipalities with a mayor with a university degree | 2013-2019 | Ministry of the Interior
Share of recycled waste | Share of recycled waste | 2013-2019 | Ispra
Waste | Total waste (absolute value) | 2013-2019 | Ispra
Share of workers in manufacturing | Share of workers in manufacturing | 2013-2019 | Istat
Share of old population | Share of population aged ≥65 | 2013-2019 | Istat
Number of weddings per 1,000 inhabitants | Number of weddings per 1,000 inhabitants | 2013-2019 | Istat
Share of beds in 4- or 5-star hotels | Share of beds in 4- or 5-star hotels | 2013-2019 | Istat
Road accidents | Number of road accidents per 1,000 inhabitants | 2013-2019 | Istat
Newborns | Number of newborns per 1,000 inhabitants | 2013-2019 | Istat
Deaths | Number of deaths per 1,000 inhabitants | 2013-2019 | Istat
Share of individuals who declared up to 26K | Share of individuals who declared up to 26K per year | 2013-2019 | MEF
Share of individuals who declared >= 75K | Share of individuals who declared >= 75K per year | 2013-2019 | MEF
Declared income per capita | Declared income per capita | 2013-2019 | MEF
Declared income (total) | Declared income (total) | 2013-2019 | MEF
Share of income for pensions | Share of overall declared income for pensions | 2013-2019 | MEF
Share of income for employment | Share of overall declared income for employment | 2013-2019 | MEF
Share of income for self-employment | Share of overall declared income for self-employment | 2013-2019 | MEF
Share of income for entrepreneurship | Share of overall declared income for entrepreneurship | 2013-2019 | MEF
Share of income for buildings | Share of overall declared income for buildings (rendite) | 2013-2019 | MEF
Share of income for financial activities | Share of overall declared income for financial activities | 2013-2019 | MEF
Total revenues from regional surcharge | Total revenues from regional surcharge on personal income | 2013-2019 | MEF
Total revenues from municipality surcharge | Total revenues from municipality surcharge on personal income | 2013-2019 | MEF
Number of individuals with a positive income | Number of individuals with a positive declared income | 2013-2019 | MEF
Number of workers | Number of workers | 2013-2019 | Istat
Labor force | Labor force | 2013-2019 | Istat
Hotel beds | Number of hotel beds | 2013-2019 | Istat
Population | Resident population | 2013-2019 | Istat
Average price per square meter | Average price per square meter - house | 2013-2019 | Osservatorio del Mercato Immobiliare – Agenzia delle Entrate

Notes: In our application, time-invariant covariates are either features of the LLM not subject to time changes (e.g., area
of the LLM) or features taken at a time before 2016.



Table C.2 – Definition of the variables included in the estimation process (boosting)

Variable name | Definition | Time period | Source

Dependent variable
Change in the log of the Gini index | Year-to-year change in the log of the Gini index | 2013-2020 | MEF

Predictors
Change in the log of the Gini index | Year-to-year change in the log of the Gini index (first lag) | 2013-2019 | MEF
Growth rate of the declared income | Year-to-year growth rate of the total declared income (first lag) | 2013-2019 | MEF
Growth rate of the number of individuals with a positive income | Growth rate of the number of individuals with a positive declared income (first lag) | 2013-2019 | MEF
Change in the share of income for pensions | Change in the share of overall declared income for pensions (first and second lags) | 2013-2019 | MEF
Change in the share of individuals who declared up to 26K | Change in the share of individuals who declared up to 26K (first lag) | 2013-2019 | MEF
Change in the share of income for financial activities | Change in the share of income for financial activities (first lag) | 2013-2019 | MEF
Change in the share of income for buildings | Change in the share of overall declared income for buildings (first and second lags) | 2013-2019 | MEF
Change in the share of income for employment | Change in the share of overall declared income for employment (second lag) | 2013-2019 | MEF
Change in the unemployment rate | Change in the resident population aged 15+ not in employment but currently available for work (second lag) | 2013-2019 | Istat

Notes: These are the 11 variables selected via the panel CV procedure by boosting.
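Lagged predictors of the kind listed in this table can be built within each LLM via a grouped shift; a pandas sketch on hypothetical data (column and unit names are illustrative, not the paper's dataset):

```python
import pandas as pd

# Hypothetical panel: first and second lags of a predictor within each LLM
df = pd.DataFrame({
    "llm": [1, 1, 1, 2, 2, 2],
    "year": [2017, 2018, 2019, 2017, 2018, 2019],
    "d_log_gini": [0.1, -0.2, 0.3, 0.0, 0.4, -0.1],
})
df = df.sort_values(["llm", "year"])
# shift(k) within each group produces the k-th lag; the first k years are NaN
df["d_log_gini_l1"] = df.groupby("llm")["d_log_gini"].shift(1)
df["d_log_gini_l2"] = df.groupby("llm")["d_log_gini"].shift(2)
```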



Table C.3 – Definition of the variables included in the data-driven CATEs analysis

Variable name | Definition | Time period | Source

Change in the log of the Gini index in 2020 (dependent variable) | Treatment effect of the COVID-19 crisis on the Gini index (%) | 2020 | Estimated via the MLCM with boosting
Share of income accruing to low-income earners | Share of income accruing to individuals with an overall income ≤€26,000 | 2019 | MEF
Share of income accruing to high-income earners | Share of income accruing to individuals with an overall income >€75,000 | 2019 | MEF
Long-run income growth rate | Compound annual real growth of income per head | 2001-2019 | MEF
Tourism LLM | Dummy variable equal to 1 for LLMs specialized in tourism | 2011 | Istat
Populist score | Variable created by first averaging the scores of the authoritarian (right-wing) and the anti-elite components of populism (Norris and Inglehart, 2019) and then multiplying the voting share of each party by its corresponding average score at the 2019 European elections | 2019 | Authors' elaboration based on Norris and Inglehart's scores (2019)
Share of individuals with a university degree | Share of individuals aged 30-34 with a university degree | 2015 | Istat
Population | Resident population | 2019 | Istat
Per capita expenditure on gambling | Per capita expenditure on gambling | 2019 |
Unemployment rate | Resident population aged 15+ not in employment but currently available for work | 2019 | Istat
Excess mortality estimates | Municipality-level excess mortality estimated by applying ML techniques to all-cause deaths data, aggregated at the LLM level | From Feb 21, 2020 to Sep 30, 2020 | Cerqua et al. (2021)
Share of temporary contracts | Number of employees with temporary contracts in October divided by the number of employees in October | 2015 | Istat
Share of jobs in suspended economic activities | Share of jobs in activities suspended in March 2020 by the Italian Government due to the spread of the pandemic | 2017 | Istat
Per capita income | The amount of money earned per person | 2019 | MEF
Share of innovative start-ups | The ratio between innovative start-ups and the universe of firms registered in the Business Register | Average (2016-2019) | Business Register
Share of firms having employees in CIGS | The number of firms with employees in CIGS divided by the universe of firms registered in the Business Register | Average (2015-2018) | Ministry of Labor and Social Policies
Share of population living in peripheral areas | Share of population living in areas defined by Istat as peripheral or ultra-peripheral | Jan 1, 2020 | Istat
Dependency ratio | The ratio of those typically not in the labor force (the dependent part, ages 0 to 14 and 65+) to those typically in the labor force (the productive part, ages 15 to 64) | Jan 1, 2020 | Istat
Index of relational intensity (IIRFL) | The percentage of flows within an LLM that connect different municipalities on the total of flows within the LLM. The indicator ranges from values close to 0 to 100 (the case in which all workers of the LLM's municipalities work in another municipality). The higher the indicator, the greater the inter-municipal turbulence in terms of flows | 2011 | Istat


Figure C.1: Variable importance ranking - 2019 Boosting model

