Combining Random Forest and Copula Functions, De Luca
See the Terms and Conditions (https://onlinelibrary.wiley.com/terms-and-conditions) on Wiley Online Library for rules of use; OA articles are governed by the applicable Creative Commons License
INTELLIGENT SYSTEMS IN ACCOUNTING, FINANCE AND MANAGEMENT
Intell. Sys. Acc. Fin. Mgmt. 17, 91–109 (2010)
Published online in Wiley Online Library (wileyonlinelibrary.com) DOI: 10.1002/isaf.315
SUMMARY
In this paper we propose a heuristic strategy aimed at selecting and analysing a set of financial assets, focusing
attention on their multivariate tail dependence structure. The selection, obtained through an algorithmic procedure
based on data mining tools, assumes the existence of a reference asset we are specifically interested in. The
procedure allows one to opt for two alternatives: to prefer those assets exhibiting either a minimum lower tail
dependence or a maximum upper tail dependence. The former could be a recommendable opportunity in a finan-
cial crisis period. For the selected assets, the tail dependence coefficients are estimated by means of a proper
multivariate copula function. Copyright © 2010 John Wiley & Sons, Ltd.
1. INTRODUCTION
In recent decades financial markets have been characterized by an increasing globalization, and a
complex set of relationships among asset returns has been established. In spite of these close connec-
tions, cross-market correlation coefficients rarely assume significantly high values. In this context,
Engle (2002) has demonstrated that we should preferably analyse dynamic correlation, showing that
asset returns are positively associated conditionally on market volatility. The presence of a stronger
dependence when markets are more volatile (and especially during crises) suggests investigating the
presence of an appreciably higher association in the tails of the joint distribution. In the literature this
phenomenon is known as tail dependence. The main feature of joint distributions characterized by tail
dependence is the presence of heavy and possibly asymmetric tails; thus, the traditional hypothesis of
(multivariate) Gaussianity is completely inadequate. In the absence of a reasonable alternative distri-
butional assumption, a copula approach can be particularly interesting.
Copula functions are effective quantitative tools for modelling the joint dependence of random
variables; for example, see Joe (1997), Cherubini et al. (2004) and Nelsen (2006). The use of copula
functions in finance is recent and the history of its rapid growth can be read in Genest et al. (2009).
Applications of copula functions to bivariate financial time series have been carried out for capturing
the dynamics of the dependence structure (Jondeau and Rockinger, 2006; Patton, 2006; Bouyé and
* Correspondence to: Paola Zuccolotto, Dipartimento Metodi Quantitativi, C.da S. Chiara, 50-25122 Brescia, Italy.
E-mail: [email protected]
Salmon, 2009), for estimating the value-at-risk (Palaro and Hotta, 2006) or for measuring the tail
dependence (Fortin and Kuzmics, 2002).
The main advantage of copula functions is that they allow us preliminarily and separately to model
the marginal distributions, which are then joined into a multivariate distribution. A second desirable
property is that some copula functions imply very flexible joint distributions, able to fulfil an in-depth
analysis of the tail dependence structure. Unfortunately, this is not the case for the most common copula
family, the elliptical family, which includes the Gaussian and the Student-t: the Gaussian copula has no
tail dependence at all, while the Student-t imposes equal lower and upper tail dependence. On the other
hand, the Archimedean family allows for different lower and upper tail dependence.
Owing to the complex structure of financial markets, a high-dimensional multivariate approach to
tail dependence analysis is surely more insightful. However, a growing number of jointly modelled
variables rapidly increases the formal and computational complexity. In an analysis of geographical
indices, for example, a joint study of all the markets in the world is impossible. In general,
even a drastic restriction of the study to the so-called developed markets is not sufficient to provide a
manageable number of variables.
To cope with the dimensionality problem, a number of strategies based on the reduction of a mul-
tivariate copula to a cascade of bivariate copulae can be found in the literature (see Aas and Berg
(2009) for a detailed comparison of the proposed techniques). Alternatively, a procedure for selecting
the most suitable assets (according to some definite rule) is necessary. However, in high-dimensional
contexts, the selection procedure itself can be computationally burdensome.
In this paper we propose to carry out the selection using data mining tools. We approach the problem
with heuristic reasoning and propose an algorithmic procedure, based on the recent Random Forest
technique—see Breiman (2001)—in order to appropriately select the assets we want to introduce in an
analysis aimed at investigating tail dependence. The selection is built with a hierarchical structure
around an asset we are interested in. In other words, we first choose a reference asset we want to
necessarily include in the set of analysed assets, as frequently happens in investment strategies, and
then, after filtering the data from autocorrelation and heteroskedasticity, we select step by step the
other assets, by adding an asset at each step until a termination criterion is satisfied. For the selected
assets, we propose to use a copula approach in order to estimate the tail dependence coefficients.1
Thus, the aim of this paper is to propose a structured procedure for selecting and analysing a set of
financial assets, focusing attention on their joint tail dependence.
Given a large set of financial assets, the proposed strategy is organized in three steps:
1. Application of univariate models to the financial returns, in order to filter the data from autocor-
relation and heteroskedasticity.
2. Selection of k financial assets (including the reference one) whose returns exhibit a low (high) level
of lower (upper) tail dependence. In detail, from an investment perspective, we could aim at including
in a portfolio assets with low association in the case of negative shocks, or assets with high asso-
ciation in the case of positive shocks. The choice between the two strategies depends upon the
expectation about the future trends of the financial markets and upon the desired risk degree of the
investors. In this paper we mimic a financial crisis perspective; hence, we will focus on the selec-
tion of assets with low association in the lower tail. The extension to the upper tail is straightforward.
3. Estimation of the multivariate tail dependence coefficients.
1 An alternative idea in this context is to model the tails of the multivariate distributions using extreme value theory; see McNeil (1999).
The paper is organized as follows. In Section 2 the theory of copula functions is briefly recalled
and the tail dependence coefficients are defined. Section 3 describes an empirical problem showing
that a selection procedure using copula functions is computationally unfeasible. Section 4, after briefly
recalling the Random Forest algorithm, illustrates the functioning of the proposed asset selection
procedure, also presenting the results of two simulation studies. An application to real data is shown
in Section 5. Section 6 concludes.
In the multivariate analysis of returns, the assumption about their distribution is a critical issue. In the
past the hypothesis of Gaussianity has been largely exploited, but it has seldom provided satisfactory
results in terms of density forecasts which are very useful in risk management (e.g. to compute the
value-at-risk or the expected shortfall). In recent years, a great deal of interest in non-normal probabil-
ity laws has contributed to overcoming the traditional Gaussian distribution. The multivariate t and
skew-t distributions are remarkable examples. However, a high degree of flexibility can be reached
using a copula function.
A copula function is a multivariate distribution function with standard uniform marginal distribu-
tions. According to the theorem proposed by Sklar (1959), each distribution function H(x1,x2, . . . ,xn)
can be expressed by a copula function whose arguments are the univariate distribution functions; that is:

H(x1, x2, …, xn) = C(F1(x1), F2(x2), …, Fn(xn))    (1)

If the distribution function H is continuous, then the copula C is unique. Conversely, if C is a copula
and F1(x1), F2(x2), . . . , Fn(xn) are the marginal distributions, then H(x1,x2, . . . ,xn) is a joint distribution
function with margins Fi(·).
The main advantage of using a copula function is that the specification of the marginal distributions
can be separated from the definition of the dependence structure.
It is common to denote ui = Fi(xi), so that equation (1) is usually presented as
H ( x1 , x2 ,…, xn ) = C ( u1 , u2 ,…, un )
The most popular families of copula functions are the elliptical and the Archimedean. Among the
elliptical copulae, a prominent role is assigned to the Gaussian and the Student’s t copulae.
The Archimedean copulae are defined through a generator function, Φ: I → R+, continuous, decreas-
ing and convex, such that Φ(1) = 0. A bivariate Archimedean copula is expressed as
C ( u1 , u2 ) = Φ −1 ( Φ( u1 ) + Φ( u2 ))
In the n-variate case, the Archimedean construction extends to

C(u1, …, un) = Φ^(−1)(Φ(u1) + Φ(u2) + … + Φ(un))
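As a quick illustration of this construction, the sketch below builds a bivariate Archimedean copula from a generic generator; the Clayton generator φ(t) = (t^(−θ) − 1)/θ is used as an example, and the function names and parameter value are our own choices, not the paper's:

```python
import numpy as np

# Bivariate Archimedean construction C(u1, u2) = phi_inv(phi(u1) + phi(u2))
def archimedean(u1, u2, phi, phi_inv):
    return phi_inv(phi(u1) + phi(u2))

theta = 2.0
phi = lambda t: (t ** -theta - 1) / theta            # Clayton generator: decreasing, convex, phi(1) = 0
phi_inv = lambda s: (1 + theta * s) ** (-1 / theta)

# Uniform margins are preserved: C(u, 1) = phi_inv(phi(u)) = u
u = archimedean(0.3, 1.0, phi, phi_inv)              # recovers 0.3 up to rounding
```

With this particular generator the construction reduces to the Clayton copula (u1^(−θ) + u2^(−θ) − 1)^(−1/θ).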
A remarkable example, widely used in financial time-series analyses, is the Clayton copula, given by
C(u1, …, un) = (u1^(−θ) + u2^(−θ) + … + un^(−θ) − (n − 1))^(−1/θ)
characterized by the parameter θ > 0. The presence of a single parameter, which captures the co-movements
only between extremely low values (that is, in the lower tail of the distribution), gives this copula
function a limited capacity to explain very complex relationships among n variables.
A more satisfying description of these relationships can be achieved by considering a copula function
able to capture the movements in both the tails, such as the Joe–Clayton copula (also known as BB7
copula in the bivariate case according to the classification in Joe (1997)), which is a generalization of
the Clayton copula. It is formalized as
C(u1, …, un) = 1 − {1 − (∑i=1…n [1 − (1 − ui)^κ]^(−θ) − (n − 1))^(−1/θ)}^(1/κ)    (3)
and is then characterized by two parameters, θ > 0 and κ ≥ 1. The generator function and its inverse
are given by Φ(t) = [1 − (1 − t)^κ]^(−θ) − 1 and Φ^(−1)(t) = 1 − [1 − (1 + t)^(−1/θ)]^(1/κ) respectively. When κ = 1
we recover the Clayton copula.
In general, for an n-variate copula, the density is obtained by computing the nth-order mixed derivative

c(u1, u2, …, un) = ∂^n C(u1, u2, …, un) / (∂u1 ∂u2 ⋯ ∂un)    (4)
The parameters are usually estimated through the maximum likelihood method.
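As a concrete (and purely illustrative) sketch of this estimation step, the code below simulates from a bivariate Clayton copula by conditional inversion and recovers θ by maximum likelihood; the sample size, seed and optimizer bounds are our own choices, not prescribed by the paper:

```python
import numpy as np
from scipy.optimize import minimize_scalar

def clayton_loglik(theta, u, v):
    # Bivariate Clayton density:
    # c(u, v) = (1 + theta) (u v)^(-(1+theta)) (u^-theta + v^-theta - 1)^(-(2theta+1)/theta)
    return np.sum(np.log1p(theta)
                  - (1 + theta) * (np.log(u) + np.log(v))
                  - (2 + 1 / theta) * np.log(u ** -theta + v ** -theta - 1))

rng = np.random.default_rng(42)
theta_true = 2.0
u = rng.uniform(size=5000)
w = rng.uniform(size=5000)
# Conditional inversion: v | u is drawn by inverting dC(u, v)/du at a uniform w
v = ((w ** (-theta_true / (1 + theta_true)) - 1) * u ** -theta_true + 1) ** (-1 / theta_true)

res = minimize_scalar(lambda th: -clayton_loglik(th, u, v), bounds=(0.01, 20), method="bounded")
theta_hat = res.x
lambda_L = 2 ** (-1 / theta_hat)   # lower tail dependence coefficient of the fitted copula
```

The estimate theta_hat should land close to the true value 2, and lambda_L is the implied lower tail dependence coefficient discussed below.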
The lower tail dependence coefficient between Xi and Xj is defined as

λL^(i|j) = lim(v→0+) P(Fi(Xi) ≤ v | Fj(Xj) ≤ v) = lim(v→0+) P(Ui ≤ v | Uj ≤ v)

and measures the concordance between extremely low values of Xi and Xj. Analogously, the upper tail
dependence coefficient is defined as

λU^(i|j) = lim(v→1−) P(Ui > v | Uj > v)
and measures the concordance between extremely high values of Xi and Xj.
The choice of the family and of the specific copula can be driven by the observed dependence
between extreme values. For example, the Gaussian copula does not allow for any tail dependence,
whereas the t-copula models dependence in the two tails in the same way. These copulae should be
chosen when the hypothesis of absence or equality of upper and lower tail dependence respectively
is reasonable.
On the other hand, Archimedean copulae are more manageable from this point of view. They admit
lower or upper tail dependence, or both, in a nonsymmetric way.
In the financial context, the lower tail dependence assumes a very important role. In fact, the depend-
ence between extremely low values of returns is a measure of the risk related to a set of assets. Thus,
a significant statistical tool for risk management can be a copula function able to model at least the
lower tail dependence properly.
The Joe–Clayton copula, equation (3), admits both lower and upper tail dependence, whose coefficients
are given by λL^(i|j) = 2^(−1/θ) and λU^(i|j) = 2 − 2^(1/κ) for i, j = 1, …, n and i ≠ j.
or we could be interested in the probability of extremely low values of m assets, given an extremely
low value of the jth asset:

λL^(1…m|j) = lim(v→0+) P(U1 ≤ v, …, Um ≤ v | Uj ≤ v)

and so on. These conditional probabilities are easily evaluated from the equation of the copula function. Multivariate upper tail dependence coefficients can be defined in an analogous way.
When we have to select a certain number of assets starting from a reference asset (referred to as
asset 1) to which other assets (referred to as assets 2, 3, . . .) are added in turn, we can compute a
sequence of lower tail dependence coefficients, following the order of the added assets, enlarging in
turn the information set, that is the conditioning event. So, after estimating

λL^(1|2) = lim(v→0+) P(U1 ≤ v | U2 ≤ v)

the lower tail dependence coefficient between asset 1 and asset 2, we can estimate
λL^(1|23) = lim(v→0+) P(U1 ≤ v | U2 ≤ v, U3 ≤ v)
until

λL^(1|2…n) = lim(v→0+) P(U1 ≤ v | U2 ≤ v, …, Un ≤ v)    (5)

This sequence of conditional probabilities can be seen as a measure of the tail dependence of a
reference asset (asset 1) on the other assets. It offers a clear view of the riskiness of asset 1 in a crisis
period describing the possible contagion in terms of probabilities.
At the same time, we could be interested in

λL^(12|3) = lim(v→0+) P(U1 ≤ v, U2 ≤ v | U3 ≤ v)

until

λL^(1…n−1|n) = lim(v→0+) P(U1 ≤ v, …, Un−1 ≤ v | Un ≤ v)    (6)

It is worth noting that the following relation holds between the coefficient (6) and some marginal
coefficients of the form of equation (5):

λL^(1…n−1|n) = λL^(1|2…n) · λL^(2|3…n) ⋯ λL^(n−1|n)    (7)

Thus, the tail dependence coefficient (6) is able to measure what we call chain effect risk in an n-set
of assets; that is, the probability of a crisis for the entire n-set, given that a default has occurred for
asset n.
Applying these definitions, De Luca and Rivieccio (2010) have shown that for the n-dimensional
Joe–Clayton copula, equation (3), the above-mentioned lower tail dependence coefficients are
given by
λL^(1|2…n) = (n/(n − 1))^(−1/θ)

and

λL^(1…n−1|n) = n^(−1/θ)
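The two expressions are consistent with the chain effect relation discussed above: the marginal coefficients telescope, since the product of (m/(m − 1))^(−1/θ) over m = 2, …, n equals n^(−1/θ). A quick numerical check (the values of θ and n are arbitrary):

```python
theta, n = 1.5, 6

# lambda_L^(1|2...m) for the Joe-Clayton copula on nested subsets of sizes m = 2, ..., n
marginals = [(m / (m - 1)) ** (-1 / theta) for m in range(2, n + 1)]

product = 1.0
for lam in marginals:
    product *= lam

# the telescoping product equals the chain-effect coefficient n^(-1/theta)
assert abs(product - n ** (-1 / theta)) < 1e-12
```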
The estimation of tail dependence coefficients with a copula function can be used in order to select,
out of a great set of p possible investment alternatives, a subset of k assets with minimum or maximum
multivariate tail dependence, which hereafter will be called the k-set. This idea is quite general and
allows one to opt for two different risk management strategies: defensive, when we choose assets with
minimum lower tail dependence, or aggressive, when we choose assets with maximum upper tail
dependence. The former is recommendable from a financial crisis perspective, whereas the latter can be used
when high returns are expected in the market. As stated before, in this paper we limit our empirical
analysis to the first case. In addition, we suppose we have a reference asset that must necessarily be
included in the k-set.
The first problem is to define what we mean when we speak of minimum tail dependence. As a
matter of fact, while in a bivariate context tail dependence is described by a unique coefficient, in a
multivariate analysis there are many coefficients describing the tail dependence structure of a k-set.
For example, there are coefficients like equation (5) or like equation (6) and, for each of these two
types, all the marginal coefficients measuring tail dependence in subsets of the k-set. Hence, we have
to decide which is our main goal; that is, which is the multivariate tail dependence coefficient we want
to minimize. From a financial point of view, recalling the chain effect described by equation (7), we
think that the coefficient (6) provides a good description of the tail dependence structure of the k-set;
thus, its minimization will be the goal of our asset selection.
The second problem is to formulate a procedure to select the k assets with the minimum (6). Given
a reference asset, a first way consists in using copula functions in order to estimate equation (6)
relative to all the C(p − 1, k − 1) possible k-sets containing the reference asset and then choosing the k-set exhibiting the
minimum estimated value. Although the approach with copula functions has been recognized to offer
good estimates of the tail dependence coefficients, it rapidly becomes cumbersome to apply in a high-
dimensional context, because it requires the optimization of a very complex likelihood function.
Hence, this procedure can be applied only for very small values of k, as will be clear in the real data
example presented in Section 3.1.
The third problem is to set the value of k. Since coefficient (6) tends to decrease as k increases, a
straightforward solution is to run the procedure for k = 2, k = 3, …, each time selecting the set that
minimizes equation (6), and to stop as soon as this minimum is sufficiently low; that is, we choose
the smallest k-set with a chain effect risk lower than a given threshold.
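A minimal sketch of this stopping rule follows; select_k_set and fit_theta are hypothetical stand-ins for the asset selection step and the copula maximum likelihood fit, and the threshold value is illustrative:

```python
def smallest_low_dependence_set(select_k_set, fit_theta, threshold=0.005, k_max=10):
    """Grow k until the chain-effect coefficient (6) of the selected k-set,
    lambda_L = k^(-1/theta) under an exchangeable Joe-Clayton copula,
    falls below the threshold."""
    for k in range(2, k_max + 1):
        assets = select_k_set(k)              # k-set containing the reference asset
        lam = k ** (-1 / fit_theta(assets))   # coefficient (6) for the fitted theta
        if lam < threshold:
            return assets, lam
    return assets, lam                        # fall back to the largest k tried

# Toy stand-ins: theta = 0.2 gives lambda_L = k^(-5), which first drops
# below 0.005 at k = 3
assets, lam = smallest_low_dependence_set(lambda k: list(range(k)), lambda a: 0.2)
```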
2 The MSCI World Index consists of the following developed market country indices: Australia, Austria, Belgium, Canada,
Denmark, Finland, France, Germany, Greece, Hong Kong, Ireland, Italy, Japan, Netherlands, New Zealand, Norway, Portugal,
Singapore, Spain, Sweden, Switzerland, the UK, and the USA.
For k = 3 we estimate all the possible trivariate Joe–Clayton copulae. We have to estimate 231
copulae. We find that the minimum lower tail dependence coefficient, equation (6), is obtained for
the triplet MSCI-Italy, MSCI-Japan, MSCI-USA, with λL^(It,Ja|US) = 0.0061. The value is again above
the threshold, so we should continue increasing k, but the computational burden is now very high. In
fact, for k = 4 we have to estimate 1540 copulae, and for k = 5 as many as 7315, and so
on. It is clear that, when the dimension of the problem increases, the method described rapidly becomes
dramatically time consuming, both for the increasing number of estimates and for the increasing
complexity of the likelihood function. An alternative procedure is then recommended.
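The counts above are binomial coefficients: with the reference asset fixed, there are C(p − 1, k − 1) candidate k-sets among p assets. Taking p = 23, consistent with the 231, 1540 and 7315 quoted in the text, a quick check:

```python
from math import comb

p = 23  # number of indices in the dataset (the reference asset plus 22 candidates)
counts = {k: comb(p - 1, k - 1) for k in (3, 4, 5)}
print(counts)   # {3: 231, 4: 1540, 5: 7315}
```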
As shown above, when a large set of investment alternatives is available, we need a manageable selec-
tion procedure in terms of the computational burden. Here, we propose a heuristic approach (Section
4.2) based on the variable importance measurement by means of the algorithmic technique of Random
Forest, well known in the field of data mining, which will be recalled first in Section 4.1. The proce-
dure is aimed at selecting a k-set exhibiting a low (high) joint association in lower (upper) extreme
values out of a large set of investment opportunities. This is made by building a hierarchical structure
around a reference asset we are specifically interested in. The performance of the proposed procedure
is inspected with some simulation studies (Section 4.3).
order to identify informative predictors (Breiman, 2002). In the recent literature, M1 and M4 are
regarded as the two main RF variable importance measures:
• Measure 1—mean decrease in accuracy. At each tree of the RF, all the values of the hth covariate
are randomly permuted. New predictions are obtained from this permuted dataset, in which the predictive
role of the hth covariate is completely destroyed. The prediction error on this new dataset is compared
with the prediction error on the original one, and the M1 measure for the hth variable is given by the
difference between these two errors.
• Measure 4—total decrease in node impurities. At each node z in every tree only a small number of
variables are randomly chosen to split on, relying on some splitting criterion given by a variability/
heterogeneity index such as the mean square error for regression and the Gini index or the Shannon
entropy for classification. Let d(h,z) be the maximum decrease (over all the possible cutpoints) in
the index allowed by variable Xh at node z. Xh is used to split at node z if d(h,z) > d(w,z) for all
variables Xw randomly chosen at node z. The M4 measure is calculated as the sum of all decreases
in the RF due to the hth variable, divided by the number of trees.
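Both measures are exposed by common Random Forest implementations. The sketch below uses scikit-learn on simulated data (the data-generating choices and parameters are ours): feature_importances_ gives the impurity-based measure (M4, normalized to sum to one), while permutation_importance gives the permutation-accuracy measure (M1):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
# the binary response depends mainly on the first covariate
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=500) > 0).astype(int)

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
m4 = rf.feature_importances_   # mean decrease in node impurity (M4, normalized)
m1 = permutation_importance(rf, X, y, n_repeats=10, random_state=0).importances_mean  # M1

# covariate 0 should rank first under both measures
print(m4.argmax(), m1.argmax())
```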
qj,α : Pr(Zjt ≤ qj,α) = Pr(Tνj ≤ qj,α) = α

yjt = 1 if zjt ≤ qj,α, 0 otherwise    (8)
4. Perform an RF classification with response variable Yht and the others as predictors.
5. Compute variable importance measures and let I^(α) = {I1^(α), …, Ih−1^(α), Ih+1^(α), …, Ip^(α)}′ be the vector
containing the relative deviations of the p − 1 measures and their average (relative importances).
6. Repeat steps (2)–(5) for α = α0 − 0.01, α0 − 0.02, . . . , 0.01.
7. Collect the (p − 1) × Lt relative importances in the matrix IL = {I^(α0), I^(α0−0.01), …, I^(0.01)}.
qj,α : Pr(Zjt ≤ qj,α) = Pr(Tνj ≤ qj,α) = α

yjt = 1 if zjt > qj,α, 0 otherwise    (9)
4. Perform an RF classification with response variable Yht and the others as predictors.
5. Compute variable importance measures and let I^(α) = {I1^(α), …, Ih−1^(α), Ih+1^(α), …, Ip^(α)}′ be the vector
containing the relative deviations of the p − 1 measures and their average (relative importances).
6. Repeat steps (2)–(5) for α = α0 + 0.01, α0 + 0.02, . . . ,0.99.
7. Collect the (p − 1) × (100 − Ut) relative importances in the matrix IU = {I^(α0), I^(α0+0.01), …, I^(0.99)}.
ĪU = IU · 1 / (100 − Ut)

where 1 is a vector of ones, so that ĪU contains the row averages of IU; the vector ĪL is obtained
analogously as ĪL = IL · 1 / Lt.
The generic jth elements of the vectors ĪL and ĪU are measures of the average importance of jth asset
extreme values in the prediction of the extreme values of the ‘reference asset’, conditionally on the
extreme values of the other assets. Even so, the selection of a k-set of assets cannot be based only on
the vectors ĪL and ĪU. If we want, for example, k assets with a mutually low association in lower
extreme values, it is not correct to run procedure (A) once and select the k − 1 assets having the lowest
importance in predicting the ‘reference asset’ extreme values. In fact, this could lead to selecting assets
with low association with the ‘reference asset’, but highly associated with each other. In order to avoid
this distortion, procedures (A) and (B) have to be iterated k − 1 times, as described in the following
procedure (C).
In general, we desire a set of assets having a low association in lower extreme values, in order to
counterbalance negative shocks, or a high association in upper extreme values, in order to accentuate
3 Relying on heuristic reasoning, we recommend setting 10 ≤ Lt ≤ 20 and 80 ≤ Ut ≤ 90. Simulation studies show that within these ranges the procedure is very robust and the choice of Lt and Ut does not affect the results.
positive shocks. Hereafter, we will assume that procedures (A) and (B) are iterated in (C) in order to
select assets respectively with low and high association.
1. Choose the ‘main reference asset’: the asset that must necessarily be included in the k-set.
2. Set w = 1.
3. Apply procedure (A) and let ĪL(w) be the vector of the p − w average relative importances contained
in ĪL.
4. Select the asset, say the gth, which satisfies the rule

∑(i=1…w) ĪgL^(i) = min(s=1,2,…,p−w) ∑(i=1…w) ĪsL^(i)    (10)

where ĪgL^(i) is the average relative importance of the gth asset when w = i.
5. Prepare a new iteration of procedure (A) by removing the actual reference asset and setting the gth
asset as new reference asset.
6. Repeat steps (2)–(5) for w = 2,3, . . .,k − 2.
7. Repeat steps (2)–(4) for w = k − 1.
8. Repeat steps (1)–(7) using procedure (B) and replacing equation (10) with

∑(i=1…w) ĪgU^(i) = max(s=1,2,…,p−w) ∑(i=1…w) ĪsU^(i)    (11)
At the end of procedure (C), two k-sets of assets are selected out of the p assets in the dataset, having
respectively mutually low and mutually high extreme-value association. This selection is obtained with an
algorithmic approach which does not need any prior assumption on the multivariate joint distribution.
After the selection, we are ready to perform a more complete and accurate analysis through the
estimation of tail dependence coefficients with copula functions.
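The iteration of procedure (C) (low-association branch) can be sketched as follows, with importance() a hypothetical stand-in for one run of procedure (A), returning the average relative importances of the remaining candidates with respect to the current reference asset:

```python
def select_k_set(reference, assets, k, importance):
    """Greedy selection of procedure (C), low-association branch (a sketch)."""
    chosen = [reference]
    cum = {a: 0.0 for a in assets if a != reference}   # running sums of rule (10)
    current = reference
    for _ in range(k - 1):
        for a, v in importance(current, list(cum)).items():
            cum[a] += v
        g = min(cum, key=cum.get)    # asset with minimum cumulated importance
        chosen.append(g)
        del cum[g]
        current = g                  # g becomes the new reference asset
    return chosen

# toy importance: candidate a is always reported with importance a
toy = lambda ref, cands: {a: float(a) for a in cands}
print(select_k_set(0, [0, 1, 2, 3, 4], 3, toy))   # [0, 1, 2]
```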
Simulation 1. In the first simulation study, N = 1000 observations are randomly drawn from a mul-
tivariate 25-dimensional Student-t distribution X = (X1, . . .,X25) with zero mean vector and correlation
matrix
PX = diag(P1, P2, …, P5)

where P1, …, P5 are (5 × 5) blocks.
Figure 1. Left: rate of correct k-sets versus correlation coefficient (Simulation 1). Right: rate of correctly
selected variables versus k (Simulation 2)
Within each block, the variables are mutually correlated with correlation coefficient ρ, while variables
belonging to different blocks are uncorrelated. This data-generating process tries to emulate the existence
of five different markets composed of assets associated with each other, but not with assets belonging to
other markets.
Since, for a multivariate Student-t distribution with ν degrees of freedom, the correlation between two
marginal distributions directly affects their lower tail dependence, in this situation a good portfolio
should contain assets from different markets. We suppose we desire a k-set composed of k = 5
variables. We set X1 as the main reference variable, thus selecting k − 1 = 4 further variables
by means of the proposed heuristic procedure with Lt = 10. Fixing ν = 5, the procedure is repeated
r = 50 times for different values of ρ, ranging from 0.05 to 0.9. We consider ‘correct’ a k-set composed
of one variable per block. The rate of correct k-sets rapidly increases when the association among
variables becomes stronger (Figure 1, left). The simulation has also been carried out with different
degrees of freedom (ν = 15 and ν = 30), with the same results.
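The data-generating process of Simulation 1 can be reproduced with a short sketch (the function names are ours; the multivariate Student-t is sampled as a normal scale mixture):

```python
import numpy as np

def block_corr(n_blocks, block_size, rho):
    # block-diagonal correlation: rho within each block, 0 across blocks
    block = (1 - rho) * np.eye(block_size) + rho * np.ones((block_size, block_size))
    return np.kron(np.eye(n_blocks), block)

def rmvt(n, corr, df, rng):
    # multivariate Student-t: correlated normal draw divided by sqrt(chi2/df)
    z = rng.multivariate_normal(np.zeros(len(corr)), corr, size=n)
    s = np.sqrt(rng.chisquare(df, size=(n, 1)) / df)
    return z / s

rng = np.random.default_rng(0)
P = block_corr(5, 5, 0.6)        # five markets of five assets each
X = rmvt(1000, P, df=5, rng=rng)
```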
Simulation 2. In the second simulation study, N = 1000 observations are randomly drawn from a
multivariate 15-dimensional Student-t distribution X = (X1, . . .,X15) with zero mean vector and cor-
relation matrix
PX = [P1 P12 P13; P21 P2 P23; P31 P32 P3]
where P1 = P2 = P3 = (1 − 0.6)I5 + 0.6·J5 (with I5 the identity and J5 the (5 × 5) matrix of ones) and
P12 = P13 = P23 = P′21 = P′31 = P′32 are (5 × 5) matrices obtained as the product of the column vector
(0 0.1 0.2 0.3 0.4)′ and the five-dimensional row vector of ones. Thus, the 15 variables are divided
into three blocks. Each variable has a moderately
high correlation (ρ = 0.6) with variables belonging to the same block, but also has an increasing cor-
relation with the variables belonging to different blocks (ρ = 0 with the first variable of each block,
ρ = 0.1 with the second, ρ = 0.2 with the third, ρ = 0.3 with the fourth, ρ = 0.4 with the last). In this
Table I. Correct selections for each k

k     Correct selections
3     s3 = (X1, X6, X11)
4     s41 = (s3, X2)      s42 = (s3, X7)      s43 = (s3, X12)
5     s51 = (s41, X7)     s52 = (s42, X12)    s53 = (s43, X2)
6     s6 = (s3, X2, X7, X12)
7     s71 = (s6, X3)      s72 = (s6, X8)      s73 = (s6, X13)
8     s81 = (s71, X8)     s82 = (s72, X13)    s83 = (s73, X3)
9     s9 = (s6, X3, X8, X13)
10    s101 = (s9, X4)     s102 = (s9, X9)     s103 = (s9, X14)
11    s111 = (s101, X9)   s112 = (s102, X14)  s113 = (s103, X4)
12    s12 = (s9, X4, X9, X14)
situation, the selection of the k-set is more challenging. If we set X1 as the main reference variable,
then the algorithm should proceed by selecting first the first variables of each block, then one of
the second ones, then one of the third ones, and so on, until the remaining k − 1 variables are selected
(Table I).
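The block correlation matrix PX described above can be assembled as in the sketch below (numpy-based, following the stated formula for the off-diagonal blocks, P12 = (0 0.1 0.2 0.3 0.4)′ times a row of ones; an illustration, not the authors' code):

```python
import numpy as np

# Within-block: equicorrelation with rho = 0.6 on each (5 x 5) diagonal block.
P_block = (1 - 0.6) * np.eye(5) + 0.6 * np.ones((5, 5))

# Between-block: product of the column vector (0, 0.1, 0.2, 0.3, 0.4)'
# and a five-dimensional row vector of ones, as stated in the text.
v = np.array([0.0, 0.1, 0.2, 0.3, 0.4])
P_off = np.outer(v, np.ones(5))

# Assemble the full (15 x 15) matrix; the lower off-blocks are the
# transposes (P21 = P12', etc.), so P_X is symmetric.
P_X = np.block([
    [P_block, P_off,   P_off],
    [P_off.T, P_block, P_off],
    [P_off.T, P_off.T, P_block],
])
```

Under this construction P_X is symmetric with unit diagonal, and the cross-block correlations grow within each block (0, 0.1, . . ., 0.4) as described.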
The procedure is repeated r = 50 times with Lt = 10, for different values of k, from 3 to 12. For
each repetition, let ncv be the number of correct variables in the selected k-set; the average rate of
correctly selected variables is then computed over the r repetitions.
In this section we apply the proposed algorithmic procedure to the dataset described in Section 3.1,
in order to select a k-set with minimum lower tail dependence, choosing the smallest value of k ensuring
that coefficient (6) is lower than 0.005. As pointed out in Section 3.1, we first filter the data for
autocorrelation and heteroskedasticity by means of univariate Student-t AR-GARCH models applied
to the log-returns. The procedure is then carried out on the standardized residuals, setting Lt = 10
and with MSCI-Italy as the main reference asset.
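As an illustration of the standardization step, a minimal GARCH(1,1) filter is sketched below; the AR mean equation is omitted for brevity, and the parameter values in the example are placeholders, not the fitted values from the paper.

```python
import math

def garch_std_residuals(returns, omega, alpha, beta, mu=0.0):
    """Standardize a return series with a GARCH(1,1) variance recursion:
    sigma2_t = omega + alpha * e_{t-1}**2 + beta * sigma2_{t-1},
    returning z_t = (r_t - mu) / sigma_t."""
    sigma2 = omega / (1.0 - alpha - beta)  # start at the unconditional variance
    z = []
    for r in returns:
        e = r - mu
        z.append(e / math.sqrt(sigma2))
        sigma2 = omega + alpha * e * e + beta * sigma2  # next period's variance
    return z

# Illustrative parameters, not the fitted values from the paper
z = garch_std_residuals([0.01, -0.02, 0.015, -0.005],
                        omega=1e-6, alpha=0.05, beta=0.90)
```

The standardized residuals z, approximately free of volatility clustering, are what the selection procedure and the copula estimation operate on.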
We run the procedure for k = 2, k = 3, . . ., each time computing the coefficient (6) and stopping
when it falls below the chosen threshold.4 The step-by-step results are as follows:
• k = 2
Procedure (A) is applied with MSCI-Italy as reference asset, and the vector Ī_L^(1) (so denoted
because it is computed at the first iteration) is obtained (Figure 2, top left). MSCI-Japan is selected.
The first k-set (k = 2) is then {It, Ja} (consistent with the results obtained in Section 3.1), with
λ^L_{It|Ja} = 0.0201. This value is greater than the given threshold of 0.005, so we have to increase k.
4
The computations of the asset selection algorithm are carried out using the randomForest library for R (R
Development Core Team, 2006). The R script is available by e-mail request to the corresponding author.
[Figure 2. Bar graphs of the average relative importances at each iteration. Top left: Iteration 1, Italy vs. remaining indexes; top right: Iteration 2, Japan vs. remaining indexes; bottom left: Iteration 3, USA vs. remaining indexes; bottom right: Iteration 4, Greece vs. remaining indexes.]
• k = 3
Since the algorithmic procedure is hierarchical, the first step is the same as for k = 2. After the
selection of MSCI-Japan, MSCI-Italy is removed from the dataset, procedure (A) is applied with
MSCI-Japan as reference asset and the vector Ī_L^(2) is obtained (Figure 2, top right). For each index
in the dataset the values in Ī_L^(1) and Ī_L^(2) are summed (Figure 3, top right). MSCI-USA (exhibiting the
minimum summed value) is selected. The second k-set (k = 3) is then {It, Ja, US} (again, consistent
with the results obtained in Section 3.1), with λ^L_{It,Ja|US} = 0.0061. We have to increase k.
Figure 3. Bar graphs of the summed standardized variable importance measures Σ_{i=1}^{w} Ī_{sL}^{(i)} for w = 1, 2, 3, 4 and s = 1, 2, . . ., 23 − w
• k = 4
Again, the first two steps are the same as for k = 3. After the selection of MSCI-USA, MSCI-Japan
is removed from the dataset, procedure (A) is applied with MSCI-USA as reference asset and the
vector Ī_L^(3) is obtained (Figure 2, bottom left). For each index in the dataset, the values in Ī_L^(1), Ī_L^(2)
and Ī_L^(3) are summed (Figure 3, bottom left). MSCI-Greece (exhibiting the minimum summed value)
is selected. The third k-set (k = 4) is then {It, Ja, US, Gr}, with λ^L_{It,Ja,US|Gr} = 0.0055. We have to
increase k.
• k = 5
The first three steps are the same as for k = 4. After the selection of MSCI-Greece, MSCI-USA is
removed from the dataset, procedure (A) is applied with MSCI-Greece as reference asset and the
vector Ī_L^(4) is obtained (Figure 2, bottom right). For each index in the dataset, the values in Ī_L^(1), Ī_L^(2),
Ī_L^(3) and Ī_L^(4) are summed (Figure 3, bottom right). MSCI-New Zealand (exhibiting the minimum
summed value) is selected, with λ^L_{It,Ja,US,Gr|NZ} = 0.0015. The desired threshold has been reached, so
the final k-set (k = 5) is given by {It, Ja, US, Gr, NZ}.
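The iterative scheme above can be summarized in a short sketch. Here absolute correlation stands in for the random-forest variable importances actually used by procedure (A), and tail_coef is a hypothetical callback implementing coefficient (6); all names are illustrative, not the authors' code.

```python
import numpy as np

def select_k_set(data, names, ref, threshold=0.005, tail_coef=None):
    """Greedy selection mimicking the iterative scheme: at each step,
    score every remaining variable against the current reference,
    accumulate the scores across iterations, pick the variable with
    the MINIMUM cumulative score (weakest association), and use it as
    the next reference. |Pearson correlation| is a cheap stand-in for
    the random-forest importance of each candidate."""
    idx = {n: i for i, n in enumerate(names)}
    selected = [ref]
    remaining = [n for n in names if n != ref]
    cum = {n: 0.0 for n in names}
    while remaining:
        r = selected[-1]
        for n in remaining:
            c = np.corrcoef(data[:, idx[r]], data[:, idx[n]])[0, 1]
            cum[n] += abs(c)                         # accumulate importances
        pick = min(remaining, key=lambda n: cum[n])  # weakest cumulative link
        selected.append(pick)
        remaining.remove(pick)
        if tail_coef is not None and tail_coef(selected) < threshold:
            break                                    # stopping rule on coefficient (6)
    return selected
```

For instance, with three columns where A and B are strongly tied and C is independent, select_k_set(data, ["A", "B", "C"], ref="A") picks C right after A, mirroring the minimum-dependence logic of the procedure.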
Tables II–V show details of the estimation of the parameters of the Joe–Clayton copulae. In
order to check the goodness of fit of the estimated copulae, extending Dobric and Schmid (2006), we
have considered the general null hypothesis that a multivariate dataset can be described by a specified
copula:

H0: (X1, X2, …, Xn) has copula C
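For reference, a minimal sketch of the bivariate Joe–Clayton (BB7) copula and the tail dependence coefficients implied by its parameters θ and κ; the parameterization below is the standard BB7 form and should be checked against the paper's own equations.

```python
def joe_clayton(u, v, theta, kappa):
    """Bivariate Joe-Clayton (BB7) copula, kappa >= 1, theta > 0:
    C(u, v) = 1 - (1 - [(1 - (1-u)^kappa)^(-theta)
                        + (1 - (1-v)^kappa)^(-theta) - 1]^(-1/theta))^(1/kappa)."""
    if u == 0.0 or v == 0.0:
        return 0.0  # copula boundary: C(u, 0) = C(0, v) = 0
    a = (1.0 - (1.0 - u) ** kappa) ** (-theta)
    b = (1.0 - (1.0 - v) ** kappa) ** (-theta)
    return 1.0 - (1.0 - (a + b - 1.0) ** (-1.0 / theta)) ** (1.0 / kappa)

def lower_tail(theta):
    """Bivariate lower tail dependence implied by the Clayton component."""
    return 2.0 ** (-1.0 / theta)

def upper_tail(kappa):
    """Bivariate upper tail dependence implied by the Joe component."""
    return 2.0 - 2.0 ** (1.0 / kappa)
```

With the Table II estimates (θ = 0.1773), lower_tail gives approximately 0.0201, matching the reported bivariate λ^L_{It|Ja}.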
Table II (k = 2)
θ      0.1773    0.0289
κ      1.0757    0.0261
CvM    0.1126
λ^L_{It|Ja}      0.0201

Table III (k = 3)
θ      0.2155    0.0191
κ      1.0857    0.0192
CvM    0.4222
λ^L_{It,Ja|US}   0.0061

Table IV (k = 4)
θ      0.2664    0.0159
κ      1.1007    0.0168
CvM    0.7393
λ^L_{It,Ja,US|Gr}    0.0055
Table V (k = 5)
θ      0.2486    0.0130
κ      1.0912    0.0145
CvM    0.8593
λ^L_{It,Ja,US,Gr|NZ}    0.0015
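As a numerical cross-check (our assumption, in the spirit of De Luca and Rivieccio (2010) on Archimedean tail coefficients), the lower tail dependence values reported in Tables II–V coincide, to the printed precision, with n^(−1/θ), the value implied by the Clayton component in dimension n:

```python
# Hypothesis: the lower tail coefficient equals n ** (-1/theta),
# where n is the dimension of the k-set. Checking against Tables II-V:
pairs = [
    (2, 0.1773, 0.0201),  # {It, Ja}
    (3, 0.2155, 0.0061),  # {It, Ja, US}
    (4, 0.2664, 0.0055),  # {It, Ja, US, Gr}
    (5, 0.2486, 0.0015),  # {It, Ja, US, Gr, NZ}
]
for n, theta, reported in pairs:
    implied = n ** (-1.0 / theta)
    print(n, round(implied, 4), reported)  # the rounded values agree
```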
This is equivalent to testing

H0*: S(X1, X2, …, Xn) ~ χ²_n

where S denotes the transformation of the data based on Rosenblatt's transform (Dobric and Schmid, 2006).
This hypothesis can be tested using one of the goodness-of-fit tests popularized in the literature, such
as the Cramér–von Mises (CvM) test. In almost all the cases, the hypothesis is accepted at the 1%
significance level (critical value: 0.743). Only in the last case (n = 5) does the test statistic slightly exceed the
critical value.
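A minimal sketch of the CvM statistic for a fully specified null distribution; the χ²_n case of H0* only requires passing the corresponding CDF as the cdf callable (names are illustrative, not the authors' code).

```python
def cvm_statistic(sample, cdf):
    """Cramer-von Mises statistic against a fully specified CDF:
    W^2 = 1/(12N) + sum_j (cdf(x_(j)) - (2j - 1) / (2N))^2
    over the ordered sample x_(1) <= ... <= x_(N)."""
    xs = sorted(sample)
    n = len(xs)
    w2 = 1.0 / (12.0 * n)
    for j, x in enumerate(xs, start=1):
        w2 += (cdf(x) - (2.0 * j - 1.0) / (2.0 * n)) ** 2
    return w2

# PIT example: under the correct CDF the transformed values are uniform;
# the "perfect" sample u_j = (2j-1)/(2N) attains the minimum 1/(12N).
u = [(2 * j - 1) / 20.0 for j in range(1, 11)]
w = cvm_statistic(u, cdf=lambda t: t)
```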
Figure 4. Left: efficient frontier of the selected set of assets together with the efficient frontiers of 100
randomly selected five-dimensional sets of assets including the reference asset. Right: returns of the minimum
variance portfolio for the selected assets against returns of the 100 minimum variance competitor portfolios in
the second semester of 2008
6. CONCLUDING REMARKS
This paper deals with the problem of selecting a subset of financial assets out of a large set of invest-
ment alternatives. The aim driving the selection is to choose assets having either low association in
the lower tail or high association in the upper tail, according to the investment strategy. We mainly
investigated the former case, assuming a financial crisis perspective for the empirical analysis.
Copula functions are powerful tools for estimating tail dependence coefficients. Nevertheless, we
have found that their use for the above-described selection problem is infeasible owing to an excessive
computational burden. For this reason we have explored the possibility of building a heuristic procedure
making use of algorithmic tools widely used in the field of data mining. The proposed selection
procedure has first been checked with two simulation studies, then applied to real data of MSCI indices,
in order to select a subset of developed markets to invest in. In the lower tail, we focused our attention
on the chain effect risk; that is, the probability of a default of the whole set of indices, given the default
of one of them.
The proposed procedure is heuristic and merely descriptive, but it can be applied in many empirical
contexts. With large datasets it provides a useful preliminary inspection of the relationships among the
variables in the tails of their joint distribution. This enables a form of dimensionality reduction
which allows one to employ more efficient estimation tools, such as a copula function.
Several developments in this direction are possible.
First, the procedure can be generalized by avoiding the preliminary definition of a reference asset.
The idea of choosing a reference asset is justified in many empirical analyses, when
there effectively exists an asset we necessarily want to include in the portfolio, or when we want to
add new investments to an existing one. Nonetheless, in some cases it can be limiting. The problem
can easily be overcome by simply running the procedure as many times as the number p of variables in
the dataset, each time choosing a different main reference asset. At the end, p different k-sets are
selected. Each one can be analysed in depth by means of copula functions, and finally the best one can
be chosen according to a defined criterion.
Second, in the analysis presented the chain effect is evaluated at the same time t. In other words,
coefficient (6) measures the probability of a simultaneous default of all the indices. This is not overly
limiting, as it is well known that a default in a developed market rapidly propagates to the others.
However, the inclusion of lagged variables in the dataset could improve the results, taking
account of delays in default propagation and of effects due to different time zones.
Third, other types of copula function can be used to estimate the tail dependence coefficients, with
goodness-of-fit statistics then used to decide which one best reproduces the joint distribution of the data.
Finally, further investigation could address the distribution of the tail dependence coefficient
estimates.
REFERENCES
Aas K, Berg D. 2009. Models for construction of multivariate dependence—a comparison study. The European
Journal of Finance 15: 639–659.
Bouyé E, Salmon M. 2009. Dynamic copula quantile regressions and tail area dynamic dependence in Forex
markets. The European Journal of Finance 15: 721–750.
Breiman L. 1996. Bagging predictors. Machine Learning 24: 123–140.
Breiman L. 2001. Random forests. Machine Learning 45: 5–32.
Breiman L. 2002. Manual on setting up, using, and understanding random forests v3.1. http://oz.berkeley.edu/
users/breiman.
Breiman L, Friedman JH, Olshen RA, Stone CJ. 1984. Classification and Regression Trees. Chapman & Hall,
New York.
Cherubini U, Luciano E, Vecchiato W. 2004. Copula Methods in Finance. John Wiley and Sons, Inc., New York.
De Luca G, Rivieccio G. 2010. Multivariate tail dependence coefficients for Archimedean copulae. Unpublished
research.
Dobric J, Schmid F. 2006. A goodness of fit test for copulas based on Rosenblatt’s transformation. Computational
Statistics and Data Analysis 51: 4633–4642.
Engle RF. 2002. Dynamic conditional correlation: a simple class of multivariate generalized autoregressive
conditional heteroskedasticity models. Journal of Business and Economic Statistics 20: 339–350.
Fortin I, Kuzmics C. 2002. Tail-dependence in stock-returns pairs. International Journal of Intelligent Systems
in Accounting, Finance & Management 11: 89–107.
Friedman JH. 2001. Greedy function approximation: a gradient boosting machine. Annals of Statistics 29:
1189–1232.
Friedman JH, Popescu BE. 2005. Predictive learning via rule ensembles. Technical report, Stanford University.
Genest C, Gendron M, Bourdeau-Brien M. 2009. The advent of copulas in finance. The European Journal of
Finance 15: 609–618.
Joe H. 1997. Multivariate Models and Dependence Concepts. Chapman & Hall, New York.
Jondeau E, Rockinger M. 2006. The copula–GARCH model of conditional dependencies: an international stock
market application. Journal of International Money and Finance 25: 827–853.
McNeil AJ. 1999. Extreme value theory for risk managers. In Internal Modelling and CAD II: Qualifying and
Quantifying Risk within a Financial Institution. RISK Books: London; 93–113.
Nelsen RB. 2006. An Introduction to Copulas. Springer-Verlag, New York.
Palaro HP, Hotta LK. 2006. Using conditional copula to estimate value at risk. Journal of Data Science 4: 93–115.
Patton A. 2006. Modelling asymmetric exchange rate dependence. International Economic Review 47: 527–556.
R Development Core Team. 2006. R: A Language and Environment for Statistical Computing. R Foundation for
Statistical Computing, Vienna.
Sklar A. 1959. Fonctions de répartition à n dimensions et leurs marges. Publications de l’Institut de Statistique
de l’Université de Paris 8: 229–231.