
INTELLIGENT SYSTEMS IN ACCOUNTING, FINANCE AND MANAGEMENT
Intell. Sys. Acc. Fin. Mgmt. 17, 91–109 (2010)
Published online in Wiley Online Library (wileyonlinelibrary.com) DOI: 10.1002/isaf.315

COMBINING RANDOM FOREST AND COPULA FUNCTIONS:


A HEURISTIC APPROACH FOR SELECTING ASSETS FROM A
FINANCIAL CRISIS PERSPECTIVE

GIOVANNI DE LUCA^a, GIORGIA RIVIECCIO^a AND PAOLA ZUCCOLOTTO^b*

^a Università degli Studi di Napoli Parthenope, Dipartimento di Statistica e Matematica per la Ricerca Economica, Naples, Italy
^b Università degli Studi di Brescia, Dipartimento di Metodi Quantitativi, Brescia, Italy

SUMMARY
In this paper we propose a heuristic strategy aimed at selecting and analysing a set of financial assets, focusing attention on their multivariate tail dependence structure. The selection, obtained through an algorithmic procedure based on data mining tools, assumes the existence of a reference asset we are specifically interested in. The procedure allows one to opt for two alternatives: to prefer those assets exhibiting either a minimum lower tail dependence or a maximum upper tail dependence. The former can be an advisable choice in a financial crisis period. For the selected assets, the tail dependence coefficients are estimated by means of a proper multivariate copula function. Copyright © 2010 John Wiley & Sons, Ltd.

Keywords: copula functions; Archimedean copula; tail dependence; Random Forest

1. INTRODUCTION

In recent decades financial markets have been characterized by increasing globalization, and a complex set of relationships among asset returns has been established. In spite of these close connections, cross-market correlation coefficients rarely assume significantly high values. In this context, Engle (2002) demonstrated that dynamic correlation should preferably be analysed, showing that asset returns are positively associated conditionally on market volatility. The presence of a stronger
dependence when markets are more volatile (and especially during crises) suggests investigating the
presence of an appreciably higher association in the tails of the joint distribution. In the literature this
phenomenon is known as tail dependence. The main feature of joint distributions characterized by tail
dependence is the presence of heavy and possibly asymmetric tails; thus, the traditional hypothesis of
(multivariate) Gaussianity is completely inadequate. In the absence of a reasonable alternative distri-
butional assumption, a copula approach can be particularly interesting.
Copula functions are effective quantitative tools for modelling the joint dependence of random
variables; for example, see Joe (1997), Cherubini et al. (2004) and Nelsen (2006). The use of copula
functions in finance is recent and the history of its rapid growth can be read in Genest et al. (2009).
Applications of copula functions to bivariate financial time series have been carried out for capturing
the dynamics of the dependence structure (Jondeau and Rockinger, 2006; Patton, 2006; Bouyé and

* Correspondence to: Paola Zuccolotto, Dipartimento Metodi Quantitativi, C.da S. Chiara, 50-25122 Brescia, Italy.
E-mail: [email protected]


Salmon, 2009), for estimating the value-at-risk (Palaro and Hotta, 2006) or for measuring the tail
dependence (Fortin and Kuzmics, 2002).
The main advantage of copula functions is that they allow us to model the marginal distributions preliminarily and separately; the marginals are then joined into a multivariate distribution. A second desirable property is that some copula functions imply very flexible joint distributions, able to support an in-depth analysis of the tail dependence structure. Unfortunately, this is not the case for the most common copula family, the elliptical family, including the Gaussian and Student-t copulae: the former has no tail dependence, while the latter imposes equal lower and upper tail dependence. On the other hand, the Archimedean family allows for different lower and upper tail dependence.
Owing to the complex structure of financial markets, a high-dimensional multivariate approach to
tail dependence analysis is surely more insightful. However, a growing number of jointly modelled
variables hugely and rapidly increases formal and computational complexity. In an analysis of geo-
graphical indices, for example, a joint study of all the markets in the world is impossible. In general,
even a drastic restriction of the study to the so-called developed markets is not sufficient to provide a
manageable number of variables.
To cope with the dimensionality problem, a number of strategies based on the reduction of a mul-
tivariate copula to a cascade of bivariate copulae can be found in the literature (see Aas and Berg
(2009) for a detailed comparison of the proposed techniques). Alternatively, a procedure for selecting the most suitable assets (according to some definite rule) is necessary. However, in high-dimensional contexts, the selection procedure itself can be computationally burdensome.
In this paper we propose to perform the selection using data mining tools. We approach the problem heuristically and propose an algorithmic procedure, based on the recent Random Forest technique (see Breiman, 2001), in order to appropriately select the assets we want to introduce into an analysis aimed at investigating tail dependence. The selection is built with a hierarchical structure around an asset we are interested in. In other words, we first choose a reference asset that we necessarily want to include in the set of analysed assets, as frequently happens in investment strategies, and then, after filtering autocorrelation and heteroskedasticity from the data, we select the other assets step by step, adding one asset at each step until a termination criterion is satisfied. For the selected assets, we propose to use a copula approach in order to estimate the tail dependence coefficients.¹
Thus, the aim of this paper is to propose a structured procedure for selecting and analysing a set of
financial assets, focusing attention on their joint tail dependence.
Given a large set of financial assets, the proposed strategy is organized in three steps:

1. Application of univariate models to the financial returns, in order to filter autocorrelation and heteroskedasticity from the data.
2. Selection of k financial assets (including the reference one) whose returns exhibit a low (high) level of lower (upper) tail dependence. In detail, from an investment perspective, we could aim at including in a portfolio assets with low association in the case of negative shocks, or assets with high association in the case of positive shocks. The choice between the two strategies depends upon the expectations about the future trends of the financial markets and upon the desired risk degree of the investors. In this paper we mimic a financial crisis perspective; hence, we will focus on the selection of assets with low association in the lower tail. The extension to the upper tail is straightforward.
3. Estimation of the multivariate tail dependence coefficients.

¹ An alternative idea in this context is to model the tails of the multivariate distributions using extreme value theory; see McNeil (1999).


For each step we propose a specific statistical tool:

1. an AR-GARCH model for univariate returns in order to obtain standardized residuals;


2. an algorithmic heuristic selection procedure with a hierarchical structure, based on the analysis
of joint association in the extreme values by means of a data mining technique called Random
Forest;
3. a copula function for the estimation of multivariate tail dependence coefficients.

The paper is organized as follows. In Section 2 the theory of copula functions is briefly recalled
and the tail dependence coefficients are defined. Section 3 describes an empirical problem showing
that a selection procedure using copula functions is computationally unfeasible. Section 4, after briefly
recalling the Random Forest algorithm, illustrates the functioning of the proposed asset selection
procedure, also presenting the results of two simulation studies. An application to real data is shown
in Section 5. Section 6 concludes.

2. COPULA FUNCTIONS AND TAIL DEPENDENCE

In the multivariate analysis of returns, the assumption about their distribution is a critical issue. In the past the hypothesis of Gaussianity has been largely exploited, but it has seldom provided satisfactory results in terms of density forecasts, which are very useful in risk management (e.g. to compute the value-at-risk or the expected shortfall). In recent years, a great deal of interest in non-normal probability laws has contributed to overcoming the traditional Gaussian distribution. The multivariate t and skew-t distributions are remarkable examples. However, a high degree of flexibility can be reached using a copula function.
A copula function is a multivariate distribution function with standard uniform marginal distribu-
tions. According to the theorem proposed by Sklar (1959), each distribution function H(x1,x2, . . . ,xn)
can be expressed by a copula function whose arguments are the univariate distribution functions; that is:

$H(x_1, x_2, \ldots, x_n) = C(F_1(x_1), F_2(x_2), \ldots, F_n(x_n))$    (1)

If the distribution function H is continuous, then the copula C is unique. Conversely, if C is a copula
and F1(x1), F2(x2), . . . , Fn(xn) are the marginal distributions, then H(x1,x2, . . . ,xn) is a joint distribution
function with margins Fi(·).
The main advantage of using a copula function is that the specification of the marginal distributions
can be separated from the definition of the dependence structure.
It is common to denote ui = Fi(xi), so that equation (1) is usually presented as

$H(x_1, x_2, \ldots, x_n) = C(u_1, u_2, \ldots, u_n)$

The most popular families of copula functions are the elliptical and the Archimedean. Among the
elliptical copulae, a prominent role is assigned to the Gaussian and the Student’s t copulae.
The Archimedean copulae are defined through a generator function $\Phi: I \to \mathbb{R}^+$, continuous, decreasing and convex, such that $\Phi(1) = 0$. A bivariate Archimedean copula is expressed as

$C(u_1, u_2) = \Phi^{-1}(\Phi(u_1) + \Phi(u_2))$


In the n-dimensional case, an Archimedean copula function is defined as

$C(u_1, \ldots, u_n) = \Phi^{-1}(\Phi(u_1) + \Phi(u_2) + \cdots + \Phi(u_n))$

A remarkable example, widely used in financial time-series analyses, is the Clayton copula, given by

$C(u_1, \ldots, u_n) = \left[\sum_{i=1}^{n} u_i^{-\theta} - (n - 1)\right]^{-1/\theta}$    (2)

characterized by the parameter θ > 0. The presence of a unique parameter, which captures the co-movements only between extremely low values (that is, in the lower tail of the distribution), gives this copula function a limited capacity to explain very complex relationships among n variables. A more satisfying description of these relationships can be achieved by considering a copula function able to capture the movements in both tails, such as the Joe–Clayton copula (also known as the BB7 copula in the bivariate case, according to the classification in Joe (1997)), which is a generalization of the Clayton copula. It is formalized as

$C(u_1, \ldots, u_n) = 1 - \left\{1 - \left[\sum_{i=1}^{n} [1 - (1 - u_i)^\kappa]^{-\theta} - (n - 1)\right]^{-1/\theta}\right\}^{1/\kappa}$    (3)

and is then characterized by two parameters, θ > 0 and κ ≥ 1. The generator function and its inverse are given by $\Phi(t) = [1 - (1 - t)^\kappa]^{-\theta} - 1$ and $\Phi^{-1}(t) = 1 - [1 - (1 + t)^{-1/\theta}]^{1/\kappa}$ respectively. When κ = 1 we recover the Clayton copula.
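As an illustration, the following minimal R sketch (not taken from the paper; the function names are ours) evaluates the copula CDFs of equations (2) and (3) and checks numerically that κ = 1 reduces the Joe–Clayton copula to the Clayton copula.

```r
# Sketch of the n-dimensional Clayton (eq. 2) and Joe-Clayton (eq. 3) CDFs.
clayton_cdf <- function(u, theta) {
  (sum(u^(-theta)) - (length(u) - 1))^(-1 / theta)
}

joe_clayton_cdf <- function(u, theta, kappa) {
  n <- length(u)
  inner <- sum((1 - (1 - u)^kappa)^(-theta)) - (n - 1)
  1 - (1 - inner^(-1 / theta))^(1 / kappa)
}

u <- c(0.3, 0.5, 0.7)
joe_clayton_cdf(u, theta = 0.2, kappa = 1)  # with kappa = 1 ...
clayton_cdf(u, theta = 0.2)                 # ... the two values coincide
```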
In general, for an n-variate copula, the density is obtained by computing the nth-order mixed partial derivative

$c(u_1, \ldots, u_n) = \dfrac{\partial^n C(u_1, u_2, \ldots, u_n)}{\partial u_1\, \partial u_2 \cdots \partial u_n}$    (4)

The parameters are usually estimated through the maximum likelihood method.

2.1. Tail Dependence


Given two random variables, Xi and Xj, several measures of association can be considered. The most popular measures are linear correlation and concordance. However, there are many others. Tail dependence is a key measure when we are interested in risk management. It captures the concordance between extreme values of the variables. More specifically, we distinguish between the lower tail dependence and the upper tail dependence.
The lower tail dependence coefficient between two variables Xi and Xj, denoted by $\lambda_L^{i|j}$, is given by

$\lambda_L^{i|j} = \lim_{v \to 0^+} P(F_i(X_i) \le v \mid F_j(X_j) \le v) = \lim_{v \to 0^+} P(U_i \le v \mid U_j \le v)$

It measures the concordance between extremely low values of Xi and Xj.


Similarly, the upper tail dependence coefficient, denoted by $\lambda_U^{i|j}$, is defined as

$\lambda_U^{i|j} = \lim_{v \to 1^-} P(F_i(X_i) > v \mid F_j(X_j) > v) = \lim_{v \to 1^-} P(U_i > v \mid U_j > v)$

and measures the concordance between extremely high values of Xi and Xj.
The choice of the family and of the specific copula can be driven by the observed dependence
between extreme values. For example, the Gaussian copula does not allow for any tail dependence,
whereas the t-copula models dependence in the two tails in the same way. These copulae should be
chosen when the hypothesis of absence or equality of upper and lower tail dependence respectively
is reasonable.
On the other hand, Archimedean copulae are more manageable from this point of view. They admit
lower or upper tail dependence, or both, in a nonsymmetric way.
In the financial context, the lower tail dependence assumes a very important role. In fact, the depend-
ence between extremely low values of returns is a measure of the risk related to a set of assets. Thus,
a significant statistical tool for risk management can be a copula function able to model at least the
lower tail dependence properly.
The Joe–Clayton copula, equation (3), admits both lower and upper tail dependence, whose coefficients are given by $\lambda_L^{i|j} = 2^{-1/\theta}$ and $\lambda_U^{i|j} = 2 - 2^{1/\kappa}$ for i, j = 1, . . . , n and i ≠ j.

2.2. Multivariate Tail Dependence


Tail dependence is a bivariate concept. An extension of its definition to a multivariate context has been provided in De Luca and Rivieccio (2010). Given n variables, many conditional probabilities can be computed. For instance, we could be interested in the probability of an extremely low value of asset i, given that extremely low values have occurred for m assets:

$\lambda_L^{i|j_1 \ldots j_m} = \lim_{v \to 0^+} P(U_i \le v \mid U_{j_1} \le v, \ldots, U_{j_m} \le v)$

or we could be interested in the probability of extremely low values of m assets, given an extremely low value of the jth asset:

$\lambda_L^{i_1 \ldots i_m|j} = \lim_{v \to 0^+} P(U_{i_1} \le v, \ldots, U_{i_m} \le v \mid U_j \le v)$

and so on. These conditional probabilities are easily evaluated from the equation of the copula func-
tion. Multivariate upper tail dependence coefficients can be defined in an analogous way.
When we have to select a certain number of assets starting from a reference asset (referred to as
asset 1) to which other assets (referred to as assets 2, 3, . . .) are added in turn, we can compute a
sequence of lower tail dependence coefficients, following the order of the added assets, enlarging in
turn the information set, that is the conditioning event. So, after estimating

$\lambda_L^{1|2} = \lim_{v \to 0^+} P(U_1 \le v \mid U_2 \le v)$


the lower tail dependence coefficient between asset 1 and asset 2, we can estimate

$\lambda_L^{1|23} = \lim_{v \to 0^+} P(U_1 \le v \mid U_2 \le v, U_3 \le v)$

until

$\lambda_L^{1|2\ldots n} = \lim_{v \to 0^+} P(U_1 \le v \mid U_2 \le v, \ldots, U_n \le v)$    (5)

This sequence of conditional probabilities can be seen as a measure of the tail dependence of a reference asset (asset 1) on the other assets. It offers a clear view of the riskiness of asset 1 in a crisis period, describing the possible contagion in terms of probabilities.
At the same time, we could be interested in

$\lambda_L^{12|3} = \lim_{v \to 0^+} P(U_1 \le v, U_2 \le v \mid U_3 \le v)$

until

$\lambda_L^{1\ldots n-1|n} = \lim_{v \to 0^+} P(U_1 \le v, \ldots, U_{n-1} \le v \mid U_n \le v)$    (6)

It is worth noting that the following relation holds between the coefficient (6) and some marginal
coefficients of the form of equation (5):

$\lambda_L^{1\ldots n-1|n} = \lambda_L^{1|2\ldots n}\, \lambda_L^{2|3\ldots n}\, \lambda_L^{3|4\ldots n} \cdots \lambda_L^{n-1|n}$    (7)

Thus, the tail dependence coefficient (6) is able to measure what we call chain effect risk in an n-set
of assets; that is, the probability of a crisis for the entire n-set, given that a default has occurred for
asset n.
Applying these definitions, De Luca and Rivieccio (2010) have shown that for the n-dimensional
Joe–Clayton copula, equation (3), the above-mentioned lower tail dependence coefficients are
given by

$\lambda_L^{1|2\ldots n} = \left(\dfrac{n}{n-1}\right)^{-1/\theta}$

and

$\lambda_L^{1\ldots n-1|n} = n^{-1/\theta}$
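A one-line check (ours, not in the original) shows that these expressions are consistent with the chain relation (7): applying the first formula with n, n − 1, . . . , 2 variables in turn, each marginal coefficient involves one variable fewer, and the product telescopes:

\[
\lambda_L^{1|2\ldots n}\,\lambda_L^{2|3\ldots n}\cdots\lambda_L^{n-1|n}
= \left(\frac{n}{n-1}\cdot\frac{n-1}{n-2}\cdots\frac{2}{1}\right)^{-1/\theta}
= n^{-1/\theta}
= \lambda_L^{1\ldots n-1|n}
\]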

3. THE PROBLEM OF SELECTING ASSETS FROM A FINANCIAL CRISIS PERSPECTIVE

The estimation of tail dependence coefficients with a copula function can be used in order to select, out of a large set of p possible investment alternatives, a subset of k assets with minimum or maximum multivariate tail dependence, which hereafter will be called the k-set. This idea is quite general and allows one to opt for two different risk management strategies: defensive, when we choose assets with minimum lower tail dependence, or aggressive, when we choose assets with maximum upper tail dependence. The former is recommendable from a financial crisis perspective, whereas the latter can be used


when high returns are expected in the market. As stated before, in this paper we limit our empirical analysis to the first case. In addition, we suppose we have a reference asset that we necessarily want to include in the k-set.
The first problem is to define what we mean by minimum tail dependence. While in a bivariate context tail dependence is described by a unique coefficient, in a multivariate analysis there are many coefficients describing the tail dependence structure of a k-set. For example, there are coefficients like equation (5) or like equation (6) and, for each of these two types, all the marginal coefficients measuring tail dependence in subsets of the k-set. Hence, we have to decide which is our main goal; that is, which multivariate tail dependence coefficient we want to minimize. From a financial point of view, recalling the chain effect described by equation (7), we think that coefficient (6) provides a good description of the tail dependence structure of the k-set; thus, its minimization will be the goal of our asset selection.
The second problem is to formulate a procedure for selecting the k assets with the minimum (6). Given a reference asset, a first option consists in using copula functions to estimate equation (6) for all the $\binom{p-1}{k-1}$ k-sets containing the reference asset and then choosing the k-set exhibiting the minimum estimated value. Although the copula approach has been recognized to offer good estimates of the tail dependence coefficients, it rapidly becomes cumbersome to apply in a high-dimensional context, because it requires the optimization of a very complex likelihood function. Hence, this procedure can be applied only for very small values of k, as will be clear in the real data example presented in Section 3.1.
The third problem is to set the value of k. Since coefficient (6) tends to decrease as k increases, a straightforward solution is to run the procedure for k = 2, k = 3, . . . , until the coefficient becomes sufficiently small. In other words, for each value of k we select the set minimizing equation (6), and we stop as soon as this minimum falls below a given threshold; that is, we choose the smallest set of assets with a chain effect risk lower than the threshold.

3.1. Empirical Analysis of MSCI Geographical Indices


The dataset analysed is composed of 23 MSCI market capitalization weighted indices, designed to measure the equity market performance of developed markets,² recorded daily from 3 June 2002 to 10 June 2010 (2095 observations; source: MSCI Barra).
Let us suppose we designate MSCI-Italy as the reference asset. We want to select, out of the p = 23 indices, the k-set (k < p) containing MSCI-Italy with the minimum (6), and we require that this minimum be lower than a threshold that we decide to set at the value 0.005. The log-returns are first filtered by means of univariate Student-t AR-GARCH models and the tail dependence is evaluated on the corresponding standardized residuals. For k = 2 we estimate all the 22 possible bivariate Joe–Clayton copulae, equation (3), between MSCI-Italy and each of the candidate assets. We find that the minimum lower tail dependence coefficient, equation (6), is obtained for the pair MSCI-Italy and MSCI-Japan, with $\lambda_L^{It|Ja}$ = 0.0201. Since the value is above the defined threshold, we have to increase k.

² The MSCI World Index consists of the following developed market country indices: Australia, Austria, Belgium, Canada, Denmark, Finland, France, Germany, Greece, Hong Kong, Ireland, Italy, Japan, Netherlands, New Zealand, Norway, Portugal, Singapore, Spain, Sweden, Switzerland, the UK, and the USA.


For k = 3 we estimate all the possible trivariate Joe–Clayton copulae: 231 of them. We find that the minimum lower tail dependence coefficient, equation (6), is obtained for the triplet MSCI-Italy, MSCI-Japan, MSCI-USA, with $\lambda_L^{It,Ja|US}$ = 0.0061. The value is again above the threshold, so we should continue increasing k, but the computational burden is now very high. In fact, for k = 4 we would have to estimate 1540 copulae, for k = 5, 7315 copulae, and so on. It is clear that, when the dimension of the problem increases, the method described rapidly becomes dramatically time consuming, both for the increasing number of estimates and for the increasing complexity of the likelihood function. An alternative procedure is then recommended.
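The growth of the number of required estimates is immediate to verify in R (p = 23, reference asset fixed):

```r
# Number of k-sets containing the reference asset: choose(p - 1, k - 1).
p <- 23
sapply(2:6, function(k) choose(p - 1, k - 1))
# 22  231  1540  7315  26334
```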

4. A HEURISTIC PROCEDURE FOR ASSET SELECTION

As shown above, when a large set of investment alternatives is available, we need a selection procedure that is manageable in terms of computational burden. Here, we propose a heuristic approach (Section 4.2) based on variable importance measurement by means of the algorithmic technique of Random Forest, well known in the field of data mining, which is recalled first in Section 4.1. The procedure is aimed at selecting, out of a large set of investment opportunities, a k-set exhibiting a low (high) joint association in lower (upper) extreme values. This is done by building a hierarchical structure around a reference asset we are specifically interested in. The performance of the proposed procedure is inspected with some simulation studies (Section 4.3).

4.1. Variable Importance Measurement with Random Forest


Given a response variable and a set of covariates, variable importance measurement allows one to
identify the most important predictors for the response variable within the set of covariates. The pre-
diction problem is called classification if the response variable is categorical and regression if it is
numerical. Some powerful data mining tools have recently been proposed in the framework of learn-
ing ensembles, algorithmic techniques able to face both the problems of prediction and of variable
importance measurement, even in presence of many redundant predictors and of complex relationships
among the variables. Each ensemble member is given by a different function of the input covariates
and predictions are obtained by a linear combination of the prediction of each member (Breiman,
1996; Friedman and Popescu, 2005). Learning ensembles can be built using different prediction
methods; that is, different base learners as ensemble members. The most interesting proposals use
decision trees (more specifically, classification and regression trees; Breiman et al., 1984) as base
learners and are called tree-based learning ensembles. Popular examples are the Random Forest (RF)
technique (Breiman, 2001) or the tree-based gradient boosting machine (Friedman, 2001). Both these
algorithmic techniques identify the most important predictors within the set of covariates, by means
of the computation of some variable importance measures.
The RF technique with randomly selected inputs grows a sequence of trees, selecting at random, at each node, a small group of F input variables to split on. This procedure is often used in tandem with bagging (Breiman, 1996); that is, with a random selection of a subsample of the original training set for each tree. This simple and effective idea is founded on a complete theoretical apparatus analytically described by Breiman (2001) in his seminal work. The RF prediction is computed as an average of the single trees' predictions. This successfully neutralizes the well-known instability of decision trees. In addition, four measures of variable importance, M1, M2, M3, M4, are available in


order to identify informative predictors (Breiman, 2002). In the recent literature, M1 and M4 are regarded as the two main RF variable importance measures:

• Measure 1—mean decrease in accuracy. At each tree of the RF all the values of the hth covariate are randomly permuted. New predictions are obtained from this dataset, in which the role of the hth covariate has been completely destroyed. The prediction error on this new dataset is compared with the prediction error on the original one, and the M1 measure for the hth variable is given by the difference between these two errors.
• Measure 4—total decrease in node impurities. At each node z in every tree only a small number of variables are randomly chosen to split on, relying on some splitting criterion given by a variability/heterogeneity index such as the mean square error for regression and the Gini index or the Shannon entropy for classification. Let d(h,z) be the maximum decrease (over all the possible cutpoints) in the index allowed by variable Xh at node z. Xh is used to split at node z if d(h,z) > d(w,z) for all variables Xw randomly chosen at node z. The M4 measure is calculated as the sum of all the decreases in the RF due to the hth variable, divided by the number of trees.
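As a minimal illustration of how M1 and M4 can be obtained in practice, the following sketch uses the randomForest library for R (the library the paper itself employs; the toy data are ours). In this library, type = 1 and type = 2 of importance() correspond to the mean decrease in accuracy and the mean decrease in node impurity (Gini) respectively.

```r
library(randomForest)

set.seed(1)
X <- data.frame(matrix(rnorm(500 * 5), ncol = 5))            # five covariates
y <- factor(ifelse(X[, 1] + X[, 2] + rnorm(500) > 0, 1, 0))  # binary response

rf <- randomForest(x = X, y = y, ntree = 500, importance = TRUE)
importance(rf, type = 1)  # M1: mean decrease in accuracy
importance(rf, type = 2)  # M4: total decrease in node impurities (Gini)
```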

4.2. The Algorithm


In order to select a k-set of assets with a low (high) mutual extreme values association we set up a
heuristic algorithmic procedure based on the analysis of the association among extreme values within
a set of financial assets.
The proposed technique is based on RF variable importance measures, used for the special purpose of identifying either the most or the least influential predictors of a given outcome.
The procedure can be summarized as follows. Let $X_{1t}, X_{2t}, \ldots, X_{pt}$, t = 1, 2, . . . , T, be the log-returns time series of p assets. A Student-t AR-GARCH model is fitted to each series. Let $Z_{1t}, Z_{2t}, \ldots, Z_{pt}$ be the standardized residuals and $\hat\nu_1, \hat\nu_2, \ldots, \hat\nu_p$ their estimated degrees of freedom. At this point we have to choose a 'reference asset', say the hth, whose extreme values association with the others we want to analyse.

(A) Lower extreme values association

1. Set α = α0 = Lt/100, where Lt ∈ N and 0 < Lt < 50.


2. Using Student-t distributions with the estimated degrees of freedom $\hat\nu_j$, compute the quantiles $q_{1,\alpha}, q_{2,\alpha}, \ldots, q_{p,\alpha}$ of the p standardized residuals, where

   $q_{j,\alpha} : \Pr(Z_{jt} \le q_{j,\alpha}) = \Pr(T_{\hat\nu_j} \le q_{j,\alpha}) = \alpha$

3. Create p binary 0/1 series $Y_{1t}, Y_{2t}, \ldots, Y_{pt}$ according to the rule

   $y_{jt} = \begin{cases} 1 & \text{if } z_{jt} \le q_{j,\alpha} \\ 0 & \text{otherwise} \end{cases}$    (8)

4. Perform an RF classification with response variable $Y_{ht}$ and the others as predictors.
5. Compute the variable importance measures and let $I^{(\alpha)} = \{I_1^{(\alpha)}, \ldots, I_{h-1}^{(\alpha)}, I_{h+1}^{(\alpha)}, \ldots, I_p^{(\alpha)}\}'$ be the vector containing the relative deviations of the p − 1 measures from their average (the relative importances).
6. Repeat steps (2)–(5) for α = α0 − 0.01, α0 − 0.02, . . . , 0.01.


7. Collect the (p − 1) × Lt relative importances in the matrix $I_L = \{I^{(\alpha_0)}, I^{(\alpha_0 - 0.01)}, \ldots, I^{(0.01)}\}$.

8. Compute the vector of the average relative importances of the p − 1 predictors

   $\bar I_L = \dfrac{I_L \cdot \mathbf{1}}{L_t}$

   where $\mathbf{1}$ is the Lt × 1 vector of ones.
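A minimal R sketch of procedure (A) is given below. It is not the authors' script (which is available on request; see Section 5): Z is a hypothetical T × p matrix of standardized residuals, nu a vector of estimated degrees of freedom aligned with the columns of Z, h the index of the reference asset, and the relative importances are interpreted here as relative deviations of the M4 measures from their average.

```r
library(randomForest)

avg_rel_importance <- function(Z, nu, h, Lt = 10) {
  alphas <- seq(Lt / 100, 0.01, by = -0.01)          # steps 1 and 6
  I <- sapply(alphas, function(alpha) {
    q <- qt(alpha, df = nu)                          # step 2: Student-t quantiles
    Y <- 1 * sweep(Z, 2, q, "<=")                    # step 3: binary series, eq. (8)
    rf <- randomForest(x = as.data.frame(Y[, -h]),   # step 4: RF classification
                       y = factor(Y[, h]), ntree = 500)
    m <- importance(rf, type = 2)[, 1]               # step 5: M4 for the p - 1 predictors
    (m - mean(m)) / mean(m)                          # relative importances
  })
  rowMeans(I)                                        # steps 7-8: average over the alphas
}
```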

(B) Upper extreme values association

1. Set α = α0 = Ut/100, where Ut ∈ N and 50 < Ut < 100.³


2. Using Student-t distributions with the estimated degrees of freedom $\hat\nu_j$, compute the quantiles $q_{1,\alpha}, q_{2,\alpha}, \ldots, q_{p,\alpha}$ of the p standardized residuals, where

   $q_{j,\alpha} : \Pr(Z_{jt} \le q_{j,\alpha}) = \Pr(T_{\hat\nu_j} \le q_{j,\alpha}) = \alpha$

3. Create p binary 0/1 series $Y_{1t}, Y_{2t}, \ldots, Y_{pt}$ according to the rule

   $y_{jt} = \begin{cases} 1 & \text{if } z_{jt} \le q_{j,\alpha} \\ 0 & \text{otherwise} \end{cases}$    (9)

4. Perform an RF classification with response variable $Y_{ht}$ and the others as predictors.
5. Compute the variable importance measures and let $I^{(\alpha)} = \{I_1^{(\alpha)}, \ldots, I_{h-1}^{(\alpha)}, I_{h+1}^{(\alpha)}, \ldots, I_p^{(\alpha)}\}'$ be the vector containing the relative deviations of the p − 1 measures from their average (the relative importances).
6. Repeat steps (2)–(5) for α = α0 + 0.01, α0 + 0.02, . . . ,0.99.
7. Collect the (p − 1) × (100 − Ut) relative importances in the matrix $I_U = \{I^{(\alpha_0)}, I^{(\alpha_0 + 0.01)}, \ldots, I^{(0.99)}\}$.

8. Compute the vector of the average relative importances of the p − 1 predictors

   $\bar I_U = \dfrac{I_U \cdot \mathbf{1}}{100 - U_t}$

   where $\mathbf{1}$ is the (100 − Ut) × 1 vector of ones.

The generic jth elements of the vectors $\bar I_L$ and $\bar I_U$ measure the average importance of the jth asset's extreme values in the prediction of the extreme values of the 'reference asset', conditionally on the extreme values of the other assets. Even so, the selection of a k-set of assets cannot be based only on the vectors $\bar I_L$ and $\bar I_U$. If we want, for example, k assets with a mutually low association in lower extreme values, it is not correct to run procedure (A) and select the k − 1 assets having the lowest importance in predicting the 'reference asset' extreme values. In fact, this could lead to selecting assets with low association with the 'reference asset', but highly associated with each other. In order to avoid this pitfall, procedures (A) and (B) have to be iterated k − 1 times, as described in the following procedure (C).
In general, we desire a set of assets having a low association in lower extreme values, in order to
counterbalance negative shocks, or a high association in upper extreme values, in order to accentuate

³ Relying on heuristic reasoning, we recommend setting 10 ≤ Lt ≤ 20 and 80 ≤ Ut ≤ 90. Simulation studies show that within these ranges the procedure is very robust and the choice of Lt and Ut does not affect the results.


positive shocks. Hereafter, we will assume that procedures (A) and (B) are iterated in (C) in order to
select assets respectively with low and high association.

(C) Selection of k − 1 assets

1. Choose the ‘main reference asset’, the asset which has to be necessarily included in the k-set.
2. Set w = 1.
3. Apply procedure (A) and let $\bar I_L^{(w)}$ be the vector of the p − w average relative importances contained in $\bar I_L$.
4. Select the asset, say the gth, which satisfies the rule

   $\sum_{i=1}^{w} \bar I_{gL}^{(i)} = \min_{s=1,2,\ldots,p-w} \sum_{i=1}^{w} \bar I_{sL}^{(i)}$    (10)

   where $\bar I_{gL}^{(i)}$ is the average relative importance of the gth asset when w = i.
5. Prepare a new iteration of procedure (A) by removing the current reference asset and setting the gth asset as the new reference asset.
6. Repeat steps (2)–(5) for w = 2,3, . . .,k − 2.
7. Repeat steps (2)–(4) for w = k − 1.
8. Repeat steps (1)–(7) using procedure (B) and replacing equation (10) with

   $\sum_{i=1}^{w} \bar I_{gU}^{(i)} = \max_{s=1,2,\ldots,p-w} \sum_{i=1}^{w} \bar I_{sU}^{(i)}$    (11)

At the end of procedure (C), two k-sets of assets are selected out of the p assets in the dataset, having, respectively, a mutually low and a mutually high extreme values association. This selection is obtained with an algorithmic approach which does not need any prior assumption on the multivariate joint distribution. After the selection, we are ready to perform a more complete and accurate analysis through the estimation of tail dependence coefficients with copula functions.
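To fix ideas, here is a minimal R sketch of the greedy selection loop of procedure (C) for the lower tail, reusing the hypothetical avg_rel_importance() sketched above; it illustrates the logic of rule (10), not the authors' code.

```r
select_k_set <- function(Z, nu, ref_name, k, Lt = 10) {
  pool <- colnames(Z)                        # assets still in the dataset
  ref <- ref_name                            # step 1: the main reference asset
  cum <- setNames(numeric(ncol(Z)), pool)    # running sums for rule (10)
  selection <- ref_name
  for (w in 1:(k - 1)) {
    h <- match(ref, pool)
    imp <- avg_rel_importance(Z[, pool], nu[pool], h, Lt)  # procedure (A)
    cand <- pool[-h]
    cum[cand] <- cum[cand] + imp             # accumulate relative importances
    g <- cand[which.min(cum[cand])]          # rule (10): minimum summed importance
    selection <- c(selection, g)
    pool <- setdiff(pool, ref)               # step 5: drop the current reference
    ref <- g                                 # the selected asset becomes the reference
  }
  selection
}
```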

4.3. Simulation Studies in the Lower Tail


As stated before, in this paper we limit our analysis to a financial crisis perspective. Hence, we employ
procedure (A) described above, thus iterating only steps (1)–(7) of procedure (C). In this context we
carry out the following two simulation studies in order to check the performance of the algorithm.

Simulation 1. In the first simulation study, N = 1000 observations are randomly drawn from a mul-
tivariate 25-dimensional Student-t distribution X = (X1, . . .,X25) with zero mean vector and correlation
matrix

$P_X = \begin{bmatrix} P_1 & 0 & \cdots & 0 \\ 0 & P_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & P_5 \end{bmatrix}$

where 0 is a (5 × 5) matrix of zeros and $P_1 = \cdots = P_5 = (1 - \rho)I_5 + \rho J_5$, with $I_5$ denoting the five-dimensional identity matrix and $J_5$ the (5 × 5) matrix of ones. In detail, the 25 variables are divided into five blocks. Within each block


the variables are mutually correlated with correlation coefficient ρ; variables belonging to different blocks are uncorrelated. This data-generating process emulates the existence of five different markets composed of assets associated with each other, but not with assets belonging to other markets. Since, for a multivariate Student-t distribution with ν degrees of freedom, the correlation between two marginal distributions directly reflects on their lower tail dependence, in this situation a good portfolio should contain assets from different markets. We suppose we desire a k-set composed of k = 5 variables. We set X1 as the main reference variable, thus selecting k − 1 = 4 further variables by means of the proposed heuristic procedure with Lt = 10. Fixing ν = 5, the procedure is repeated r = 50 times for different values of ρ, ranging from 0.05 to 0.9. We consider 'correct' a k-set composed of one variable per block. The rate of correct k-sets rapidly increases when the association among variables becomes stronger (Figure 1, left). The simulation has also been carried out with different degrees of freedom (ν = 15 and ν = 30), with the same results.

Figure 1. Left: rate of correct k-sets versus correlation coefficient (Simulation 1). Right: rate of correctly selected variables versus k (Simulation 2)
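A minimal sketch of this data-generating process, assuming the mvtnorm package (our choice; the paper does not report its simulation code):

```r
library(mvtnorm)

rho <- 0.5
P   <- (1 - rho) * diag(5) + rho * matrix(1, 5, 5)  # equicorrelated 5 x 5 block
PX  <- kronecker(diag(5), P)                        # block-diagonal 25 x 25 matrix
X   <- rmvt(1000, sigma = PX, df = 5)               # N = 1000 draws, nu = 5
```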

Simulation 2. In the second simulation study, N = 1000 observations are randomly drawn from a
multivariate 15-dimensional Student-t distribution X = (X1, . . .,X15) with zero mean vector and cor-
relation matrix

$P_X = \begin{bmatrix} P_1 & P_{12} & P_{13} \\ P_{21} & P_2 & P_{23} \\ P_{31} & P_{32} & P_3 \end{bmatrix}$

where $P_1 = P_2 = P_3 = (1 - 0.6)I_5 + 0.6 J_5$ and $P_{12} = P_{13} = P_{23} = P_{21}' = P_{31}' = P_{32}'$ are (5 × 5) matrices obtained by multiplying the column vector (0, 0.1, 0.2, 0.3, 0.4)' by the five-dimensional row vector of ones. Thus, the 15 variables are divided into three blocks. Each variable has a moderately
high correlation (ρ = 0.6) with variables belonging to the same block, but also has an increasing cor-
relation with the variables belonging to different blocks (ρ = 0 with the first variable of each block,
ρ = 0.1 with the second, ρ = 0.2 with the third, ρ = 0.3 with the fourth, ρ = 0.4 with the last). In this

Table I. Correct selections in Simulation 2 for different values of k

k    Correct selections
3    s3 = (X1, X6, X11)
4    s41 = (s3, X2)     s42 = (s3, X7)      s43 = (s3, X12)
5    s51 = (s41, X7)    s52 = (s42, X12)    s53 = (s43, X2)
6    s6 = (s3, X2, X7, X12)
7    s71 = (s6, X3)     s72 = (s6, X8)      s73 = (s6, X13)
8    s81 = (s71, X8)    s82 = (s72, X13)    s83 = (s73, X3)
9    s9 = (s6, X3, X8, X13)
10   s101 = (s9, X4)    s102 = (s9, X9)     s103 = (s9, X14)
11   s111 = (s101, X9)  s112 = (s102, X14)  s113 = (s103, X4)
12   s12 = (s9, X4, X9, X14)

situation, the selection of the k-set is more challenging. If we set X1 as the main reference variable,
then the algorithm should proceed selecting first one of the first variables of each block, then one of
the second ones, then one of the third ones, and so on, until the remaining k − 1 variables are selected
(Table I).
The procedure is repeated r = 50 times with Lt = 10, for different values of k, from 3 to 12. For each repetition, let ncv be the number of correct variables in the selected k-set; the average rate of correctly selected variables

$\mathrm{CSV} = \mathrm{av}_r \left\{ \dfrac{ncv - 1}{k - 1} \right\}$

ranges from 0.644 to 0.89 (Figure 1, right).

5. APPLICATION OF THE HEURISTIC PROCEDURE TO MSCI INDICES

In this section we apply the proposed algorithmic procedure to the dataset described in Section 3.1, in order to select a k-set with minimum lower tail dependence, choosing the smallest value of k ensuring that coefficient (6) is lower than 0.005. As pointed out in Section 3.1, we first filter autocorrelation and heteroskedasticity from the data, by means of univariate Student-t AR-GARCH models applied to the log-returns. Then the procedure is carried out on the standardized residuals, setting Lt = 10 and with MSCI-Italy as the main reference asset.
We run the procedure for k = 2, k = 3, . . ., each time computing coefficient (6) and stopping when it is lower than the chosen threshold.⁴ The step-by-step results are as follows:

• k = 2
  Procedure (A) is applied with MSCI-Italy as reference asset, and the vector $\bar I_L$, which we call $\bar I_L^{(1)}$ because it is computed at the first iteration, is obtained (Figure 2, top left). MSCI-Japan is selected. The first k-set (k = 2) is then {It, Ja} (consistent with the results obtained in Section 3.1), with $\lambda_L^{It|Ja}$ = 0.0201. This value is greater than the given threshold of 0.005, so we have to increase k.

⁴ The computations of the asset selection algorithm are carried out using the randomForest library for R (R Development Core Team, 2006). The R script is available on e-mail request to the corresponding author.


Figure 2. Bar graphs of the standardized variable importance measures $\bar I_{sL}^{(w)}$ for w = 1, 2, 3, 4 and s = 1, 2, . . . , 23 − w. Panels: Iteration 1, Italy vs. remaining indexes (top left); Iteration 2, Japan vs. remaining indexes (top right); Iteration 3, USA vs. remaining indexes (bottom left); Iteration 4, Greece vs. remaining indexes (bottom right)

• k = 3
  Since the algorithmic procedure is hierarchical, the first step is the same as for k = 2. After the selection of MSCI-Japan, MSCI-Italy is removed from the dataset, procedure (A) is applied with MSCI-Japan as reference asset and the vector $\bar I_L^{(2)}$ is obtained (Figure 2, top right). For each index in the dataset the values in $\bar I_L^{(1)}$ and $\bar I_L^{(2)}$ are summed (Figure 3, top right). MSCI-USA (exhibiting the minimum summed value) is selected. The second k-set (k = 3) is then {It, Ja, US} (again, consistent with the results obtained in Section 3.1), with $\lambda_L^{It,Ja|US}$ = 0.0061. We have to increase k.

Figure 3. Bar graphs of the summed standardized variable importance measures $\sum_{i=1}^{w} \bar I_{sL}^{(i)}$ for w = 1, 2, 3, 4 and s = 1, 2, . . . , 23 − w. Panels: Italy (top left); Italy, Japan (top right); Italy, Japan, USA (bottom left); Italy, Japan, USA, Greece (bottom right)

• k = 4
  Again, the first two steps are the same as for k = 3. After the selection of MSCI-USA, MSCI-Japan is removed from the dataset, procedure (A) is applied with MSCI-USA as reference asset and the vector $\bar I_L^{(3)}$ is obtained (Figure 2, bottom left). For each index in the dataset, the values in $\bar I_L^{(1)}$, $\bar I_L^{(2)}$ and $\bar I_L^{(3)}$ are summed (Figure 3, bottom left). MSCI-Greece (exhibiting the minimum summed value) is selected. The third k-set (k = 4) is then {It, Ja, US, Gr}, with $\lambda_L^{It,Ja,US|Gr}$ = 0.0055. We have to increase k.

• k = 5
  The first three steps are the same as for k = 4. After the selection of MSCI-Greece, MSCI-USA is removed from the dataset, procedure (A) is applied with MSCI-Greece as reference asset and the vector $\bar I_L^{(4)}$ is obtained (Figure 2, bottom right). For each index in the dataset, the values in $\bar I_L^{(1)}$, $\bar I_L^{(2)}$, $\bar I_L^{(3)}$ and $\bar I_L^{(4)}$ are summed (Figure 3, bottom right). MSCI-New Zealand (exhibiting the minimum summed value) is selected, with $\lambda_L^{It,Ja,US,Gr|NZ}$ = 0.0015. The desired threshold has been reached, so the final k-set (k = 5) is given by {It, Ja, US, Gr, NZ}.

Tables II–V show details about the estimation of the parameters of the Joe–Clayton copulae at each step of the selection.

Table II. Estimates of the parameters of the Joe–Clayton copula in the bivariate case (Italy, Japan)

Parameter              Estimate   St. error
θ                      0.1773     0.0289
κ                      1.0757     0.0261
CvM                    0.1126
$\lambda_L^{It|Ja}$    0.0201

Table III. Estimates of the parameters of the Joe–Clayton copula in the trivariate case (Italy, Japan, USA)

Parameter               Estimate   St. error
θ                       0.2155     0.0191
κ                       1.0857     0.0192
CvM                     0.4222
$\lambda_L^{It,Ja|US}$  0.0061

Table IV. Estimates of the parameters of the Joe–Clayton copula in the quadrivariate case (Italy, Japan, USA, Greece)

Parameter                  Estimate   St. error
θ                          0.2664     0.0159
κ                          1.1007     0.0168
CvM                        0.7393
$\lambda_L^{It,Ja,US|Gr}$  0.0055


Table V. Estimates of the parameters of the Joe–Clayton copula in the five-variate case (Italy, Japan, USA, Greece, New Zealand)

Parameter                     Estimate   St. error
θ                             0.2486     0.0130
κ                             1.0912     0.0145
CvM                           0.8593
$\lambda_L^{It,Ja,US,Gr|NZ}$  0.0015

In order to check the goodness of fit of the estimated copulae, extending Dobric and Schmid (2006), we have considered the general null hypothesis that the multivariate dataset can be described by a specified copula,

$H_0: (X_1, X_2, \ldots, X_n) \text{ has copula } C$

through the auxiliary hypothesis

$H_0^*: S(X_1, X_2, \ldots, X_n) \sim \chi^2_n$

where

$S(X_1, X_2, \ldots, X_n) = [\Phi^{-1}(F_1(X_1))]^2 + [\Phi^{-1}(C(F_2(X_2) \mid F_1(X_1)))]^2 + \cdots + [\Phi^{-1}(C(F_n(X_n) \mid F_1(X_1), F_2(X_2), \ldots, F_{n-1}(X_{n-1})))]^2$

with $\Phi$ here denoting the standard normal distribution function and the conditional distributions obtained from C via Rosenblatt's transformation.

This hypothesis can be tested using one of the goodness-of-fit tests popularized in the literature, such as the Cramer–von Mises (CvM) test. In almost all the cases, the hypothesis is accepted at the 1% significance level (critical value: 0.743). Only in the last case (n = 5) is the test statistic above the critical value.

5.1. An Example of Portfolio Selection


In this section we show an example of how the proposed asset selection procedure can be employed in a simple portfolio selection problem. We use the MSCI data, focusing attention on the financial crisis period that occurred in the second half of 2008. The asset selection procedure described in Section 5 is carried out on daily data from 3 June 2002 to 30 May 2008 (1564 observations) and the selected k-set (k = 5) is {It, Ja, US, Gr, NZ}, the same as that obtained using the full dataset (until 10 June 2010).
Using the popular Markowitz portfolio selection procedure, we compute the efficient frontier of this set of indices and compare it with those of 100 randomly selected k-sets containing the main reference asset MSCI-Italy. It is interesting to note that the portfolios reaching the lowest risk levels are obtained using the selected k-set (Figure 4, left). Thus, in this example, the asset selection procedure aimed at minimizing the lower tail dependence also provides a k-set that minimizes the traditional risk measure given by variance.
From a financial crisis perspective, we choose the portfolio characterized by the minimum variance and compare its returns in the crisis period from 2 June 2008 to 31 December 2008 with the corresponding returns obtained with the minimum variance portfolios of the 100 above-mentioned randomly selected k-sets (Figure 4, right). The selected portfolio outperforms most of the competitors during the crisis period.
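A minimal sketch of the minimum variance portfolio computation, assuming the quadprog package (our simplified illustration, not necessarily the routine used for Figure 4); R is a matrix of returns for the selected assets and short selling is ruled out.

```r
library(quadprog)

min_variance_weights <- function(R) {
  Sigma <- cov(R)
  n <- ncol(R)
  sol <- solve.QP(Dmat = 2 * Sigma, dvec = rep(0, n),
                  Amat = cbind(rep(1, n), diag(n)),  # sum(w) = 1 and w >= 0
                  bvec = c(1, rep(0, n)), meq = 1)   # first constraint is an equality
  sol$solution                                       # minimum variance weights
}
```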

Figure 4. Left: efficient frontier of the selected set of assets together with the efficient frontiers of 100 randomly selected five-dimensional sets of assets including the reference asset. Right: returns of the minimum variance portfolio for the selected assets against returns of the 100 minimum variance competitor portfolios in the second half of 2008

6. CONCLUDING REMARKS

This paper deals with the problem of selecting a subset of financial assets out of a large set of invest-
ment alternatives. The aim driving the selection is to choose assets having either low association in
the lower tail or high association in the upper tail, according to the investment strategy. We mainly
inspected the former case, assuming a financial crisis perspective for the empirical analysis.
Copula functions are powerful tools for estimating tail dependence coefficients. Nevertheless, we have found that their use for the above-described selection problem is unfeasible owing to an excessive computational burden. For this reason we have explored the possibility of building a heuristic procedure making use of algorithmic tools widely used in the field of data mining. The proposed selection procedure has first been checked with two simulation studies, then applied to real data on MSCI indices, in order to select a subset of developed markets to invest in. In the lower tail, we focused our attention on the chain effect risk; that is, the probability of a default of the whole set of indices, given the default of one of them.
The proposed procedure is heuristic and merely descriptive, but it can be applied in many empirical contexts. With large datasets it can provide a good preliminary inspection of the relationships among variables in the tails of their joint distributions. This makes possible a sort of dimensionality reduction, which allows one to employ more efficient estimation tools, like a copula function.
A number of developments in this direction are possible.
First, the procedure can be generalized by avoiding the preliminary definition of a reference asset. Actually, the idea of choosing a reference asset can be justified in many empirical analyses, when there effectively exists an asset we necessarily want to include in the portfolio, or when we want to add new investments to an existing one. Nonetheless, in some cases it can be limiting. The problem can easily be overcome by running the procedure as many times as the number p of variables in the dataset, each time choosing a different main reference asset. At the end, p different k-sets are selected. Each one can be analysed in depth by means of copula functions and finally the best one can be chosen, according to a defined criterion.
Second, in the analysis presented, the chain effect is evaluated at the same time t. In other words,
the coefficient (6) measures the probability of a simultaneous default of all the indices. This is not so
limiting, as it is well known that a default in a developed market rapidly propagates to the others.
However, the inclusion of lagged variables in the dataset could probably improve the results, taking
account of some delay in the default and also of effects due to different time zones.
Third, other types of copula function can be used to estimate the tail dependence coefficients, using goodness-of-fit statistics to decide which one best reproduces the joint distribution of the data.
Finally, some reasoning could be undertaken about the distribution of tail dependence coefficient
estimates.

REFERENCES

Aas K, Berg D. 2009. Models for construction of multivariate dependence—a comparison study. The European
Journal of Finance 15: 639–659.
Bouyé E, Salmon M. 2009. Dynamic copula quantile regressions and tail area dynamic dependence in Forex
markets. The European Journal of Finance 15: 721–750.
Breiman L. 1996. Bagging predictors. Machine Learning 24: 123–140.
Breiman L. 2001. Random forests. Machine Learning 45: 5–32.
Breiman L. 2002. Manual on setting up, using, and understanding random forests v3.1. http://oz.berkeley.edu/users/breiman.
Breiman L, Friedman JH, Olshen RA, Stone CJ. 1984. Classification and Regression Trees. Chapman & Hall, New York.
Cherubini U, Luciano E, Vecchiato W. 2004. Copula Methods in Finance. John Wiley and Sons, Inc., New York.
De Luca G, Rivieccio G. 2010. Multivariate tail dependence coefficients for Archimedean copulae. Unpublished
research.
Dobric J, Schmid F. 2006. A goodness of fit test for copulas based on Rosenblatt’s transformation. Computational
Statistics and Data Analysis 51: 4633–4642.
Engle RF. 2002. Dynamic conditional correlation: a simple class of multivariate generalized autoregressive conditional heteroskedasticity models. Journal of Business and Economic Statistics 20: 339–350.
Fortin I, Kuzmics C. 2002. Tail-dependence in stock-returns pairs. International Journal of Intelligent Systems
in Accounting, Finance & Management 11: 89–107.
Friedman JH. 2001. Greedy function approximation: a gradient boosting machine. Annals of Statistics 29:
1189–1232.
Friedman JH, Popescu BE. 2005. Predictive learning via rule ensembles. Technical report, Stanford University.
Genest C, Gendron M, Bourdeau-Brien M. 2009. The advent of copulas in finance. The European Journal of Finance 15: 609–618.
Joe H. 1997. Multivariate Models and Dependence Concepts. Chapman & Hall, New York.
Jondeau E, Rockinger M. 2006. The copula–GARCH model of conditional dependencies: an international stock
market application. Journal of International Money and Finance 25: 827–853.
McNeil AJ. 1999. Extreme value theory for risk managers. In Internal Modelling and CAD II: Qualifying and
Quantifying Risk within a Financial Institution. RISK Books: London; 93–113.
Nelsen RB. 2006. An Introduction to Copulas. Springer-Verlag, New York.
Palaro HP, Hotta LK. 2006. Using conditional copula to estimate value at risk. Journal of Data Science 4: 93–115.
Patton A. 2006. Modelling asymmetric exchange rate dependence. International Economic Review 47: 527–556.
R Development Core Team. 2006. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna.
Sklar A. 1959. Fonctions de répartition à n dimensions et leurs marges. Publications de l’Institut de Statistique
de l’Université de Paris 8: 229–231.

