Article
Super RaSE: Super Random Subspace Ensemble Classification
Jianan Zhu and Yang Feng *
Department of Biostatistics, School of Global Public Health, New York University, New York, NY 10003, USA;
[email protected]
* Correspondence: [email protected]
Abstract: We propose a new ensemble classification algorithm, named super random subspace ensemble
(Super RaSE), to tackle the sparse classification problem. The proposed algorithm is motivated by
the random subspace ensemble algorithm (RaSE). The RaSE method was shown to be a flexible
framework that can be coupled with any existing base classifier. However, the success of RaSE
largely depends on the proper choice of the base classifier, which is unfortunately unknown to us. In
this work, we show that Super RaSE avoids the need to choose a base classifier by randomly sampling
a collection of classifiers together with the subspace. As a result, Super RaSE is more flexible and
robust than RaSE. In addition to the vanilla Super RaSE, we also develop the iterative Super RaSE,
which adaptively changes the base classifier distribution as well as the subspace distribution. We
show that the Super RaSE algorithm and its iterative version perform competitively for a wide range
of simulated data sets and two real data examples. The new Super RaSE algorithm and its iterative
version are implemented in a new version of the R package RaSEn.
Keywords: classification; ensemble; subspace; sparsity; feature ranking
the decision boundary nonlinear. Due to this, QDA is also capable of capturing interaction
effects (Tian and Feng 2021a). Another popular classification method is the so-called
logistic regression, which is a special case of generalized linear models. Different from
LDA and QDA, logistic regression directly models the conditional distribution of y, given
the features x, and it is usually viewed as a more robust classifier than LDA. Support vector
machine (SVM) is another type of classifier, which seeks the optimal separating hyperplane
and has been shown to outperform many classic classifiers when there are many features (Steinwart
and Christmann 2008). Breiman et al. (2017) developed classification and regression trees
(CART), which is a purely non-parametric classification method and is applicable to various
distributions for the features. Another popular non-parametric classification method is the
K nearest neighbor (KNN) proposed in Fix and Hodges (1989). KNN works by a majority
vote from the labels of K training observations that are nearest to the new observation to
be classified.
Although the classifiers described in the previous paragraph have been shown to
work well in many applications, researchers have realized that a single application of
these classifiers could result in unstable classification performance (Dietterich 2000). To
address this issue, there has been a surge of research interest in ensemble learning methods.
Ensemble learning is a general machine learning framework, which combines multiple
learning algorithms to obtain better prediction performance and greater stability than
any single algorithm (Dietterich 2000; Rokach 2010). Some popular examples include
bagging (Breiman 1996) and random forests (Breiman 2001), which aggregate a collection
of weak learners formed by decision trees. More recent ensemble learning methods include
the random subspace method (Ho 1998), super learner (Van der Laan et al. 2007), model
averaging (Feng and Liu 2020; Feng et al. 2021a; Raftery et al. 1997), random rotation
(Blaser and Fryzlewicz 2016), random projection (Cannings and Samworth 2017; Durrant
and Kabán 2015), and random subspace ensemble classification (Tian and Feng 2021a,
2021b).
This paper is largely motivated by the random subspace ensemble (RaSE) classification
framework (Tian and Feng 2021b), which we will briefly review. Suppose we want to
predict the class label y from the feature vector x using n training observations {( xi , yi ), i =
1, . . . , n}. For a given base classifier (e.g., logistic regression), the RaSE algorithm aims
to construct B1 weak learners, where each weak learner is formed by applying the specified
base classifier on a properly chosen subspace, which is a subset of the whole feature space
represented as {1, · · · , p}. Taking logistic regression as an example, if the subspace contains
the variables {1, 3, 9}, then we will fit a logistic regression model, but only using the 1st,
3rd, and 9th variables. To choose each subspace, B2 random subspaces are generated
according to a hierarchical uniform distribution and the optimal one is selected according
to certain criteria (e.g., cross-validation error). The main idea is that among all the B2
candidate subspaces, the one with the best criterion value tends to be of the highest
quality, which improves the performance of the final ensemble classifier. In the end, the
predicted labels from the B1 weak learners are averaged and compared to a data-driven
threshold, forming the final classifier. Tian and Feng (2021b) also proposed an iterative
version of RaSE which updates the random subspace distribution according to the selected
proportions of each feature.
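To make the subspace idea concrete, the following short R sketch (not the RaSEn implementation; the simulated data, the helper fit_on_subspace, and all object names are made up for illustration) fits one such weak learner: a logistic regression restricted to the subspace {1, 3, 9} mentioned above.

# Minimal sketch: fit one RaSE-style weak learner, i.e., a base classifier
# restricted to a candidate subspace S (toy data, not from the paper).
set.seed(1)
n <- 200; p <- 50
x <- matrix(rnorm(n * p), n, p)
y <- rbinom(n, 1, plogis(x[, 1] - x[, 3] + 0.5 * x[, 9]))

# Fit logistic regression using only the features in subspace S
fit_on_subspace <- function(S, x, y) {
  dat <- data.frame(y = y, x[, S, drop = FALSE])
  glm(y ~ ., family = binomial(), data = dat)
}

S <- c(1, 3, 9)                        # the illustrative subspace from the text
fit <- fit_on_subspace(S, x, y)
head(predict(fit, type = "response"))  # class-1 probabilities of this weak learner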
Powerful though the RaSE algorithm and its iterative versions are, one major limitation
is that one needs to specify a single base classifier prior to using the RaSE framework. The
success of the RaSE algorithms largely depends on whether the base classifier is suitable for
the application scenario. As shown in the numerical experiments in Tian and Feng (2021b),
the RaSE algorithm could fail to work well if the base classifier is not properly set. In
this regard, blindly applying the RaSE algorithm has the risk of wrongly setting the base
classifier, leading to poor performance.
The aim of this research is to address the limitation of RaSE, where only a single base
classifier can be used. In particular, we aim to replace the single base classifier with a
collection of base classifiers, and also develop a new framework that can adaptively choose
the appropriate base classifier for a given data set. We call the new ensemble classification
framework the super random subspace ensemble (Super RaSE).
The working mechanism of Super RaSE is that in addition to randomly generating
the subspaces, it also generates the base classifiers to be used together with the subspaces.
More specifically, instead of fixing a base classifier and using it for all subspaces, each time,
the Super RaSE randomly generates the base classifier (from a collection of base classifiers)
and the subspace as a pair, and then picks the best performing pair among the B2 ones via
five-fold cross validation to form one of the B1 weak learners. Then, the predictions of the
B1 weak learners are averaged and compared to a data-driven threshold, forming the final
prediction of Super RaSE.
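The following R sketch illustrates this pairing mechanism under simplifying assumptions; it is not the RaSEn implementation. The candidate collection (logistic regression and KNN), the helper cv_error, the choice of k = 5 neighbors, and the small values of B1 and B2 are ours for illustration only.

# Illustrative sketch of the Super RaSE pairing step: for each of the B1 weak
# learners, draw B2 (base classifier, subspace) pairs, score each pair by
# five-fold cross-validation error, and keep the best pair.
library(class)                      # for knn()

set.seed(1)
n <- 200; p <- 50; D <- 10          # D = upper bound on subspace size
x <- matrix(rnorm(n * p), n, p)
y <- rbinom(n, 1, plogis(x[, 1] - x[, 2] + x[, 5]))
bases <- c("logistic", "knn")       # small candidate collection for illustration

cv_error <- function(base, S, x, y, K = 5) {
  folds <- sample(rep(1:K, length.out = length(y)))
  errs <- sapply(1:K, function(k) {
    tr <- folds != k; te <- !tr
    if (base == "logistic") {
      dat <- data.frame(y = y, x[, S, drop = FALSE])
      fit <- glm(y ~ ., family = binomial(), data = dat[tr, ])
      pred <- as.numeric(predict(fit, newdata = dat[te, ], type = "response") > 0.5)
    } else {
      pred <- as.numeric(as.character(
        knn(x[tr, S, drop = FALSE], x[te, S, drop = FALSE], factor(y[tr]), k = 5)))
    }
    mean(pred != y[te])
  })
  mean(errs)
}

B1 <- 20; B2 <- 50                  # small values for illustration only
weak_learners <- lapply(1:B1, function(j) {
  cand <- lapply(1:B2, function(k) {
    d <- sample(1:D, 1)                       # hierarchical uniform subspace
    list(base = sample(bases, 1), S = sort(sample(1:p, d)))
  })
  errs <- sapply(cand, function(ck) cv_error(ck$base, ck$S, x, y))
  cand[[which.min(errs)]]                     # best (classifier, subspace) pair
})
table(sapply(weak_learners, `[[`, "base"))    # how often each base classifier won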
The main contribution of the paper is three-fold. First, the Super RaSE algorithm
adaptively chooses the base classifier and subspace pair, which makes it a fully model-
free approach. Second, in addition to its accurate prediction performance, Super
RaSE computes the selected proportion of each base classifier among the B1 weak learners,
which indicates how appropriate each base classifier is for the specific scenario, and, for
each base classifier, the selected proportion of each feature, which measures the importance
of each feature. Third, we propose an iterative Super RaSE algorithm, which updates
the sampling distribution of base classifiers as well as the sampling distribution of the
subspaces for each base classifier.
The rest of the paper is organized as follows. In Section 2, we introduce the super
random subspace ensemble (SRaSE) classification algorithm as well as its iterative version.
Section 3 conducts extensive simulation studies to show the superior performance of
SRaSE and its iterative version by comparing them with competing methods, including the
original RaSE algorithm. In Section 4, we evaluate the SRaSE algorithms with two real data
sets and show that they perform competitively. Lastly, we conclude the paper with a short
discussion in Section 5.
2. Methods
Suppose that we have n pairs of observations {(x_i, y_i), i = 1, . . . , n} that are i.i.d. copies
of (x, y) ∈ R^p × {0, 1}, where p is the number of predictors and y ∈ {0, 1} is the class label.
We use S_Full = {1, · · · , p} to represent the whole feature set. We assume that the marginal densities
of x for class 0 (y = 0) and class 1 (y = 1) exist and are denoted by f^(0) and f^(1), respectively.
Thus, the joint distribution of (x, y) can be described by the following mixture model:

x ∼ π_0 f^(0) + π_1 f^(1),

where y is a Bernoulli variable with success probability π_1 = 1 − π_0 ∈ (0, 1). For any
subspace S, we use |S| to denote its cardinality. When restricting to the feature subspace
S, the corresponding marginal densities of class 0 and class 1 are denoted by f_S^(0) and f_S^(1),
respectively.
Here, we are concerned with a high-dimensional classification problem, where the
dimension p is comparable to or even larger than the sample size n. In high-dimensional
problems, we usually believe that only a handful of features contribute to the
response, which is usually referred to as the sparse classification problem. For sparse classifi-
cation problems, it is important to accurately separate the signals from the noise. Following
Tian and Feng (2021b), we introduce the definition of a discriminative set.
A feature subset S is called a discriminative set if y is conditionally independent of
x_{S^c} given x_S, where S^c = S_Full \ S. We call S a minimal discriminative set if it has minimal
cardinality among all discriminative sets, and we denote it by S∗.
Following Tian and Feng (2021b), by default, the subspace distribution D is chosen
as a hierarchical uniform distribution over the subspaces. In particular, with D as the upper
bound of the subspace size, we first generate the subspace size d from the discrete uniform
distribution over {1, · · · , D}. Then, the subspaces {S_{jk}, j = 1, · · · , M, k = 1, · · · , p} are
independent and follow the uniform distribution over the set {S ⊆ SFull : |S| = d}. In
addition, in Step 7 of Algorithm 1, we choose the decision threshold to minimize the
empirical classification error on the training set,
α̂ = arg min_{α ∈ [0,1]} [ π̂_0 (1 − Ĝ_n^(0)(α)) + π̂_1 Ĝ_n^(1)(α) ],

where

n_r = ∑_{i=1}^{n} 1(y_i = r),   r = 0, 1,

π̂_r = n_r / n,   r = 0, 1,

Ĝ_n^(r)(α) = (1/n_r) ∑_{i=1}^{n} 1(y_i = r) 1(ν_n(x_i) ≤ α),   r = 0, 1,

and

ν_n(x) = B_1^{-1} ∑_{j=1}^{B_1} C_n^{T_{j∗}-S_{j∗}}(x).
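As an illustration of this thresholding step, the R sketch below (a simplified stand-alone function, not the RaSEn code; the score vector nu and response y are hypothetical toy inputs) computes α̂ by evaluating the empirical error at the observed ensemble scores.

# Sketch of the data-driven threshold: minimize the empirical error
# pi0_hat * (1 - G0(alpha)) + pi1_hat * G1(alpha) over candidate thresholds
# alpha (here, the observed ensemble scores).
choose_threshold <- function(nu, y) {
  pi0 <- mean(y == 0); pi1 <- mean(y == 1)
  alphas <- sort(unique(c(0, nu, 1)))
  err <- sapply(alphas, function(a) {
    G0 <- mean(nu[y == 0] <= a)     # empirical cdf of scores in class 0
    G1 <- mean(nu[y == 1] <= a)     # empirical cdf of scores in class 1
    pi0 * (1 - G0) + pi1 * G1
  })
  alphas[which.min(err)]
}

# Toy usage with a hypothetical score vector nu in [0, 1]:
set.seed(1)
y  <- rbinom(100, 1, 0.4)
nu <- pmin(pmax(rnorm(100, mean = 0.3 + 0.4 * y, sd = 0.2), 0), 1)
alpha_hat <- choose_threshold(nu, y)
pred <- as.numeric(nu > alpha_hat)  # final thresholded prediction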
In Algorithm 1, there are two important by-products. The first one is the selected
proportion of each method, ζ = (ζ_1, · · · , ζ_M)^T, out of the B1 weak learners. The higher the
proportion for a method (e.g., KNN), the more appropriate it may be for the particular data set. In
numerical studies and real data analyses, we provide more interpretations of the results.
Now, let us introduce the second by-product of Algorithm 1. For each method T_i,
i = 1, · · · , M, we have the selected proportion of each feature, η_i = (η_{i1}, · · · , η_{ip})^T. The
feature selection proportion depends on the particular base method. The underlying reason
is that when we use different base methods on the same data, different signals may be found.
For example, if some predictors contribute to the response only through an interaction
effect with other predictors, they may be detected using the quadratic discriminant analysis
(QDA) method; however, they cannot be identified using the linear discriminant analysis
(LDA) method since the LDA only considers the linear effects of features. These base-
classifier-dependent feature selection frequencies will produce a better understanding of
the working mechanism of each base classifier as well as the nature of each feature's
importance.
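Both by-products are simple selection frequencies over the B1 chosen pairs. Continuing with the hypothetical weak_learners list, bases vector, and dimension p from the earlier sketch, they could be computed as follows.

# Sketch of the two by-products, computed from the B1 selected pairs
# (weak_learners, bases, and p come from the earlier sketch).
methods <- sapply(weak_learners, `[[`, "base")

# zeta: selected proportion of each base classifier among the B1 weak learners
zeta <- prop.table(table(factor(methods, levels = bases)))

# eta: for each base classifier, the selected proportion of each feature
eta <- sapply(bases, function(b) {
  keep <- weak_learners[methods == b]
  if (length(keep) == 0) return(rep(0, p))
  tabulate(unlist(lapply(keep, `[[`, "S")), nbins = p) / length(keep)
})
rownames(eta) <- paste0("feature", 1:p)   # e.g., eta[, "knn"] ranks features under KNN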
In the iterative Super RaSE algorithm, the base classifier distribution is initially set to
be D^(0), which is a uniform distribution over all base classifiers by default. As the iteration
proceeds, the base classifiers that are more frequently selected will have a higher chance
of being selected in the next step, resulting in a different D^(t). The adaptive nature of the
iterative Super RaSE algorithm enables us to discover the best performing base methods
for each data set and in turn, reduce its classification error.
Besides the base classifier distribution, the subspace distribution is also continuously
updated during the iteration process. In our implementation of the algorithm, the initial
subspace distribution for each base classifier, D_i^(0), is the hierarchical uniform distribution
introduced in Section 2.1. After running the Super RaSE algorithm once, the features
that are more frequently selected are given higher weights in D_i^(1). Through this mechanism, we
give an edge to the useful features, which could further boost performance. In addition,
for each given base classifier, the selected frequencies of each feature can be viewed as an
importance measure of the features.
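A minimal sketch of one such update, continuing the objects zeta, eta, bases, D, and p from the earlier sketches: the selection frequencies are turned into sampling weights, with a small floor (our choice, not from Algorithm 2) so that no base classifier or feature is excluded outright.

# Sketch of one iteration update (our simplification, not the exact rule in
# Algorithm 2): re-weight the base classifier distribution and, for each base
# classifier, the feature sampling distribution by their selection frequencies.
eps <- 0.01                                     # small floor, our choice
classifier_weights <- (zeta + eps) / sum(zeta + eps)
feature_weights <- apply(eta, 2, function(freq) (freq + eps) / sum(freq + eps))

# In the next round, a (classifier, subspace) pair could then be drawn as:
b <- sample(bases, 1, prob = classifier_weights)
d <- sample(1:D, 1)
S <- sort(sample(1:p, d, prob = feature_weights[, b]))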
3. Simulation Studies
In this section, we conduct extensive simulation studies on the proposed Super RaSE
algorithm (Algorithm 1) and its iterative version (Algorithm 2) with candidate base clas-
sifiers set as T = { LDA, QDA, KNN }. In addition, we compare their performance with
several competing methods, including the original RaSE with LDA, QDA, and KNN as the
base classifier (Tian and Feng 2021b), as well as LDA, QDA, KNN, and random forest (RF).
We use the default values for all parameters in the Super RaSE algorithm and its iterative
version (B1 = 200, B2 = 500).
For all experiments, we conduct 200 replications and report the summary of test errors
(in percentage) in terms of the mean and standard deviation. We use boldface to highlight
the method with minimal test error for each setting, and use italics to highlight the methods
that achieve test errors within one standard deviation of the smallest error.
Table 1. Summary of test classification error rates for each classifier under various sample sizes
over 200 repetitions in Model 1 (LDA). The results are presented as mean values with the standard
deviation values in parentheses. Here, the best performing method in each column is highlighted in
bold and the methods that are within one standard deviation away are highlighted in italic.
As we can see, RaSE1 -KNN performs the best when the sample size n = 200, and
RaSE2 -QDA performs the best for n = 400 and 1000. It is worth noting that the results for
LDA and QDA are NA, due to the small sample size compared with the dimension. By
inspecting the performance of the proposed Super RaSE algorithm and its iterative version,
we can see that although they are not the best performing method, both SRaSE1 and SRaSE2
are within one standard error of the best performing method, showing the robustness of
Super RaSE. Clearly, one iteration helps Super RaSE to have a lower test classification error.
In addition to the test classification error, it is useful to investigate the two by-products
of our algorithms, namely the selected proportion of each base classifier among the B1
classifiers, and the selected proportion of each feature among the weak learners that use a
particular classifier.
Let us take a look at Figure 1. The first row shows the bar charts for the selection
percentage of each base classifier in the Super RaSE algorithm among the 200 repetitions
when the sample size n varies in {200, 400, 1000}. It shows that when n = 200, the
percentage of LDA is around 50%. As the sample size increases, the percentage of LDA
also increases, showing that having a larger sample size helps us to select the model from
which the data are generated.
Now, let us look at the column of n = 1000; we can see that as the iteration process
moves on, the percentage of LDA is increasing as well, leading to almost 100% for SRaSE2 .
The second by-product of the Super RaSE algorithm and its iterative version is the selected
frequency of each feature among the weak learners that use a particular base classifier. Figure 2
visualizes the selected proportions of features among all the B1 classifiers that use LDA as
the base classifier. In particular, we show the selected proportions for each feature in the
minimal discriminative set S∗ = {1, 2, 5}. In the same figure, we also show a boxplot of
the selected proportion of all the noisy features as a way to verify whether the Super RaSE
algorithms can distinguish the important features from the noisy features.
From Figure 2, we observe that when n = 200, the vanilla Super RaSE algorithm
does not select the important variables 2 and 5 with a high percentage. However, the
iteration greatly helps the algorithm to increase the selected percentages for features 2 and
5. In addition, increasing the sample size leads to all important features being selected
with a percentage of almost 100%. It is also worth noting that the noise features all have a
relatively small selection frequency, showing the power of feature ranking in the Super
RaSE algorithms. Similar figures can also be generated for the selected proportions of
features among the B1 classifiers that use QDA and KNN as the base classifier, respectively.
For simplicity, we omit these figures in our presentation.
Figure 1. The average selected proportion for each base method for different sample sizes (corre-
sponding to each column) and iteration number (corresponding to each row) in Model 1 (LDA).
Figure 2. The average selected proportion for each feature for different sample sizes (corresponding
to each column) and iteration number (corresponding to each row) in Model 1 (LDA).
ric matrix with Ω_{10,10} = −0.3758, Ω_{10,30} = 0.0616, Ω_{10,50} = 0.2037, Ω_{30,30} = −0.5482,
Ω_{30,50} = 0.0286, Ω_{50,50} = −0.4614, and all other entries zero. The dimension is p = 200 and
we again vary the sample size n ∈ {200, 400, 1000}.
We can easily verify that the feature subset {1, 2, 10, 30, 50} is the minimal discrimina-
tive set S∗ . In Table 2, we present the performance of different methods for Model 2 under
different sample sizes.
Table 2. Summary of test classification error rates for each classifier under various sample sizes
over 200 repetitions in Model 2 (QDA). The results are presented as mean values with the standard
deviation values in parentheses. Here, the best performing method in each column is highlighted in
bold and the methods that are within one standard deviation away are highlighted in italic.
From the table, when n = 400 and n = 1000, SRaSE2 is the best performing method,
achieving the smallest test classification error. When n = 200, SRaSE2 is within one
standard error of the best performing method. This result is very encouraging in the sense
that Super RaSE can improve upon the original RaSE even when the latter is coupled with the true
model from which the data are generated. This shows that the Super RaSE algorithms are
extremely robust and avoid the need to choose a base classifier as needed in the original
RaSE algorithm.
As in Model 1, we again present the average selected proportion of each base classifier
in Figure 3 as well as the average selected proportion of each feature among the B1 weak
learners with the chosen base classifier being QDA in Figure 4. In Figure 4, we also show a
boxplot of the selected proportion of all the noisy features.
In Figure 3, we again observe that as we iterate the Super RaSE algorithm, the propor-
tion of QDA greatly increases, reaching almost 100 percent for SRaSE2 over all sample sizes.
This shows that the iterative Super RaSE algorithm is able to identify the true model.
In Figure 4, we can observe that the average selected proportions of each important
feature are pretty high, showing that the Super RaSE is able to pick up the important
features. In particular, when n = 1000, the iteration helps all the features to have a higher
selected proportion, reaching nearly 100 percent for SRaSE2 .
Figure 3. The average selected proportion for each base method for different sample sizes (corre-
sponding to each column) and iteration number (corresponding to each row) in Model 2 (QDA).
Figure 4. The average selected proportion for each feature for different sample sizes (corresponding
to each column) and iteration number (corresponding to each row) in Model 2 (QDA).
generating a mixture of 10 Gaussian clusters that surround each of the {z1 , · · · , z10 } when
they are embedded in a p-dimensional space.
From the data generation process, the minimal discriminative set is S∗ = {1, 2, 3, 4, 5}.
We consider p = 200, and n ∈ {200, 400, 1000}. The summary of the test classification error
rates over 200 repetitions is presented in Table 3.
Table 3. Summary of test classification error rates for each classifier under various sample sizes
over 200 repetitions in Model 3 (KNN). The results are presented as mean values with the standard
deviation values in parentheses. Here, the best performing method in each column is highlighted in
bold and the methods that are within one standard deviation away are highlighted in italic.
From Table 3, we can see that Super RaSE and its iterative versions still have very com-
petitive performance compared with other methods. In particular, both SRaSE1 and SRaSE2
reach test errors that are within one standard deviation of the best performing method
RaSE2-KNN, which uses the knowledge of the data generation process. As model-free
procedures, the Super RaSE algorithm and its iterative versions are very appealing, achieving
performance similar to that of the best-performing method without the need to specify which base
classifier to use.
Similar to Models 1 and 2, we again present the average selected proportion of each
base method in Figure 5. Figure 6 visualizes the selected proportions of features among all
the B1 classifiers that use KNN as the base classifier. In addition to the bar chart for the
average selected proportion of each important feature, Figure 6 also includes a boxplot of
the selected proportion of all the noisy features.
In Figure 5, we observe a similar story to that in Figures 1 and 3. The av-
erage selection percentage of KNN is almost 100 percent for both SRaSE1 and SRaSE2,
showing that one step of iteration is enough to almost always find the best classifier for this
particular model.
In Figure 6, we can see that without iteration, Super RaSE, on average, selects the
five important features only around 50 percent of the time over the 200 repetitions. With the help of
iterations, the percentages of all five important features increase substantially, to almost
100 percent when n = 1000. This experiment shows the merit of iterative Super RaSE in
terms of capturing the important features.
Figure 5. The average selected proportion for each base method for different sample sizes (corre-
sponding to each column) and iteration number (corresponding to each row) in Model 3 (KNN).
Figure 6. The average selected proportion for each feature for different sample sizes (corresponding
to each column) and iteration number (corresponding to each row) in Model 3 (KNN).
Table 4. Summary of test classification error rates for each classifier under various sample sizes over
200 repetitions for the mice protein expression data set. The results are presented as mean values
with the standard deviation values in parentheses. Here, the best performing method in each column
is highlighted in bold and the methods that are within one standard deviation away are highlighted
in italic.
Next, we show the average selected proportion for each base method for different
sample sizes and iteration numbers in Figure 7.
From this figure, we observe a very interesting phenomenon. That is, when n = 200,
the selected proportion of LDA is the highest (over 75%). This choice is very reasonable:
comparing the RaSE classifiers with different base classifiers in Table 4, RaSE1-LDA has the
best performance among the RaSE1 variants. Now, looking at the case
when n = 800, both Super RaSE and SRaSE1 select KNN over 95% of the time.
Again, looking at the RaSE classifier with a fixed base classifier, it is easy to observe
from Table 4 that RaSE1-KNN has a much better performance than both RaSE1-LDA and
RaSE1-QDA. This shows that the Super RaSE algorithm is very adaptive to the specific
scenario in the sense that, for each of the B1 weak learners, it automatically selects the best
pair among the B2 randomly generated base classifier and subspace pairs.
Figure 7. The average selected proportion for each base method for different sample sizes (corre-
sponding to each column) and iteration number (corresponding to each row) for the mice protein
expression data.
Table 5. Summary of test classification error rates for each classifier under various sample sizes over
200 repetitions for the hand-written digits recognition data set. The results are presented as mean
values with the standard deviation values in parentheses. Here, the best performing method in each
column is highlighted in bold and the methods that are within one standard deviation away are
highlighted in italic.
n = 50 n = 100 n = 200
SRaSE 2(1.06) 1.2(0.65) 0.78(0.48)
SRaSE1 1.78(1.03) 1.04(0.57) 0.62(0.37)
RaSE-LDA 1.56(0.85) 1.13(0.59) 0.8(0.54)
RaSE-QDA 2.5(1.47) 1.89(0.91) 1.47(0.96)
RaSE-KNN 1.86(0.96) 1.12(0.66) 0.75(0.45)
RaSE1 -LDA 1.06(0.63) 0.7(0.35) 0.53(0.40)
RaSE1 -QDA 2.18(1.66) 1.18(0.71) 0.85(0.61)
RaSE1 -KNN 1.72(0.95) 1.02(0.62) 0.6(0.44)
LDA NA 1.82(0.96) 1.01(0.56)
QDA NA NA 3.25(2.32)
KNN 1.42(1.32) 0.67(0.41) 0.6(0.47)
RF 2.34(1.24) 1.63(0.73) 1.37(0.74)
From Table 5, we can see that the Super RaSE and its iterative versions still have
competitive performance, especially when n = 100 and n = 200. SRaSE1 has a very similar
performance to the best performing method when n = 100 and n = 200. Taking a closer
look, it is interesting to observe that SRaSE1 closely mimics the performance of RaSE1-
KNN, leading us to wonder whether KNN is the most selected base method among the B1
classifiers. We confirm that this is the case by presenting the average selected proportion of
each base classifier for SRaSE and SRaSE1 in Figure 8.
From Figure 8, we can see that KNN is indeed the most selected base classifier across
all scenarios, with its selected proportion being almost 100% when n = 200.
Figure 8. The average selected proportion for each base method for different sample sizes (corre-
sponding to each column) and iteration number (corresponding to each row) for the hand-written
digits recognition data.
5. Discussion
In this work, motivated by the random subspace ensemble (RaSE) classification,
we propose a new ensemble classification framework, named Super RaSE, which is a
completely model-free approach. While RaSE has to be paired with a single base classifier,
Super RaSE can work with a collection of base classifiers, making it more flexible and
robust in applications. In particular, we recommend that practitioners include several
different base classifiers, both parametric and non-parametric ones, so as to make the
Super RaSE framework more powerful. It is worth noting that as we include more base
classifiers in the Super RaSE algorithm, larger values of B1 and B2 may be needed: a larger
B2 generates a sufficient number of candidate base classifier and subspace pairs, while a
larger B1 leads to a more stable ensemble. By randomly generating base classifier and
subspace pairs, Super RaSE automatically selects a collection of good
pairs. Besides the superb prediction performance on the test
data, the Super RaSE algorithms also provide two important by-products. The first one
is a quality measure of each base classifier, which can provide insight as to which base
classifiers are more appropriate for the given data. The second by-product is that, given
each base classifier, Super RaSE also generates an importance measure for each
feature, which can be used for feature screening and ranking (Fan et al. 2011; Saldana and
Feng 2018; Yousuf and Feng 2021), just like the original RaSE (Tian and Feng 2021a). The
feature importance measure can provide an interpretable machine learning model, which is
very useful in understanding how the method works.
There are many possible future research directions. First, this paper only considers
the binary classification problem, but there may be more than two classes in applications.
How to extend the Super RaSE algorithm to the multiclass situation is an important topic.
In particular, we need to find multiclass classification algorithms as well as a new decision
function when we ensemble the B1 weak learners, since the original thresholding
step in line 9 of Algorithm 1 is no longer applicable. Second, in addition to the classification
problem, it is also worthwhile to study the corresponding algorithm under a regression
setting. To achieve this, we can change the prediction step from thresholding to a simple
averaging. Third, using the selected proportion of features, it is possible to develop a
variable selection or variable screening algorithm. For example, we can set a threshold for
variable screening which is expected to capture important interaction effects in addition
to the marginal ones. The interaction detection problem has been well studied in high-
dimensional quadratic regression (Hao et al. 2018). However, Super RaSE could greatly
expand the scope within which interaction selection works, allowing the interactions
among features to take different forms. Lastly, in both RaSE and Super RaSE, we use
all the observations in each combination of the base classifier and a random subspace. We
know that random forest is viewed as a very robust classifier, partly due to the fact that
it uses a bagging sample (Breiman 1996) for each tree during the ensemble process. This
particular bagging step makes the trees less correlated, which reduces the variance of the final
classifier. A natural extension of Super RaSE is Bagging Super RaSE, which uses a bagging
sample for each of the B1 groups. It is worth noting that it may not be a good idea to use
a different bagging sample for each of the B2 pairs of a base classifier and an associated
random subspace, due to the fact that it could lead to an unfair comparison among the
B2 pairs.
Author Contributions: Conceptualization, Y.F.; methodology, J.Z. and Y.F.; software, J.Z. and Y.F.;
writing—original draft preparation, J.Z. and Y.F.; writing—review and editing, J.Z. and Y.F. All
authors have read and agreed to the published version of the manuscript.
Funding: Feng was partially supported by the NSF CAREER Grant DMS-2013789 and NIH Grant
1R21AG074205-01.
Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Not applicable.
Acknowledgments: The authors thank the editor Syed Ejaz Ahmed, the associate editor, and the
three reviewers for providing many constructive comments that greatly improved the quality and
scope of the paper.
Conflicts of Interest: The authors declare no conflict of interest.
References
Blaser, Rico, and Piotr Fryzlewicz. 2016. Random rotation ensembles. The Journal of Machine Learning Research 17: 126–51.
Breiman, Leo. 1996. Bagging predictors. Machine Learning 24: 123–40. [CrossRef]
Breiman, Leo. 2001. Random forests. Machine Learning 45: 5–32. [CrossRef]
Breiman, Leo, Jerome H. Friedman, Richard A. Olshen, and Charles J. Stone. 2017. Classification and Regression Trees. Boca Raton:
Routledge.
Cannings, Timothy I., and Richard J. Samworth. 2017. Random-projection ensemble classification. Journal of the Royal Statistical Society:
Series B (Statistical Methodology) 79: 959–1035. [CrossRef]
Dainotti, Alberto, Antonio Pescape, and Kimberly C. Claffy. 2012. Issues and future directions in traffic classification. IEEE Network 26:
35–40. [CrossRef]
Dietterich, Thomas G. 2000. Ensemble methods in machine learning. In International Workshop on Multiple Classifier Systems. New York:
Springer, pp. 1–15.
Dua, Dheeru, and Casey Graff. 2019. UCI Machine Learning Repository. Irvine: School of Information and Computer Science, University
of California.
Durrant, Robert J., and Ata Kabán. 2015. Random projections as regularizers: Learning a linear discriminant from fewer observations
than dimensions. Machine Learning 99: 257–86. [CrossRef]
Dvorsky, Jan, Jaroslav Belas, Beata Gavurova, and Tomas Brabenec. 2021. Business risk management in the context of small and
medium-sized enterprises. Economic Research-Ekonomska Istraživanja 34: 1690–708. [CrossRef]
Fan, Jianqing, Yang Feng, Jiancheng Jiang, and Xin Tong. 2016. Feature augmentation via nonparametrics and selection (fans) in
high-dimensional classification. Journal of the American Statistical Association 111: 275–87. [CrossRef] [PubMed]
Fan, Jianqing, Yang Feng, and Rui Song. 2011. Nonparametric independence screening in sparse ultra-high-dimensional additive
models. Journal of the American Statistical Association 106: 544–57. [CrossRef]
Fan, Jianqing, Yang Feng, and Xin Tong. 2012. A road to classification in high dimensional space: The regularized optimal affine
discriminant. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 74: 745–71. [CrossRef]
Fan, Yingying, Yinfei Kong, Daoji Li, and Zemin Zheng. 2015. Innovated interaction screening for high-dimensional nonlinear
classification. The Annals of Statistics 43: 1243–72. [CrossRef]
Feng, Yang, and Qingfeng Liu. 2020. Nested model averaging on solution path for high-dimensional linear regression. Stat 9: e317.
[CrossRef]
Feng, Yang, Qingfeng Liu, Qingsong Yao, and Guoqing Zhao. 2021a. Model averaging for nonlinear regression models. Journal of
Business & Economic Statistics 1–14. [CrossRef]
Feng, Yang, Xin Tong, and Weining Xin. 2021b. Targeted Crisis Risk Control: A Neyman-Pearson Approach. Available online:
https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3945980 (accessed on 27 September 2020).
Fix, Evelyn, and Joseph Lawson Hodges. 1989. Discriminatory analysis. Nonparametric discrimination: Consistency properties.
International Statistical Review/Revue Internationale de Statistique 57: 238–47. [CrossRef]
Gao, Xiaoli, S. E. Ahmed, and Yang Feng. 2017. Post selection shrinkage estimation for high-dimensional data analysis. Applied
Stochastic Models in Business and Industry 33: 97–120. [CrossRef]
Hao, Ning, Yang Feng, and Hao Helen Zhang. 2018. Model selection for high-dimensional quadratic regression via regularization.
Journal of the American Statistical Association 113: 615–25. [CrossRef]
Higuera, Clara, Katheleen J. Gardiner, and Krzysztof J. Cios. 2015. Self-organizing feature maps identify proteins critical to learning in
a mouse model of down syndrome. PLoS ONE 10: e0129126. [CrossRef]
Ho, Tin Kam. 1998. The random subspace method for constructing decision forests. IEEE Transactions on Pattern Analysis and Machine
Intelligence 20: 832–44.
Jurgovsky, Johannes, Michael Granitzer, Konstantin Ziegler, Sylvie Calabretto, Pierre-Edouard Portier, Liyun He-Guelton, and Olivier
Caelen. 2018. Sequence classification for credit-card fraud detection. Expert Systems with Applications 100: 234–45. [CrossRef]
Kotsiantis, Sotiris B., I. Zaharakis, and P. Pintelas. 2007. Supervised machine learning: A review of classification techniques. Emerging
Artificial Intelligence Applications in Computer Engineering 160: 3–24.
Mai, Qing, Hui Zou, and Ming Yuan. 2012. A direct approach to sparse discriminant analysis in ultra-high dimensions. Biometrika 99:
29–42. [CrossRef]
Michalski, Grzegorz, Małgorzata Rutkowska-Podołowska, and Adam Sulich. 2018. Remodeling of fliem: The cash management in
Polish small and medium firms with full operating cycle in various business environments. In Efficiency in Business and Economics.
Berlin/Heidelberg: Springer, pp. 119–32.
Raftery, Adrian E., David Madigan, and Jennifer A. Hoeting. 1997. Bayesian model averaging for linear regression models. Journal of
the American Statistical Association 92: 179–91. [CrossRef]
Rokach, Lior. 2010. Ensemble-based classifiers. Artificial Intelligence Review 33: 1–39. [CrossRef]
Saldana, Diego Franco, and Yang Feng. 2018. SIS: An R package for sure independence screening in ultrahigh-dimensional statistical
models. Journal of Statistical Software 83: 1–25. [CrossRef]
Steinwart, Ingo, and Andreas Christmann. 2008. Support Vector Machines. Berlin/Heidelberg: Springer Science & Business Media.
Szczygieł, N., M. Rutkowska-Podolska, and Grzegorz Michalski. 2014. Information and communication technologies in healthcare:
Still innovation or reality? Innovative and entrepreneurial value-creating approach in healthcare management. Paper presented
at the 5th Central European Conference in Regional Science, Košice, Slovakia, October 5–8. pp. 1020–29.
Tian, Ye, and Yang Feng. 2021a. RaSE: A variable screening framework via random subspace ensembles. Journal of the American
Statistical Association 1–30, accepted. [CrossRef]
Tian, Ye, and Yang Feng. 2021b. RaSE: Random subspace ensemble classification. Journal of Machine Learning Research 22: 1–93.
Tong, Xin, Yang Feng, and Jingyi Jessica Li. 2018. Neyman-Pearson classification algorithms and NP receiver operating characteristics.
Science Advances 4: eaao1659. [CrossRef]
Van der Laan, Mark J., Eric C. Polley, and Alan E. Hubbard. 2007. Super learner. Statistical Applications in Genetics and Molecular
Biology 6. [CrossRef] [PubMed]
Yousuf, Kashif, and Yang Feng. 2021. Targeting predictors via partial distance correlation with applications to financial forecasting.
Journal of Business & Economic Statistics 1–13. [CrossRef]