Article
Super RaSE: Super Random Subspace Ensemble Classification
Jianan Zhu and Yang Feng *
Department of Biostatistics, School of Global Public Health, New York University, New York, NY 10003, USA;
[email protected]
* Correspondence: [email protected]
Abstract: We propose a new ensemble classification algorithm, named super random subspace ensemble
(Super RaSE), to tackle the sparse classification problem. The proposed algorithm is motivated by
the random subspace ensemble algorithm (RaSE). The RaSE method was shown to be a flexible
framework that can be coupled with any existing base classifier. However, the success of RaSE
largely depends on the proper choice of the base classifier, which is unfortunately unknown to us. In
this work, we show that Super RaSE avoids the need to choose a base classifier by randomly sampling
a collection of classifiers together with the subspace. As a result, Super RaSE is more flexible and
robust than RaSE. In addition to the vanilla Super RaSE, we also develop the iterative Super RaSE,
which adaptively changes the base classifier distribution as well as the subspace distribution. We
show that the Super RaSE algorithm and its iterative version perform competitively for a wide range
of simulated data sets and two real data examples. The new Super RaSE algorithm and its iterative
version are implemented in a new version of the R package RaSEn.
Keywords: classification; ensemble; subspace; sparsity; feature ranking
the decision boundary nonlinear. Due to this, QDA is also capable of capturing interaction
effects (Tian and Feng 2021a). Another popular classification method is the so-called
logistic regression, which is a special case of generalized linear models. Different from
LDA and QDA, logistic regression directly models the conditional distribution of y, given
the features x, and it is usually viewed as a more robust classifier than LDA. Support vector
machine (SVM) is another type of classifier, which seeks the optimal separating hyperplane
and has been shown to outperform many classic classifiers when there are many features (Steinwart
and Christmann 2008). Breiman et al. (2017) developed classification and regression trees
(CART), which is a purely non-parametric classification method and is applicable to various
distributions for the features. Another popular non-parametric classification method is the
K nearest neighbor (KNN) proposed in Fix and Hodges (1989). KNN works by a majority
vote from the labels of K training observations that are nearest to the new observation to
be classified.
Although the classifiers described in the previous paragraph have been shown to
work well in many applications, researchers have realized that a single application of
these classifiers could result in unstable classification performance (Dietterich 2000). To
address this issue, there has been a surge of research interest in ensemble learning methods.
Ensemble learning is a general machine learning framework, which combines multiple
learning algorithms to obtain better prediction performance and greater stability than
any single algorithm (Dietterich 2000; Rokach 2010). Some popular examples include
bagging (Breiman 1996) and random forests (Breiman 2001), which aggregate a collection
of weak learners formed by decision trees. More recent ensemble learning methods include
the random subspace method (Ho 1998), super learner (Van der Laan et al. 2007), model
averaging (Feng and Liu 2020; Feng et al. 2021a; Raftery et al. 1997), random rotation
(Blaser and Fryzlewicz 2016), random projection (Cannings and Samworth 2017; Durrant
and Kabán 2015), and random subspace ensemble classification (Tian and Feng 2021a,
2021b).
This paper is largely motivated by the random subspace ensemble (RaSE) classification
framework (Tian and Feng 2021b), which we will briefly review. Suppose we want to
predict the class label y from the feature vector x using n training observations {( xi , yi ), i =
1, . . . , n}. For a given base classifier (e.g., logistic regression), the RaSE algorithm aims
to construct B1 weak learners, where each weak learner is formed by applying the specified
base classifier on a properly chosen subspace, which is a subset of the whole feature space
represented as {1, · · · , p}. Taking logistic regression as an example, if the subspace contains
the variables {1, 3, 9}, then we will fit a logistic regression model, but only using the 1st,
3rd, and 9th variables. To choose each subspace, B2 random subspaces are generated
according to a hierarchical uniform distribution and the optimal one is selected according
to certain criteria (e.g., cross-validation error). The main idea is that among all the B2
candidate subspaces, the one with the best criterion value tends to be of the highest
quality, which improves the performance of the final ensemble classifier. In the end, the
predicted labels from the B1 weak learners are averaged and compared to a data-driven
threshold, forming the final classifier. Tian and Feng (2021b) also proposed an iterative
version of RaSE which updates the random subspace distribution according to the selected
proportions of each feature.
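To make the subspace idea concrete, the following short R sketch (not the RaSEn implementation; the simulated data, the helper fit_on_subspace, and all object names are made up for illustration) fits one such weak learner: a logistic regression restricted to the subspace {1, 3, 9} mentioned above.

# Minimal sketch: fit one RaSE-style weak learner, i.e., a base classifier
# restricted to a candidate subspace S (toy data, not from the paper).
set.seed(1)
n <- 200; p <- 50
x <- matrix(rnorm(n * p), n, p)
y <- rbinom(n, 1, plogis(x[, 1] - x[, 3] + 0.5 * x[, 9]))

# Fit logistic regression using only the features in subspace S
fit_on_subspace <- function(S, x, y) {
  dat <- data.frame(y = y, x[, S, drop = FALSE])
  glm(y ~ ., family = binomial(), data = dat)
}

S <- c(1, 3, 9)                        # the illustrative subspace from the text
fit <- fit_on_subspace(S, x, y)
head(predict(fit, type = "response"))  # class-1 probabilities of this weak learner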
Powerful though the RaSE algorithm and its iterative versions are, one major limitation
is that one needs to specify a single base classifier prior to using the RaSE framework. The
success of the RaSE algorithms largely depends on whether the base classifier is suitable for
the application scenario. As shown in the numerical experiments in Tian and Feng (2021b),
the RaSE algorithm could fail to work well if the base classifier is not properly set. In
this regard, blindly applying the RaSE algorithm has the risk of wrongly setting the base
classifier, leading to poor performance.
The aim of this research is to address the limitation of RaSE, where only a single base
classifier can be used. In particular, we aim to replace the single base classifier with a
collection of base classifiers, and also develop a new framework that can adaptively choose
the appropriate base classifier for a given data set. We call the new ensemble classification
framework the super random subspace ensemble (Super RaSE).
The working mechanism of Super RaSE is that in addition to randomly generating
the subspaces, it also generates the base classifiers to be used together with the subspaces.
More specifically, instead of fixing a base classifier and using it for all subspaces, each time,
the Super RaSE randomly generates the base classifier (from a collection of base classifiers)
and the subspace as a pair, and then picks the best performing pair among the B2 ones via
five-fold cross validation to form one of the B1 weak learners. Then, the predictions of the
B1 weak learners are averaged and compared to a data-driven threshold, forming the final
prediction of Super RaSE.
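The following R sketch illustrates this pairing mechanism under simplifying assumptions; it is not the RaSEn implementation. The candidate collection (logistic regression and KNN), the helper cv_error, the choice of k = 5 neighbors, and the small values of B1 and B2 are ours for illustration only.

# Illustrative sketch of the Super RaSE pairing step: for each of the B1 weak
# learners, draw B2 (base classifier, subspace) pairs, score each pair by
# five-fold cross-validation error, and keep the best pair.
library(class)                      # for knn()

set.seed(1)
n <- 200; p <- 50; D <- 10          # D = upper bound on subspace size
x <- matrix(rnorm(n * p), n, p)
y <- rbinom(n, 1, plogis(x[, 1] - x[, 2] + x[, 5]))
bases <- c("logistic", "knn")       # small candidate collection for illustration

cv_error <- function(base, S, x, y, K = 5) {
  folds <- sample(rep(1:K, length.out = length(y)))
  errs <- sapply(1:K, function(k) {
    tr <- folds != k; te <- !tr
    if (base == "logistic") {
      dat <- data.frame(y = y, x[, S, drop = FALSE])
      fit <- glm(y ~ ., family = binomial(), data = dat[tr, ])
      pred <- as.numeric(predict(fit, newdata = dat[te, ], type = "response") > 0.5)
    } else {
      pred <- as.numeric(as.character(
        knn(x[tr, S, drop = FALSE], x[te, S, drop = FALSE], factor(y[tr]), k = 5)))
    }
    mean(pred != y[te])
  })
  mean(errs)
}

B1 <- 20; B2 <- 50                  # small values for illustration only
weak_learners <- lapply(1:B1, function(j) {
  cand <- lapply(1:B2, function(k) {
    d <- sample(1:D, 1)                       # hierarchical uniform subspace
    list(base = sample(bases, 1), S = sort(sample(1:p, d)))
  })
  errs <- sapply(cand, function(ck) cv_error(ck$base, ck$S, x, y))
  cand[[which.min(errs)]]                     # best (classifier, subspace) pair
})
table(sapply(weak_learners, `[[`, "base"))    # how often each base classifier won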
The main contribution of the paper is three-fold. First, the Super RaSE algorithm
adaptively chooses the base classifier and subspace pair, which makes it a fully model-
free approach. Second, in addition to its accurate prediction performance, Super
RaSE computes the selected proportion of each base classifier among the B1 weak learners,
which indicates how appropriate each base classifier is for the specific scenario, and, for
each base classifier, the selected proportion of each feature, which measures the importance
of each feature. Third, we propose an iterative Super RaSE algorithm, which updates
the sampling distribution of base classifiers as well as the sampling distribution of the
subspaces for each base classifier.
The rest of the paper is organized as follows. In Section 2, we introduce the super
random subspace ensemble (SRaSE) classification algorithm as well as its iterative version.
Section 3 conducts extensive simulation studies to show the superior performance of
SRaSE and its iterative version by comparing them with competing methods, including the
original RaSE algorithm. In Section 4, we evaluate the SRaSE algorithms with two real data
sets and show that they perform competitively. Lastly, we conclude the paper with a short
discussion in Section 5.
2. Methods
Suppose that we have n pairs of observations {(x_i, y_i), i = 1, . . . , n} that are i.i.d. copies
of (x, y) ∈ R^p × {0, 1}, where p is the number of predictors and y ∈ {0, 1} is the class label.
We use S_Full = {1, · · · , p} to represent the whole feature set. We assume that the marginal densities
of x for class 0 (y = 0) and class 1 (y = 1) exist and are denoted by f^(0) and f^(1), respectively.
Thus, the joint distribution of (x, y) can be described by the following mixture model:

x ∼ π_0 f^(0) + π_1 f^(1),

where y is a Bernoulli variable with success probability π_1 = 1 − π_0 ∈ (0, 1). For any
subspace S, we use |S| to denote its cardinality. When restricting to the feature subspace
S, the corresponding marginal densities of class 0 and class 1 are denoted by f_S^(0) and f_S^(1),
respectively.
Here, we are concerned with a high-dimensional classification problem, where the
dimension p is comparable to or even larger than the sample size n. In high-dimensional
problems, we usually believe that only a handful of features contribute to the
response, which is usually referred to as the sparse classification problem. For sparse classifi-
cation problems, it is important to accurately separate the signals from the noise. Following
Tian and Feng (2021b), we introduce the definition of a discriminative set.
A feature subset S is called a discriminative set if y is conditionally independent of
x_{S^c} given x_S, where S^c = S_Full \ S. We call S a minimal discriminative set if it has minimal
cardinality among all discriminative sets, and we denote it by S∗.
Following Tian and Feng (2021b), by default, the subspace distribution D is chosen
as a hierarchical uniform distribution over the subspaces. In particular, with D as the upper
bound of the subspace size, we first generate the subspace size d from the discrete uniform
distribution over {1, · · · , D}. Then, the subspaces {S_{jk}, j = 1, · · · , M, k = 1, · · · , p} are
independent and follow the uniform distribution over the set {S ⊆ SFull : |S| = d}. In
addition, in Step 7 of Algorithm 1, we choose the decision threshold to minimize the
empirical classification error on the training set,
α̂ = arg min_{α ∈ [0,1]} [ π̂_0 (1 − Ĝ_n^(0)(α)) + π̂_1 Ĝ_n^(1)(α) ],

where

n_r = ∑_{i=1}^{n} 1(y_i = r),   r = 0, 1,

π̂_r = n_r / n,   r = 0, 1,

Ĝ_n^(r)(α) = (1/n_r) ∑_{i=1}^{n} 1(y_i = r) 1(ν_n(x_i) ≤ α),   r = 0, 1,

and

ν_n(x) = B_1^{-1} ∑_{j=1}^{B_1} C_n^{T_{j∗}-S_{j∗}}(x).
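As an illustration of this thresholding step, the R sketch below (a simplified stand-alone function, not the RaSEn code; the score vector nu and response y are hypothetical toy inputs) computes α̂ by evaluating the empirical error at the observed ensemble scores.

# Sketch of the data-driven threshold: minimize the empirical error
# pi0_hat * (1 - G0(alpha)) + pi1_hat * G1(alpha) over candidate thresholds
# alpha (here, the observed ensemble scores).
choose_threshold <- function(nu, y) {
  pi0 <- mean(y == 0); pi1 <- mean(y == 1)
  alphas <- sort(unique(c(0, nu, 1)))
  err <- sapply(alphas, function(a) {
    G0 <- mean(nu[y == 0] <= a)     # empirical cdf of scores in class 0
    G1 <- mean(nu[y == 1] <= a)     # empirical cdf of scores in class 1
    pi0 * (1 - G0) + pi1 * G1
  })
  alphas[which.min(err)]
}

# Toy usage with a hypothetical score vector nu in [0, 1]:
set.seed(1)
y  <- rbinom(100, 1, 0.4)
nu <- pmin(pmax(rnorm(100, mean = 0.3 + 0.4 * y, sd = 0.2), 0), 1)
alpha_hat <- choose_threshold(nu, y)
pred <- as.numeric(nu > alpha_hat)  # final thresholded prediction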
In Algorithm 1, there are two important by-products. The first one is the selected
proportion of each method, ζ = (ζ_1, · · · , ζ_M)^T, out of the B1 weak learners. The higher the
proportion for a method (e.g., KNN), the more appropriate it may be for the particular data set. In
numerical studies and real data analyses, we provide more interpretations of the results.
Now, let us introduce the second by-product of Algorithm 1. For each method T_i,
i = 1, · · · , M, we have the selected proportion of each feature, η_i = (η_{i1}, · · · , η_{ip})^T. The
feature selection proportion depends on the particular base method. The underlying reason
is that when we use different base methods on the same data, different signals may be found.
For example, if some predictors contribute to the response only through an interaction
effect with other predictors, they may be detected using the quadratic discriminant analysis
(QDA) method; however, they cannot be identified using the linear discriminant analysis
(LDA) method since the LDA only considers the linear effects of features. These base-
classifier-dependent feature selection frequencies will produce a better understanding of
the working mechanism of each base classifier as well as the nature of each feature's
importance.
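Both by-products are simple selection frequencies over the B1 chosen pairs. Continuing with the hypothetical weak_learners list, bases vector, and dimension p from the earlier sketch, they could be computed as follows.

# Sketch of the two by-products, computed from the B1 selected pairs
# (weak_learners, bases, and p come from the earlier sketch).
methods <- sapply(weak_learners, `[[`, "base")

# zeta: selected proportion of each base classifier among the B1 weak learners
zeta <- prop.table(table(factor(methods, levels = bases)))

# eta: for each base classifier, the selected proportion of each feature
eta <- sapply(bases, function(b) {
  keep <- weak_learners[methods == b]
  if (length(keep) == 0) return(rep(0, p))
  tabulate(unlist(lapply(keep, `[[`, "S")), nbins = p) / length(keep)
})
rownames(eta) <- paste0("feature", 1:p)   # e.g., eta[, "knn"] ranks features under KNN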
In the iterative Super RaSE algorithm, the base classifier distribution is initially set to
be D^(0), which is a uniform distribution over all base classifiers by default. As the iteration
proceeds, the base classifiers that are more frequently selected will have a higher chance
of being selected in the next step, resulting in a different D^(t). The adaptive nature of the
iterative Super RaSE algorithm enables us to discover the best performing base methods
for each data set and in turn, reduce its classification error.
Besides the base classifier distribution, the subspace distribution is also continuously
updated during the iteration process. In our implementation of the algorithm, the initial
subspace distribution for each base classifier, D_i^(0), is the hierarchical uniform distribution
introduced in Section 2.1. After running the Super RaSE algorithm once, the features
that are more frequently selected are given higher weights in D_i^(1). Through this mechanism, we
give an edge to the useful features, which could further boost performance. In addition,
for each given base classifier, the selected frequencies of each feature can be viewed as an
importance measure of the features.
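A minimal sketch of one such update, continuing the objects zeta, eta, bases, D, and p from the earlier sketches: the selection frequencies are turned into sampling weights, with a small floor (our choice, not from Algorithm 2) so that no base classifier or feature is excluded outright.

# Sketch of one iteration update (our simplification, not the exact rule in
# Algorithm 2): re-weight the base classifier distribution and, for each base
# classifier, the feature sampling distribution by their selection frequencies.
eps <- 0.01                                     # small floor, our choice
classifier_weights <- (zeta + eps) / sum(zeta + eps)
feature_weights <- apply(eta, 2, function(freq) (freq + eps) / sum(freq + eps))

# In the next round, a (classifier, subspace) pair could then be drawn as:
b <- sample(bases, 1, prob = classifier_weights)
d <- sample(1:D, 1)
S <- sort(sample(1:p, d, prob = feature_weights[, b]))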
3. Simulation Studies
In this section, we conduct extensive simulation studies on the proposed Super RaSE
algorithm (Algorithm 1) and its iterative version (Algorithm 2) with candidate base clas-
sifiers set as T = { LDA, QDA, KNN }. In addition, we compare their performance with
several competing methods, including the original RaSE with LDA, QDA, and KNN as the
base classifier (Tian and Feng 2021b), as well as LDA, QDA, KNN, and random forest (RF).
We use the default values for all parameters in the Super RaSE algorithm and its iterative
version (B1 = 200, B2 = 500).
For all experiments, we conduct 200 replications and report the summary of test errors
(in percentage) in terms of the mean and standard deviation. We use boldface to highlight
the method with minimal test error for each setting, and use italics to highlight the methods
that achieve test errors within one standard deviation of the smallest error.
Table 1. Summary of test classification error rates for each classifier under various sample sizes
over 200 repetitions in Model 1 (LDA). The results are presented as mean values with the standard
deviation values in parentheses. Here, the best performing method in each column is highlighted in
bold and the methods that are within one standard deviation away are highlighted in italic.
As we can see, RaSE1 -KNN performs the best when the sample size n = 200, and
RaSE2 -QDA performs the best for n = 400 and 1000. It is worth noting that the results for
LDA and QDA are NA, due to the small sample size compared with the dimension. By
inspecting the performance of the proposed Super RaSE algorithm and its iterative version,
we can see that although they are not the best performing method, both SRaSE1 and SRaSE2
are within one standard error of the best performing method, showing the robustness of
Super RaSE. Clearly, one iteration helps Super RaSE to have a lower test classification error.
In addition to the test classification error, it is useful to investigate the two by-products
of our algorithms, namely the selected proportion of each base classifier among the B1
classifiers, and the selected proportion of each feature among the weak learners that use a
particular classifier.
Let us take a look at Figure 1. The first row shows the bar charts for the selection
percentage of each base classifier in the Super RaSE algorithm among the 200 repetitions
when the sample size n varies in {200, 400, 1000}. It shows that when n = 200, the
percentage of LDA is around 50%. As the sample size increases, the percentage of LDA
also increases, showing that having a larger sample size helps us to select the model from
which the data are generated.
Now, let us look at the column of n = 1000; we can see that as the iteration process
moves on, the percentage of LDA is increasing as well, leading to almost 100% for SRaSE2 .
The second by-product of the Super RaSE algorithm and its iterative version is the selected
frequency of each feature among the weak learners that use a particular base classifier. Figure 2
visualizes the selected proportions of features among all the B1 classifiers that use LDA as
the base classifier. In particular, we show the selected proportions for each feature in the
minimal discriminative set S∗ = {1, 2, 5}. In the same figure, we also show a boxplot of
the selected proportion of all the noisy features as a way to verify whether the Super RaSE
algorithms can distinguish the important features from the noisy features.
From Figure 2, we observe that when n = 200, the vanilla Super RaSE algorithm
does not select the important variables 2 and 5 with a high percentage. However, the
iteration greatly helps the algorithm to increase the selected percentages for features 2 and
5. In addition, increasing the sample size leads to all important features being selected
with a percentage of almost 100%. It is also worth noting that the noise features all have a
relatively small selection frequency, showing the power of feature ranking in the Super
RaSE algorithms. Similar figures can also be generated for the selected proportions of
features among the B1 classifiers that use QDA and KNN as the base classifier, respectively.
For simplicity, we omit these figures in our presentation.
Figure 1. The average selected proportion for each base method for different sample sizes (corre-
sponding to each column) and iteration number (corresponding to each row) in Model 1 (LDA).
Figure 2. The average selected proportion for each feature for different sample sizes (corresponding
to each column) and iteration number (corresponding to each row) in Model 1 (LDA).
ric matrix with Ω_{10,10} = −0.3758, Ω_{10,30} = 0.0616, Ω_{10,50} = 0.2037, Ω_{30,30} = −0.5482,
Ω_{30,50} = 0.0286, Ω_{50,50} = −0.4614, and all other entries zero. The dimension is p = 200 and
we again vary the sample size n ∈ {200, 400, 1000}.
We can easily verify that the feature subset {1, 2, 10, 30, 50} is the minimal discrimina-
tive set S∗ . In Table 2, we present the performance of different methods for Model 2 under
different sample sizes.
Table 2. Summary of test classification error rates for each classifier under various sample sizes
over 200 repetitions in Model 2 (QDA). The results are presented as mean values with the standard
deviation values in parentheses. Here, the best performing method in each column is highlighted in
bold and the methods that are within one standard deviation away are highlighted in italic.
From the table, when n = 400 and n = 1000, SRaSE2 is the best performing method,
achieving the smallest test classification error. When n = 200, SRaSE2 is within one
standard error of the best performing method. This result is very encouraging in the sense
that Super RaSE can improve upon the original RaSE even when the latter is coupled with the true
model from which the data are generated. This shows that the Super RaSE algorithms are
extremely robust and avoid the need to choose a base classifier as needed in the original
RaSE algorithm.
As in Model 1, we again present the average selected proportion of each base classifier
in Figure 3 as well as the average selected proportion of each feature among the B1 weak
learners with the chosen base classifier being QDA in Figure 4. In Figure 4, we also show a
boxplot of the selected proportion of all the noisy features.
In Figure 3, we again observe that as we iterate the Super RaSE algorithm, the propor-
tion of QDA greatly increases, reaching almost 100 percent for SRaSE2 over all sample sizes.
This shows that the iterative Super RaSE algorithm is able to identify the true model.
In Figure 4, we can observe that the average selected proportions of each important
feature are pretty high, showing that the Super RaSE is able to pick up the important
features. In particular, when n = 1000, the iteration helps all the features to have a higher
selected proportion, reaching nearly 100 percent for SRaSE2 .
Figure 3. The average selected proportion for each base method for different sample sizes (corre-
sponding to each column) and iteration number (corresponding to each row) in Model 2 (QDA).
Figure 4. The average selected proportion for each feature for different sample sizes (corresponding
to each column) and iteration number (corresponding to each row) in Model 2 (QDA).
generating a mixture of 10 Gaussian clusters that surround each of the {z1 , · · · , z10 } when
they are embedded in a p-dimensional space.
From the data generation process, the minimal discriminative set is S∗ = {1, 2, 3, 4, 5}.
We consider p = 200, and n ∈ {200, 400, 1000}. The summary of the test classification error
rates over 200 repetitions is presented in Table 3.
Table 3. Summary of test classification error rates for each classifier under various sample sizes
over 200 repetitions in Model 3 (KNN). The results are presented as mean values with the standard
deviation values in parentheses. Here, the best performing method in each column is highlighted in
bold and the methods that are within one standard deviation away are highlighted in italic.
From Table 3, we can see that Super RaSE and its iterative versions still have very com-
petitive performance compared with other methods. In particular, both SRaSE1 and SRaSE2
reach test errors that are within one standard deviation of the best performing method
RaSE2-KNN, which uses the knowledge of the data generation process. As model-free
procedures, the Super RaSE algorithm and its iterative versions are very appealing, achieving
performance similar to that of the best-performing method without the need to specify which base
classifier to use.
Similar to Models 1 and 2, we again present the average selected proportion of each
base method in Figure 5. Figure 6 visualizes the selected proportions of features among all
the B1 classifiers that use KNN as the base classifier. In addition to the bar chart for the
average selected proportion of each important feature, Figure 6 also includes a boxplot of
the selected proportion of all the noisy features.
In Figure 5, we observe a similar story to that in Figures 1 and 3. The av-
erage selection percentage of KNN is almost 100 percent for both SRaSE1 and SRaSE2,
showing that one step of iteration is enough to almost always find the best classifier for this
particular model.
In Figure 6, we can see that without iteration, Super RaSE, on average, selects the
five important features only around 50 percent of the time over the 200 repetitions. With the help of
iterations, the percentages of all five important features increase substantially, to almost
100 percent when n = 1000. This experiment shows the merit of iterative Super RaSE in
terms of capturing the important features.
Figure 5. The average selected proportion for each base method for different sample sizes (corre-
sponding to each column) and iteration number (corresponding to each row) in Model 3 (KNN).
Figure 6. The average selected proportion for each feature for different sample sizes (corresponding
to each column) and iteration number (corresponding to each row) in Model 3 (KNN).
Table 4. Summary of test classification error rates for each classifier under various sample sizes over
200 repetitions for the mice protein expression data set. The results are presented as mean values
with the standard deviation values in parentheses. Here, the best performing method in each column
is highlighted in bold and the methods that are within one standard deviation away are highlighted
in italic.
Next, we show the average selected proportion for each base method for different
sample sizes and iteration numbers in Figure 7.
From this figure, we observe a very interesting phenomenon. That is, when n = 200,
the selected proportion of LDA is the highest (over 75%). This choice is very reasonable:
comparing the RaSE classifiers with different base classifiers in Table 4, RaSE1-LDA has the
best performance among the RaSE1 variants. Now, looking at the case
when n = 800, both Super RaSE and SRaSE1 select KNN over 95% of the time.
Again, looking at the RaSE classifier with a fixed base classifier, it is easy to observe
from Table 4 that RaSE1-KNN has a much better performance than both RaSE1-LDA and
RaSE1-QDA. This shows that the Super RaSE algorithm is very adaptive to the specific
scenario in the sense that, for each of the B1 weak learners, it automatically selects the best
pair among the B2 randomly generated base classifier and subspace pairs.
Figure 7. The average selected proportion for each base method for different sample sizes (corre-
sponding to each column) and iteration number (corresponding to each row) for the mice protein
expression data.
Table 5. Summary of test classification error rates for each classifier under various sample sizes over
200 repetitions for the hand-written digits recognition data set. The results are presented as mean
values with the standard deviation values in parentheses. Here, the best performing method in each
column is highlighted in bold and the methods that are within one standard deviation away are
highlighted in italic.
n = 50 n = 100 n = 200
SRaSE 2(1.06) 1.2(0.65) 0.78(0.48)
SRaSE1 1.78(1.03) 1.04(0.57) 0.62(0.37)
RaSE-LDA 1.56(0.85) 1.13(0.59) 0.8(0.54)
RaSE-QDA 2.5(1.47) 1.89(0.91) 1.47(0.96)
RaSE-KNN 1.86(0.96) 1.12(0.66) 0.75(0.45)
RaSE1 -LDA 1.06(0.63) 0.7(0.35) 0.53(0.40)
RaSE1 -QDA 2.18(1.66) 1.18(0.71) 0.85(0.61)
RaSE1 -KNN 1.72(0.95) 1.02(0.62) 0.6(0.44)
LDA NA 1.82(0.96) 1.01(0.56)
QDA NA NA 3.25(2.32)
KNN 1.42(1.32) 0.67(0.41) 0.6(0.47)
RF 2.34(1.24) 1.63(0.73) 1.37(0.74)
From Table 5, we can see that the Super RaSE and its iterative versions still have
competitive performance, especially when n = 100 and n = 200. SRaSE1 has a very similar
performance to the best performing method when n = 100 and n = 200. Taking a closer
look, it is interesting to observe that SRaSE1 closely mimics the performance of RaSE1-
KNN, leading us to wonder whether KNN is the most selected base method among the B1
classifiers. We confirm that this is the case by presenting the average selected proportion of
each base classifier for SRaSE and SRaSE1 in Figure 8.
From Figure 8, we can see that KNN is indeed the most selected base classifier across
all scenarios, with its selected proportion being almost 100% when n = 200.
Figure 8. The average selected proportion for each base method for different sample sizes (corre-
sponding to each column) and iteration number (corresponding to each row) for the hand-written
digits recognition data.
5. Discussion
In this work, motivated by the random subspace ensemble (RaSE) classification,
we propose a new ensemble classification framework, named Super RaSE, which is a
completely model-free approach. While RaSE has to be paired with a single base classifier,
Super RaSE can work with a collection of base classifiers, making it more flexible and
robust in applications. In particular, we recommend that practitioners include several
different base classifiers, both parametric and non-parametric ones, so as to make the
Super RaSE framework more powerful. It is worth noting that as we include more base
classifiers in the Super RaSE algorithm, larger values of B1 and B2 may be needed: a larger
B2 generates a sufficient number of candidate base classifier and subspace pairs, while a
larger B1 leads to a more stable ensemble. By randomly generating base classifier and
subspace pairs, Super RaSE automatically selects a collection of good
pairs. Besides the superb prediction performance on the test
data, the Super RaSE algorithms also provide two important by-products. The first one
is a quality measure of each base classifier, which can provide insight as to which base
classifiers are more appropriate for the given data. The second by-product is that, given
each base classifier, Super RaSE also generates an importance measure for each
feature, which can be used for feature screening and ranking (Fan et al. 2011; Saldana and
Feng 2018; Yousuf and Feng 2021), just like the original RaSE (Tian and Feng 2021a). The
feature importance measure can provide an interpretable machine learning model, which is
very useful in understanding how the method works.
There are many possible future research directions. First, this paper only considers
the binary classification problem, but there may be more than two classes in applications.
How to extend the Super RaSE algorithm to the multiclass situation is an important topic.
In particular, we need to find multiclass classification algorithms as well as a new decision
function when we ensemble the B1 weak learners, since the original thresholding
step in line 9 of Algorithm 1 is no longer applicable. Second, in addition to the classification
problem, it is also worthwhile to study the corresponding algorithm under a regression
setting. To achieve this, we can change the prediction step from thresholding to a simple
averaging. Third, using the selected proportion of features, it is possible to develop a
variable selection or variable screening algorithm. For example, we can set a threshold for
variable screening which is expected to capture important interaction effects in addition
to the marginal ones. The interaction detection problem has been well studied in high-
dimensional quadratic regression (Hao et al. 2018). However, Super RaSE could greatly
expand the scope within which interaction selection works, allowing the interactions
among features to take different forms. Lastly, in both RaSE and Super RaSE, we use
all the observations in each combination of the base classifier and a random subspace. We
know that random forest is viewed as a very robust classifier, partly due to the fact that
it uses a bagging sample (Breiman 1996) for each tree during the ensemble process. This
particular bagging step makes the trees less correlated, which reduces the variance of the final
classifier. A natural extension of Super RaSE is Bagging Super RaSE, which uses a bagging
sample for each of the B1 groups. It is worth noting that it may not be a good idea to use
a different bagging sample for each of the B2 pairs of a base classifier and an associated
random subspace, due to the fact that it could lead to an unfair comparison among the
B2 pairs.
Author Contributions: Conceptualization, Y.F.; methodology, J.Z. and Y.F.; software, J.Z. and Y.F.;
writing—original draft preparation, J.Z. and Y.F.; writing—review and editing, J.Z. and Y.F. All
authors have read and agreed to the published version of the manuscript.
Funding: Feng was partially supported by the NSF CAREER Grant DMS-2013789 and NIH Grant
1R21AG074205-01.
Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Not applicable.
Acknowledgments: The authors thank the editor Syed Ejaz Ahmed, the associate editor, and the
three reviewers for providing many constructive comments that greatly improved the quality and
scope of the paper.
Conflicts of Interest: The authors declare no conflict of interest.
References
Blaser, Rico, and Piotr Fryzlewicz. 2016. Random rotation ensembles. The Journal of Machine Learning Research 17: 126–51.
Breiman, Leo. 1996. Bagging predictors. Machine Learning 24: 123–40. [CrossRef]
Breiman, Leo. 2001. Random forests. Machine Learning 45: 5–32. [CrossRef]
Breiman, Leo, Jerome H. Friedman, Richard A. Olshen, and Charles J. Stone. 2017. Classification and Regression Trees. Boca Raton:
Routledge.
Cannings, Timothy I., and Richard J. Samworth. 2017. Random-projection ensemble classification. Journal of the Royal Statistical Society:
Series B (Statistical Methodology) 79: 959–1035. [CrossRef]
Dainotti, Alberto, Antonio Pescape, and Kimberly C. Claffy. 2012. Issues and future directions in traffic classification. IEEE Network 26:
35–40. [CrossRef]
Dietterich, Thomas G. 2000. Ensemble methods in machine learning. In International Workshop on Multiple Classifier Systems. New York:
Springer, pp. 1–15.
Dua, Dheeru, and Casey Graff. 2019. UCI Machine Learning Repository. Irvine: School of Information and Computer Science, University
of California.
Durrant, Robert J., and Ata Kabán. 2015. Random projections as regularizers: Learning a linear discriminant from fewer observations
than dimensions. Machine Learning 99: 257–86. [CrossRef]
Dvorsky, Jan, Jaroslav Belas, Beata Gavurova, and Tomas Brabenec. 2021. Business risk management in the context of small and
medium-sized enterprises. Economic Research-Ekonomska Istraživanja 34: 1690–708. [CrossRef]
Fan, Jianqing, Yang Feng, Jiancheng Jiang, and Xin Tong. 2016. Feature augmentation via nonparametrics and selection (fans) in
high-dimensional classification. Journal of the American Statistical Association 111: 275–87. [CrossRef] [PubMed]
Fan, Jianqing, Yang Feng, and Rui Song. 2011. Nonparametric independence screening in sparse ultra-high-dimensional additive
models. Journal of the American Statistical Association 106: 544–57. [CrossRef]
Fan, Jianqing, Yang Feng, and Xin Tong. 2012. A road to classification in high dimensional space: The regularized optimal affine
discriminant. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 74: 745–71. [CrossRef]
Fan, Yingying, Yinfei Kong, Daoji Li, and Zemin Zheng. 2015. Innovated interaction screening for high-dimensional nonlinear
classification. The Annals of Statistics 43: 1243–72. [CrossRef]
Feng, Yang, and Qingfeng Liu. 2020. Nested model averaging on solution path for high-dimensional linear regression. Stat 9: e317.
[CrossRef]
Feng, Yang, Qingfeng Liu, Qingsong Yao, and Guoqing Zhao. 2021a. Model averaging for nonlinear regression models. Journal of
Business & Economic Statistics 1–14. [CrossRef]
Feng, Yang, Xin Tong, and Weining Xin. 2021b. Targeted Crisis Risk Control: A Neyman-Pearson Approach. Available online:
https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3945980 (accessed on 27 September 2020).
Fix, Evelyn, and Joseph Lawson Hodges. 1989. Discriminatory analysis. Nonparametric discrimination: Consistency properties.
International Statistical Review/Revue Internationale de Statistique 57: 238–47. [CrossRef]
Gao, Xiaoli, S. E. Ahmed, and Yang Feng. 2017. Post selection shrinkage estimation for high-dimensional data analysis. Applied
Stochastic Models in Business and Industry 33: 97–120. [CrossRef]
Hao, Ning, Yang Feng, and Hao Helen Zhang. 2018. Model selection for high-dimensional quadratic regression via regularization.
Journal of the American Statistical Association 113: 615–25. [CrossRef]
Higuera, Clara, Katheleen J. Gardiner, and Krzysztof J. Cios. 2015. Self-organizing feature maps identify proteins critical to learning in
a mouse model of down syndrome. PLoS ONE 10: e0129126. [CrossRef]
Ho, Tin Kam. 1998. The random subspace method for constructing decision forests. IEEE Transactions on Pattern Analysis and Machine
Intelligence 20: 832–44.
Jurgovsky, Johannes, Michael Granitzer, Konstantin Ziegler, Sylvie Calabretto, Pierre-Edouard Portier, Liyun He-Guelton, and Olivier
Caelen. 2018. Sequence classification for credit-card fraud detection. Expert Systems with Applications 100: 234–45. [CrossRef]
Kotsiantis, Sotiris B., I. Zaharakis, and P. Pintelas. 2007. Supervised machine learning: A review of classification techniques. Emerging
Artificial Intelligence Applications in Computer Engineering 160: 3–24.
Mai, Qing, Hui Zou, and Ming Yuan. 2012. A direct approach to sparse discriminant analysis in ultra-high dimensions. Biometrika 99:
29–42. [CrossRef]
Michalski, Grzegorz, Małgorzata Rutkowska-Podołowska, and Adam Sulich. 2018. Remodeling of fliem: The cash management in
Polish small and medium firms with full operating cycle in various business environments. In Efficiency in Business and Economics.
Berlin/Heidelberg: Springer, pp. 119–32.
Raftery, Adrian E., David Madigan, and Jennifer A. Hoeting. 1997. Bayesian model averaging for linear regression models. Journal of
the American Statistical Association 92: 179–91. [CrossRef]
Rokach, Lior. 2010. Ensemble-based classifiers. Artificial Intelligence Review 33: 1–39. [CrossRef]
Saldana, Diego Franco, and Yang Feng. 2018. SIS: An R package for sure independence screening in ultrahigh-dimensional statistical
models. Journal of Statistical Software 83: 1–25. [CrossRef]
Steinwart, Ingo, and Andreas Christmann. 2008. Support Vector Machines. Berlin/Heidelberg: Springer Science & Business Media.
Szczygieł, N., M. Rutkowska-Podolska, and Grzegorz Michalski. 2014. Information and communication technologies in healthcare:
Still innovation or reality? Innovative and entrepreneurial value-creating approach in healthcare management. Paper presented
at the 5th Central European Conference in Regional Science, Košice, Slovakia, October 5–8. pp. 1020–29.
Tian, Ye, and Yang Feng. 2021a. RaSE: A variable screening framework via random subspace ensembles. Journal of the American
Statistical Association 1–30, accepted. [CrossRef]
Tian, Ye, and Yang Feng. 2021b. RaSE: Random subspace ensemble classification. Journal of Machine Learning Research 22: 1–93.
Tong, Xin, Yang Feng, and Jingyi Jessica Li. 2018. Neyman-Pearson classification algorithms and NP receiver operating characteristics.
Science Advances 4: eaao1659. [CrossRef]
Van der Laan, Mark J., Eric C. Polley, and Alan E. Hubbard. 2007. Super learner. Statistical Applications in Genetics and Molecular
Biology 6. [CrossRef] [PubMed]
Yousuf, Kashif, and Yang Feng. 2021. Targeting predictors via partial distance correlation with applications to financial forecasting.
Journal of Business & Economic Statistics 1–13. [CrossRef]