A Novel Intuitionistic Fuzzy Rough Instance Selection and A - 2024 - Expert Syst
Keywords: Kernelized fuzzy C-means; Intuitionistic fuzzy set; Rough set; Instance selection; Feature selection; SMOTE

Abstract: Due to the advancement of internet and lab based technologies, large volumes of high dimensional data are generated every day. These data usually suffer from several issues such as class imbalance, noise, later uncertainty, irrelevant and/or redundant features, and redundancy in size. These issues degrade the overall performance measures of the various machine learning algorithms. An efficient way to cope with such issues for large sized datasets is to apply efficient data reduction techniques. In recent years, numerous data reduction techniques based on fuzzy rough set theory have been presented to tackle these obstacles. However, building a method that can handle all the above mentioned issues simultaneously is still a challenging task. In this study, we handle all these obstacles simultaneously by introducing a novel approach that obtains the reduced dataset by combining kernelized intuitionistic fuzzy C-means with an intuitionistic fuzzy rough set model. The intuitionistic fuzzy rough set handles vagueness and uncertainty in a better way than fuzzy set aided models, as it makes use of membership, non-membership, as well as hesitancy to capture the uncertainty of real-valued datasets. To generate the membership and non-membership grades, a kernelized intuitionistic fuzzy C-means based notion is introduced. Further, an intuitionistic fuzzy rough set model is established by addressing lower and upper approximations based on a novel similarity relation. Moreover, all the necessary conditions are justified with the relevant mathematical theorems. Next, this model is employed for dimensionality reduction based on the concept of the discernibility matrix, using the idea of different classes' ratios to avoid noise. Thereafter, the positive region is defined with the help of the lower approximation. The positive region information is applied to tackle problematic instances available in both the minority and majority classes of an imbalanced dataset after the generation of artificial samples by the synthetic minority oversampling technique (SMOTE). A comprehensive experimental study is added to show the effectiveness of the proposed technique. Finally, a framework is established to improve the prediction of animal toxin peptides.
1. Introduction

1.1. Background of the study

In today's world, data has become an integral part of life with the developments in computer engineering, internet applications, and laboratory technologies (Chen, Mi, & Lin, 2020). With time, the size of datasets is exploding, and handling such large-volume datasets requires a lot of time, expense, and computational power, which is often extremely challenging for current hardware and algorithms (Li et al., 2017). Moreover, this usually leads to circumstances where the ratio of the increase in the amount of generated data to the growth of available tools for data analysis is extremely large. Dealing with such data brings great challenges to data mining, bioinformatics, machine learning, signal processing, and pattern recognition. Nonetheless, it is still an interesting requirement that datasets contain as much information as possible, and hence all the possible informative features and samples are essential to be added to our dataset (Ji et al., 2021).
1.2. Research challenges

There are drawbacks to having a lot of features, which may result in numerous issues such as vagueness, uncertainty, redundancy, noise, and irrelevancy (Yang, Chen, Li & Luo, 2022). These issues may lead the model being trained to discover incomplete or false patterns. The approach to fixing the 'curse of dimensionality' is to find the relevant and non-redundant features while avoiding noise, vagueness, and uncertainty during feature reduction (Jain, Tiwari, & Som, 2020). On the other hand, redundant size of the dataset always leads to overfitting issues. Redundant samples, as well as redundant and/or irrelevant features, together with the above mentioned issues, always degrade the performance measures of learning algorithms.

1.3. The motivation for the study

It is always demanding to tackle both size and dimension during the data reduction process to avoid misclassifications and degradation in the performance of learning algorithms. The main aim of the data reduction process is to reduce the original data by choosing the most representative information. The manifest benefits of dataset reduction are reduced time complexity and less excessive storage. Further, streamlined methods can be established from the reduced datasets, and the analysis findings can be effectively improved. Instance selection can be performed to obtain the representative instances by eliminating redundant data points, irrelevant instances, or errors from the original data, which makes it possible to mine critical information and efficiently obtain high-quality outcomes with less processing time, specifically for large-scale datasets. Moreover, attribute selection methods are employed to retain the significance and semantics of the original features (Qiu & Zhao, 2022), which offers better pattern recognition ability and improves the learning tasks.

1.4. The content of the study

Rough set aided approaches were efficaciously implemented to perform data reduction. Rough set theory was taken into consideration to identify a minimal representation of datasets from different domains. This concept was efficaciously applied to solve both instance and feature selection problems. Positive and boundary region notions were implemented for discarding redundant and irrelevant samples. Various feature selection techniques were discussed based on the discernibility matrix and the dependency function. The dependency function assisted idea was found to be more effective than the discernibility matrix based notion, as it produces unique reduct sets. In the rough set model, an equivalence relation is incorporated over a crisp set. Instance selection and feature selection were successfully performed with rough set theory. Here, the dependency function was used to find the dependency between conditional and decision attributes (Jensen & Shen, 2008). Using this dependency function, the reduct or minimal subset can be found for a dataset (Thangavel & Pethalakshmi, 2009). But this concept can handle only discrete or qualitative feature values, which causes smaller to higher information loss. This issue was resolved by using fuzzy rough set based approaches, which have handled real-world datasets directly, without discretization, by combining fuzzy sets (Zadeh, 1965) and rough sets (Pawlak, 1982), where uncertainty and vagueness were dealt with by Dubois and Prade (1992), Jensen (2008) and Jensen and Shen (2004). In fuzzy rough set theory, instead of a crisp relation, fuzzy similarity relations are applied over real-valued attributes. The lower and upper approximations are then defined based on this fuzzy similarity relation. Thus, the attributes need not be discretized prior to attribute reduction (Wang, Qian, Ding, & Fan, 2021; Yang, Chen, Li, Zhang & Luo, 2022). Fuzzy similarity relations are used to calculate the membership grades and then the approximations. More fuzziness indicates that datasets hold better information (Mac Parthaláin, Jensen, & Diao, 2019; Murofushi, Sugeno, et al., 2000). Fuzzy C-Means (Bezdek, Ehrlich, & Full, 1984) has been effectively applied to compute the membership values from a dataset (Suganya & Shanthi, 2012). Further, fuzzy C-means was extended to kernelized fuzzy C-means by using a kernel function (Liu & Xu, 2008). Here, a dataset which exists in a low dimensional space can be mapped to a higher dimensional space using an inner product, which amounts to transforming a non-linear problem into a linear one. This helps in solving the problem rather easily, as uncertainty can be handled at two levels by averting noise. Moreover, the kernelized intuitionistic fuzzy c-means notion was proved to handle noise and uncertainty much more effectively than the kernelized fuzzy c-means idea. In recent years, a few research articles have presented effective methods to deal with the redundant size and the redundant and/or irrelevant dimensions available in high-dimensional imbalanced datasets. The synthetic minority oversampling technique with fuzzy rough set theory (SMOTE-FRST) (Ramentol et al., 2012) uses one threshold to eliminate unsuitable instances. But this has been proved inadequate for instance elimination; for example, if one takes a threshold γ of 0.7 in the HCVB problem, most of the majority instances get removed. And if we lower this threshold γ, then many unsuitable minority class examples are retained, and hence the quality of the synthesized examples is also decreased. Although the synthesized instances reduce the model quality, they should be as close as possible to the minority class. To handle learning with imbalanced data affected by such problems, SMOTE-FRST-2T (Ramentol et al., 2016) was introduced by using two thresholds with SMOTE-FRST. It first populates the minority class and then uses two thresholds to remove irrelevant instances of the synthetic minority class and the majority class; that is, it uses two different thresholds for eliminating examples from the training set. However, this method is insufficient to deal with later uncertainty, where uncertainties emerge due to both judgement and identification, as it was based on fuzzy set theory.

The concept of intuitionistic fuzzy sets was proposed by Atanassov and Atanassov (1999), where non-membership and hesitancy degrees were computed along with membership values to deal with later uncertainty. An intuitionistic fuzzy set deals with uncertainty by using three functions, namely membership, non-membership, and hesitancy. Thus, it handles uncertainty in a far better way than a fuzzy set. Approximations were defined over the intuitionistic fuzzy framework (Cornelis, De Cock, & Kerre, 2003) to present the intuitionistic fuzzy rough set model. This method has been proved to be successful in handling the uncertainty available in real-world problems. By combining rough sets with intuitionistic fuzzy sets, a few instance and feature selection methods were introduced (Olvera-López, Carrasco-Ochoa, Martínez-Trinidad, & Kittler, 2010; Tiwari, Shreevastava, Som, & Shukla, 2018). Moreover, Jain, Tiwari, and Som (2022) established a bireduct method to handle intuitionistic fuzzy information systems. However, none of the research articles have attempted to cope with imbalanced datasets in the intuitionistic fuzzy framework.

1.5. Contribution of the study

This study deals with the different issues related to large volumes of imbalanced datasets. Initially, a new kernelized intuitionistic fuzzification for real-valued datasets is discussed. Then, a novel similarity relation is incorporated to establish intuitionistic fuzzy rough instance and feature selection processes by using the notion of different classes' ratio. The proposed models are justified by using both mathematical theorems and experimental results. A comprehensive comparative study with existing models proves the dominance of our techniques. The current study handles the class imbalance issue with the help of SMOTE. The idea of different classes' ratio is employed to tackle the noise available in the data. Uncertainty due to both judgement and identification is coped with by incorporating intuitionistic fuzzy set theory. Irrelevancy and redundancy in the features along with redundant size
is handled by using feature and instance selection methods. Intuitionistic fuzzification is the preliminary requirement of this study, as the proposed approach can be applied to intuitionistic fuzzy data only. Kernelized intuitionistic fuzzy c-means is employed for the intuitionistic fuzzification of the real-valued datasets.

Moreover, an interesting imbalanced dataset problem is identified for animal toxin prediction. Animal toxin proteins are described as disulfide-rich small peptides that are perceived in well-known venomous species. They are employed as therapeutic agents and pharmacological tools in the field of medicine to achieve elevated specificity and potency with regard to their targets. An effective method for the analysis and prediction of animal toxin proteins is always found to be essential for therapeutic, drug discovery, and pharmacological research. Biological experimental approaches are reported to be resource and time consuming for the identification of animal toxin peptides. Moreover, different machine learning methods have been presented to reduce the cost and time of predicting animal toxin peptides. However, an efficient computational method is still required to improve the prediction of animal toxin peptides. In the current study, a suitable framework is given to enhance the prediction of animal toxin peptides, which yields better results than all the earlier reported ones.

Highlights of the contribution

(a) Kernelized Fuzzy C-Means is used to generate membership for the real-valued datasets. Then, non-membership and hesitancy are calculated by using the Sugeno negation function (Kumar, Verma, Mehra, & Agrawal, 2019; Murofushi et al., 2000).
(b) A new intuitionistic fuzzy relation based intuitionistic fuzzy rough feature selection is presented based on the idea of different classes' ratio by using the concept of the discernibility matrix.
(c) We apply an extension of SMOTE-FRST-2T as the synthetic minority oversampling technique with intuitionistic fuzzy rough instance selection (SMOTE-IFRST-2T), based on two thresholds and intuitionistic fuzzy cardinality, to clean problematic samples available in both the minority and majority classes of imbalanced datasets.
(d) A comprehensive experimental study is conducted to show the superiority of the proposed technique.
(e) The Friedman (Friedman, 1940) and Dunn tests (Dunn, 1961) are incorporated to justify the dominance of the proposed method over existing methods.
(f) Finally, a framework is introduced based on the above proposed methods to discriminate animal toxin and non-toxin peptides, as presented in Fig. 1.

1.6. The results of the study

This method has successfully discarded noise and later uncertainty due to the utilization of the intuitionistic fuzzy set and different classes' ratio along with parameterized concepts. Noise and uncertainty due to both judgement and identification have been handled at two levels: one at the time of intuitionistic fuzzification and another at the time of data reduction, which leads to a great improvement in the performance measures of learning algorithms during prediction. As imbalanced datasets are very common in the current era of technology, this method can produce effective results to improve prediction in the fields of bioinformatics, signal processing, and biomedical engineering by removing the irrelevant and/or redundant size and features. The considerable specificity and potency exhibited by toxins render them highly valuable in the fields of pharmacology and drug discovery. For instance, snake venoms have been explored for their potential as anticancer agents and anti-hypertensive drugs. The utilization of computational methods for the prediction of animal toxins is imperative due to the high costs and time constraints associated with experimental approaches. Therefore, methods are required to improve the prediction of animal toxins and non-toxins. Discrimination of animal toxins and non-toxins has been enhanced by applying the discussed data reduction technique; the reported results are better when compared to the results of previous studies. Consequently, we can claim that our proposed method performs better than the existing data reduction techniques to date.

2. Literature survey

Fuzzy rough set based techniques have been effectively implemented to perform simultaneous elimination of instances and attributes for real-valued datasets. The currently available fuzzy-rough-set oriented feature reduction techniques can be grouped into three comprehensive categories: discernibility matrix techniques (Dai, Hu, Wu, Qian, & Huang, 2017), fuzzy dependency function oriented algorithmic methods, and fuzzy data entropy oriented algorithmic approaches (Jensen, 2008; Zhang, Mei, Chen, & Li, 2016). In the discernibility matrix aided idea, those samples are considered where discernibilities occur due to differentiation in decision features. In the dependency function based approaches, the degree of dependency of the decision feature over a collection of conditional features is computed. In entropy based feature selection, the granular structure of the fuzzy lower approximation is employed to produce the reduct sets by using different fuzzy similarity relations. In particular, Derrac, Cornelis, García, and Herrera (2012) proposed an approach based on a steady-state genetic algorithm which is incorporated into an attribute selection procedure employing a fuzzy-rough combination. There is a significant amount of work available on choosing attributes using fuzzy rough sets, but much less work exists on the selection of data points (instances). An early examination of data point selection using fuzzy rough sets (Jensen & Cornelis, 2010) chooses the data points whose membership in the fuzzy positive region does not fall below a predetermined limit. A fuzzy rough criterion was utilized to characterize the worth of the data points, and a wrapper technique was put forward to identify the chosen data points, according to the fuzzy rough model approach proposed by Verbiest, Cornelis, and Herrera (2013). To pick appropriate instances for the K-nearest neighbor rule, Tsang, Hu, and Chen (2016) developed an unbiased sampling method. In order to find an appropriate data point set based on the discriminating capacity of the fuzzy granularity rules in a fuzzy decision framework, Zhang, Mei, Chen, and Yang (2018) subsequently suggested a fuzzy-rough-set technique. However, this approach does not take into account the possibility of eliminating noise in the primary data, so it is possible for noise to penetrate into the resulting representative data point set. In addition, using a fuzzy rough set aided concept, Zhang, Mei, Chen, Yang, and Li (2019) suggested a technique to choose a sample of data points from a forthcoming data point set in an evolving setting, which then used the newly arrived symbolic data point set instead of the entire forthcoming data point set to perform gradual selection of attributes, in order to minimize estimation in terms of both time and complexity. It is important to note that some fuzzy-rough-set-based research on synchronous attribute and data point selection concentrates predominantly on sophisticated optimization strategies. A sophisticated optimization method is a type of optimization method that models natural occurrences and patterns through population-based iterations (Jain et al., 2022). Anaraki, Samet, Lee, and Ahn (2015) suggested an additive attribute and sample selection scheme using fuzzy rough set theory and a shuffled frog leaping method. The concept of a bireduct is a further development of the idea of a reduct in rough sets (Ślęzak & Janusz, 2011). A bireduct is believed to be a type of classification algorithm that incorporates both data points and features. In Mac Parthaláin and Jensen (2013), a continuation of the research done in Ślęzak and Janusz (2011), the usage of a fuzzy rough set assisted model for both attribute and sample selection was explored, and a technique employing a frequency-based methodology
as a heuristic was designed for deciding between the two. In addition, a harmony-search based technique was presented to identify the bireduct (FBR) using fuzzy rough sets (Mac Parthaláin et al., 2019). Zhang, Mei, Li, Yang, and Qian (2022) established a bi-selection model (CFBR) by integrating both sample and feature reduction techniques, where representative instances and the fuzzy granular structure were employed. Simultaneous feature and instance selection methods combining intuitionistic fuzzy and rough sets are at a nascent stage, as not much work has been discussed in the literature. Recently, Jain et al. (2022) introduced a bireduct model by incorporating an intuitionistic fuzzy rough set theory, which introduced a novel outlier elimination method to handle the uncertainty due to both judgement and identification.

3. Preliminaries

3.1. SMOTE

In this section we discuss the traditional SMOTE (Chawla, Bowyer, Hall, & Kegelmeyer, 2002) used to tackle imbalanced dataset classification. The challenge of training any machine learning technique on imbalanced datasets is that the minority class is disregarded and a good performance is not produced, as the training gets biased towards the majority class/classes, resulting in majority class classifiers (Nath & Subbiah, 2015). The most straightforward approach to handling an imbalanced dataset is to increase the minority class samples, i.e., oversampling of the minority class. The easiest approach is to repeat the entries of the minority class; however, the newly added entries do not add any valuable or new information.

Another approach is to create new entries using the original minority class entries. This was implemented by the Synthetic Minority Oversampling Technique, abbreviated as SMOTE, which has proven to be very effective. This type of data augmentation is a far better improvement over repeating/duplicating entries from the minority class.

To be more specific, SMOTE selects a random data entry from the minority class. Next, some nearest neighbors of that data point are established (Mukherjee & Khushi, 2021). Then a point is randomly selected from these neighbors, and a data point is synthesized by randomly choosing a point on the line between the data point and the selected neighbor in the feature space. Artificial samples are generated based on the following equation:

$P_i' = P_i + \Lambda \times (P_j - P_i)$

where $P_i'$ is the new example, $P_i$ is a positive (minority) example, $P_j$ is a positive example (a k-nearest neighbor of $P_i$), and $\Lambda$ is a random value in $[0, 1]$.
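The interpolation above is straightforward to implement. Below is a minimal sketch of SMOTE-style synthetic sample generation in Python with NumPy; the function name `smote_sample` and the parameter `k` are illustrative, not part of the original SMOTE implementation.

```python
import numpy as np

def smote_sample(minority, k=5, rng=np.random.default_rng(0)):
    """Generate one synthetic sample by SMOTE-style interpolation.

    minority: (n, d) array of minority-class feature vectors.
    A random minority point P_i is picked, one of its k nearest
    minority neighbors P_j is chosen, and a point on the segment
    P_i + lam * (P_j - P_i), lam ~ U[0, 1], is returned.
    """
    i = rng.integers(len(minority))
    p_i = minority[i]
    # Euclidean distances from P_i to every other minority point.
    dist = np.linalg.norm(minority - p_i, axis=1)
    dist[i] = np.inf                      # exclude P_i itself
    neighbors = np.argsort(dist)[:k]      # indices of the k nearest neighbors
    p_j = minority[rng.choice(neighbors)]
    lam = rng.random()                    # random gap in [0, 1]
    return p_i + lam * (p_j - p_i)

# Example: oversample a toy 2-D minority class.
minority = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]])
synthetic = np.array([smote_sample(minority) for _ in range(4)])
print(synthetic)
```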
3.2. Kernelized Fuzzy C-means clustering

Kernelized Fuzzy C-Means clustering (Zhang & Chen, 2004) is an extension of Fuzzy C-means clustering. Kernel based machine learning models have proved to be very powerful and successful in many applications such as natural language processing, pattern recognition, and feature selection (Menchetti, Costa, Frasconi, & Pontil, 2005; Nath, 2021). The main idea behind kernel based algorithms is the kernel substitution trick, which involves mapping the data space to an attribute space. The dataset, which lies in a low dimensional space, is mapped to a higher dimensional feature space using an inner product; this is one kind of conversion of a non-linear problem into a linear problem, and it helps in solving the problem rather easily for datasets with high dimensions. Interested researchers can refer to Ding and Fu (2016) to understand kernelized fuzzy c-means clustering in detail.
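As a concrete illustration of the kernel substitution trick, with a Gaussian (RBF) kernel the squared distance between a point and a cluster center in the implicit feature space can be written purely in terms of kernel evaluations: ‖φ(x) − φ(v)‖² = K(x, x) + K(v, v) − 2K(x, v) = 2(1 − K(x, v)). A minimal sketch under that assumption (the function names are illustrative, not from the cited papers):

```python
import numpy as np

def rbf_kernel(x, v, sigma=1.0):
    """Gaussian kernel K(x, v) = exp(-||x - v||^2 / (2 sigma^2))."""
    return np.exp(-np.sum((x - v) ** 2) / (2.0 * sigma ** 2))

def feature_space_sq_dist(x, v, sigma=1.0):
    """||phi(x) - phi(v)||^2 computed without ever forming phi:
    K(x,x) + K(v,v) - 2 K(x,v) = 2 (1 - K(x,v)) for the RBF kernel."""
    return 2.0 * (1.0 - rbf_kernel(x, v, sigma))

x = np.array([0.2, 0.4])
v = np.array([0.3, 0.1])
print(feature_space_sq_dist(x, v))  # distance in the implicit feature space
```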
3.3. Random forest

Multiple learners are trained in a suitable way by using ensemble learning methods (Breiman, 2001); then, their results are combined to obtain the solution for a given problem. It is observed that ensemble learning techniques generally perform better than a single classifier, as they can effectively handle the noise and overfitting issues available in the data (Nath & Sahu, 2019). The working steps for the AdaBoost random forest are described in Algorithm 1.

Algorithm 1 Adaboost_RandomForest(A)
Input: training set A = {(a1, b1), (a2, b2), (a3, b3), ..., (an, bn)}, where A contains n element pairs;
       training instances ai; labelled classes bi ranging over {-1, +1};
       distribution dj; specified number of iterations P
Output: final hypothesis hf
 1: Begin
 2: for i <- 1 to n do
 3:     w1(i) := 1/n
 4: for j <- 1 to P do
 5:     for i <- 1 to n do
 6:         dj(i) := wj(i) / sum_{i=1..n} wj(i)
 7:     hj := Learn(A, dj)   // see Algorithm 2
 8:     epsj := sum_i dj(i) [hj(ai) != bi]
 9:     betaj := epsj / (1 - epsj)
10:     for i <- 1 to n do
11:         wj+1(i) := wj(i) * betaj^(1 - [hj(ai) != bi])
12: hf(a) := argmax_b sum_{j=1..P} (log 1/betaj) [hj(a) = b]
13: return hf
14: End
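The paper evaluates an AdaBoost-boosted random forest; a minimal, hedged sketch of that combination with scikit-learn is shown below. scikit-learn is our assumption for illustration (the authors report using Weka), and in scikit-learn ≥ 1.2 the weak learner is passed via `estimator` (older versions use `base_estimator`).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

# A small imbalanced toy problem standing in for the paper's datasets.
X, y = make_classification(n_samples=300, n_features=10, weights=[0.8, 0.2],
                           random_state=0)

# AdaBoost with a shallow random forest as the weak learner (Algorithm 1's Learn).
clf = AdaBoostClassifier(
    estimator=RandomForestClassifier(n_estimators=10, max_depth=3, random_state=0),
    n_estimators=20,  # P boosting iterations
    random_state=0,
)
print(cross_val_score(clf, X, y, cv=10).mean())  # 10-fold CV accuracy
```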
4. Proposed SMOTE-IFRST-2T algorithm

In this section, we discuss the intuitionistic fuzzification of real-valued datasets. Then, an intuitionistic fuzzy rough set model is investigated. Here, the positive region is described on the basis of the lower approximation. Next, a discernibility matrix assisted feature selection method is presented. Moreover, the positive region is utilized to eliminate redundant instances by using two thresholds.

Step 1: Intuitionistic fuzzification of data

The intuitionistic fuzzy c-means clustering algorithm is used to find the membership function values for real-valued attributes. The following equations are used to define the cluster centers, distance metric, and membership degree respectively. Let $c$, $C_i$, $\mu_{c_i}$, $\nu_{c_i}$, and $\pi_{c_i}$ be the number of clusters considered, the $i$th cluster center, the membership value of the $i$th cluster center, the non-membership value of the $i$th cluster center, and the hesitancy value of the $i$th cluster center respectively. Then,

$C_i = [\mu_{c_i}, \nu_{c_i}, \pi_{c_i}]$  (1)

$\mu_{c_i} = \dfrac{\sum_{j=1}^{N} u_{ij}^{m}\, \mu_{x_j}}{\sum_{j=1}^{N} u_{ij}^{m}}, \quad 1 \le i \le c$  (2)

$\nu_{c_i} = \dfrac{\sum_{j=1}^{N} u_{ij}^{m}\, \nu_{x_j}}{\sum_{j=1}^{N} u_{ij}^{m}}, \quad 1 \le i \le c$  (3)

$\pi_{c_i} = \dfrac{\sum_{j=1}^{N} u_{ij}^{m}\, \pi_{x_j}}{\sum_{j=1}^{N} u_{ij}^{m}}, \quad 1 \le i \le c$  (4)

Let $d_{ij}$ and $x_j$ depict the Euclidean distance and the $j$th data instance; then

$d_{ij} = \lVert C_i - x_j \rVert = \sqrt{(\mu_{c_i} - \mu_{x_j})^2 + (\nu_{c_i} - \nu_{x_j})^2 + (\pi_{c_i} - \pi_{x_j})^2}$  (5)

$u_{ij} = \dfrac{1}{\sum_{k=1}^{c} \left( d_{ij} / d_{kj} \right)^{\frac{2}{m-1}}}$  (6)

Here, $u_{ij}$ represents the membership of the $j$th data instance to the $i$th cluster, and $m$ depicts the weighting exponent.
a single attribute. Hence, the time complexity for all the attributes is $O(N \times |U| \times C^2 \times |A|)$. This algorithm is performed with a space complexity of $O(|U| \times C)$.
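For concreteness, a minimal sketch of one update sweep of the intuitionistic fuzzy c-means quantities in Eqs. (1)-(6) is given below (NumPy; the function name is illustrative). Each data instance $x_j$ is assumed to already carry an intuitionistic triple $(\mu, \nu, \pi)$.

```python
import numpy as np

def ifcm_update(X, U, m=2.0):
    """One IFCM sweep: X is (N, 3) with columns (mu, nu, pi) per instance,
    U is (c, N) with u_ij = membership of instance j in cluster i.
    Returns updated cluster centers (c, 3) and memberships (c, N),
    following Eqs. (1)-(6)."""
    W = U ** m
    centers = (W @ X) / W.sum(axis=1, keepdims=True)        # Eqs. (2)-(4)
    # Eq. (5): Euclidean distance between each center and each instance.
    d = np.sqrt(((centers[:, None, :] - X[None, :, :]) ** 2).sum(-1))
    d = np.maximum(d, 1e-12)                                 # avoid divide-by-zero
    # Eq. (6): u_ij = 1 / sum_k (d_ij / d_kj)^(2/(m-1))
    ratio = (d[:, None, :] / d[None, :, :]) ** (2.0 / (m - 1.0))
    U_new = 1.0 / ratio.sum(axis=1)
    return centers, U_new

rng = np.random.default_rng(0)
X = rng.random((6, 3)); X /= X.sum(1, keepdims=True)  # toy (mu, nu, pi) triples
U = rng.random((2, 6)); U /= U.sum(0, keepdims=True)  # two clusters
centers, U = ifcm_update(X, U)
print(centers)
```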
Step 2: Calculating non-membership values from membership values

The non-membership values can be generated from the membership values using the formula given by Murofushi et al. (2000). Let $\mu$ be the membership value as computed by the above concept; then the non-membership value can be obtained as:

$\nu = \dfrac{1 - \mu}{1 + \alpha\,\mu}, \quad \alpha > 0$  (7)

Here, $\alpha$ is a fuzzy negation parameter, which is used to identify the hesitancy in the dataset. Its value is computed with a grid search by identifying a step value, which depends on the distribution of the conditional feature values. This parameter can be applied to form a subclass of the general fuzzy step, which consists of characteristics that can be effectively specified with an assignment of density to every singleton genesis of information.
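A small worked sketch of Step 2: applying the Sugeno negation of Eq. (7) and the hesitancy degree π = 1 − μ − ν (the helper name is illustrative).

```python
def sugeno_triple(mu, alpha=2.8):
    """Return (mu, nu, pi) with nu = (1 - mu) / (1 + alpha * mu) per Eq. (7)
    and pi = 1 - mu - nu. alpha=2.8 mirrors the value used in the experiments."""
    nu = (1.0 - mu) / (1.0 + alpha * mu)
    pi = 1.0 - mu - nu
    return mu, nu, pi

print(sugeno_triple(0.7))  # e.g. mu=0.7 -> nu ~ 0.101, pi ~ 0.199
```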
4.1. Intuitionistic fuzzy rough feature selection

In this segment, we present a different classes' ratio based approach to find the reduct set in an intuitionistic fuzzy information system. In this method, we make use of the similarity between instances for a given set of attributes to find the reduct set. Intuitionistic fuzzy sets, as we know, consist of a membership grade as well as a non-membership grade of a particular element, which denote the degrees to which the element belongs and does not belong to a particular set, respectively. Membership is computed by using kernelized fuzzy c-means; then, non-membership is computed by using the Sugeno negation function. Next, hesitancy is determined by subtracting membership and non-membership from 1. Using this notion, we have extended the fuzzy rough set based feature selection technique to apply feature selection in intuitionistic fuzzy information systems (Huang, Zhuang, Li, & Wei, 2013; Tan et al., 2018). Firstly, the kernelized fuzzy C-means algorithm is applied to produce membership for the real-valued feature values, and non-membership is computed from the Sugeno negation function. Secondly, a similarity relation based intuitionistic fuzzy rough set model is presented. Based on the lower approximation, the positive region is defined. Thereafter, a discernibility matrix is defined based on the definition of different classes' ratio to obtain the reduct set, which also leads to the removal of noise. The algorithm for the feature selection process is given in Algorithm 4.
Definition 1. Let $U$, $\mathcal{A}$, and $\mathcal{D}$ be the non-empty finite collection of intuitionistic fuzzy samples, the set of non-empty finite conditional features, and the decision features respectively. Here, $U = \{x_1, x_2, \ldots, x_n\}$ and $\mathcal{A} = \{\tilde{a}_1, \tilde{a}_2, \ldots, \tilde{a}_m\}$. An intuitionistic fuzzy sample is given as $\{\langle x, \mu_A(x), \nu_A(x) \rangle, x \in U\}$, where the membership grade is defined as $\mu_A(x): U \to [0,1]$ and the non-membership grade as $\nu_A(x): U \to [0,1]$ for any instance, such that $0 \le \mu_A(x) + \nu_A(x) \le 1$. Moreover, the hesitancy degree of a sample $x$ is given by $\pi_A(x) = 1 - \mu_A(x) - \nu_A(x)$, which satisfies $0 \le \pi_A(x) \le 1$, $\forall x \in U$. Now, $\mathcal{D}$ partitions the samples of $U$ into $r$ crisp equivalence classes $r_i$, which can be given as follows:

$r_i(x_i) = \left[ \dfrac{|\mu_{[x_i]} \cap r_i|}{|\mu_{[x_i]}|},\ \dfrac{|\nu_{[x_i]} \cap r_i|}{|\nu_{[x_i]}|} \right], \quad (1 \le i \le r)$  (8)

Definition 2. An intuitionistic fuzzy similarity relation $R_d$ on $U$ can be defined as follows:

$R_d(x_i, x_j) = \sum_{i=1}^{n} \sum_{j=1}^{n} w_i w_j \dfrac{\mu_{\tilde{a}}(x_i)\,\mu_{\tilde{a}}(x_j) + \nu_{\tilde{a}}(x_i)\,\nu_{\tilde{a}}(x_j)}{\sqrt{\mu_{\tilde{a}}^2(x_i) + \nu_{\tilde{a}}^2(x_i)}\,\sqrt{\mu_{\tilde{a}}^2(x_j) + \nu_{\tilde{a}}^2(x_j)}}, \quad \forall x_i, x_j \in U,\ \tilde{a} \in \mathcal{A},$

where $w_i, w_j \in [0,1]$, $i = 1, 2, \ldots, n$, $j = 1, 2, \ldots, n$, and $\sum_{i=1}^{n} w_i^2 = 1$ or $\sum_{j=1}^{n} w_j^2 = 1$. Here, $w_i$ and $w_j$ are the weighted parameters, which are used to identify the redundant dimension and size in the dataset. These parameters are used to reduce the computational complexity for high-dimensional, large-volume datasets. Moreover, $w_i$ and $w_j$ can be adjusted to outline the noise to a certain extent for different domains.

Proposition 1. $\forall x_i, x_j \in U$, $0 \le R_d(x_i, x_j) \le 1$.

Proof. This proposition is obviously true based on the cosine value.

Proposition 2. $\forall x_i, x_j \in U$, $R_d(x_i, x_j) = R_d(x_j, x_i)$.

Proof. $R_d(x_i, x_j) = \sum_{i=1}^{n}\sum_{j=1}^{n} w_i w_j \frac{\mu_{\tilde a}(x_i)\mu_{\tilde a}(x_j) + \nu_{\tilde a}(x_i)\nu_{\tilde a}(x_j)}{\sqrt{\mu_{\tilde a}^2(x_i)+\nu_{\tilde a}^2(x_i)}\sqrt{\mu_{\tilde a}^2(x_j)+\nu_{\tilde a}^2(x_j)}} = \sum_{j=1}^{n}\sum_{i=1}^{n} w_j w_i \frac{\mu_{\tilde a}(x_j)\mu_{\tilde a}(x_i) + \nu_{\tilde a}(x_j)\nu_{\tilde a}(x_i)}{\sqrt{\mu_{\tilde a}^2(x_j)+\nu_{\tilde a}^2(x_j)}\sqrt{\mu_{\tilde a}^2(x_i)+\nu_{\tilde a}^2(x_i)}} = R_d(x_j, x_i).$

Proposition 3. $\forall x_i, x_j \in U$, $R_d(x_i, x_i) = R_d(x_j, x_j) = 1$.

Proof. For $x_j = x_i$,
$R_d(x_i, x_i) = \sum_{i=1}^{n} w_i^2 \dfrac{\mu_{\tilde a}^2(x_i) + \nu_{\tilde a}^2(x_i)}{\mu_{\tilde a}^2(x_i) + \nu_{\tilde a}^2(x_i)} = \sum_{i=1}^{n} w_i^2 = 1.$
Similarly, for $x_i = x_j$, $R_d(x_j, x_j) = \sum_{j=1}^{n} w_j^2 = 1$. Hence, $R_d(x_i, x_i) = R_d(x_j, x_j) = 1$.

Definition 3. $\forall x_i, x_j \in U$, we can calculate an intuitionistic fuzzy Euclidean distance $s(x_i, x_j)$ between $x_i$ and $x_j$. Let $x_i \in U$ be a sample that represents the center of a circle whose radius is computed as $h = 1 - R_d(x_i, x_j)$, taken as the largest such value over all $x_i, x_j \in U$. For any $x_i \in r_i$, the collection of all the samples in the circle that belong to the same class is depicted by $J = \{x_j \mid s(x_i, x_j) \le h\}$. Similarly, the collection of all the samples in the circle that belong to different classes is described as $K = \{z \mid s(x_i, z) \le h,\ z \notin r_i\}$. Then, we can define the different classes' ratio by:

$DCR(x_i) = \dfrac{|J|}{|K|}$, where $|\cdot|$ is the intuitionistic fuzzy cardinality.

Definition 4. Let $U$, $R_d$, and $F(U)$ be the non-empty finite set of instances, an intuitionistic fuzzy relation on $U$, and the intuitionistic fuzzy power set of $U$ respectively. Take any threshold $\xi \in [0,1]$ and a radius $h$ as discussed in Definition 3, and let $\eta$, $S$, and $T$ be the standard negator, an intuitionistic fuzzy triangular conorm, and an intuitionistic fuzzy triangular norm respectively. Then, the intuitionistic fuzzy lower and upper approximations based on the different classes' ratio can be defined as follows:

$\underline{R}^d_S\, r_i(x_i) = \inf_{x_j \in U} S\left( \eta\left(R_d(x_i, x_j)\right),\ 1 - DCR(x_j) \right), \quad \forall x_i \in U.$

$\overline{R}^d_S\, r_i(x_i) = \sup_{x_j \in U} T\left( R_d(x_i, x_j),\ DCR(x_j) \right), \quad \forall x_i \in U.$

Moreover, the different classes' ratio, $\forall x_j \in U$, can be further generalized by:

$\widetilde{DCR}(x_j) = \begin{cases} \xi, & DCR(x_j) \le \xi \\ 1 - \xi, & \text{otherwise} \end{cases}$  (9)
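A minimal sketch of Definition 2's cosine-style similarity and the different classes' ratio of Definition 3, written for a single attribute (all names are illustrative; the weights $w_i$ are folded out for readability, and the intuitionistic fuzzy cardinality is simplified to a plain count):

```python
import numpy as np

def if_similarity(mu, nu):
    """Pairwise cosine-style similarity between (mu, nu) pairs (Definition 2),
    for one attribute; returns an (n, n) matrix with ones on the diagonal."""
    V = np.stack([mu, nu], axis=1)                  # (n, 2) vectors
    norms = np.linalg.norm(V, axis=1)
    return (V @ V.T) / np.outer(norms, norms)

def dcr(sim, labels, i, h):
    """Different classes' ratio for sample i (Definition 3), using the
    distance s = 1 - sim and radius h; |.| is simplified to a count."""
    s = 1.0 - sim[i]
    inside = s <= h
    same = np.sum(inside & (labels == labels[i]))   # the set J
    diff = np.sum(inside & (labels != labels[i]))   # the set K
    return same / diff if diff else float("inf")

mu = np.array([0.9, 0.8, 0.2, 0.3])
nu = np.array([0.05, 0.1, 0.6, 0.5])
labels = np.array([1, 1, 0, 0])
sim = if_similarity(mu, nu)
print(dcr(sim, labels, i=0, h=0.3))
```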
Here, $\xi$ is used to discard the noise available in the samples. Now, we can generalize the lower and upper approximations of a decision class $r_i$ with respect to the attribute set $\mathcal{A}$ as follows:

$\underline{R}^d r_i(x_j) = \begin{cases} \left[ \min_{x_i \in U} \max\left( \eta\left(\mu_{R_d}(x_i, x_j)\right),\ \widetilde{DCR}(x_i) \right),\ \max_{x_i \in U} \min\left( \eta\left(\nu_{R_d}(x_i, x_j)\right),\ \widetilde{DCR}(x_i) \right) \right], & x_j \in r_i \\ [0, 1], & \text{otherwise} \end{cases}$

$\overline{R}^d r_i(x_j) = \begin{cases} \left[ \max_{x_i \in U} \min\left( \eta\left(1 - \mu_{R_d}(x_i, x_j)\right),\ \widetilde{DCR}(x_i) \right),\ \min_{x_i \in U} \max\left( \eta\left(1 - \nu_{R_d}(x_i, x_j)\right),\ \widetilde{DCR}(x_i) \right) \right], & x_j \in r_i \\ [0, 1], & \text{otherwise} \end{cases}$

Moreover, we can easily conclude that $\underline{R}^d r_i(x_j)$ indicates the degree of certainty with which sample $x_j$ is associated with category $r_i$, and $\overline{R}^d r_i(x_j)$ depicts the possibility of $x_j$ being associated with category $r_i$. Next, the intuitionistic fuzzy positive region can be computed by using the definition of the lower approximation as follows:

$POS^d(x_j) = \left[ \max_i \mu_{\underline{R}^d r_i}(x_j),\ \min_i \nu_{\underline{R}^d r_i}(x_j) \right]$  (10)

where $\mu_{\underline{R}^d r_i}(x_j)$ and $\nu_{\underline{R}^d r_i}(x_j)$ are contained in $\underline{R}^d r_i(x_j)$.
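A hedged sketch of Eq. (10): given lower-approximation membership and non-membership grades per decision class, the positive region takes the best membership and the corresponding least non-membership per sample (function name illustrative).

```python
import numpy as np

def positive_region(lower_mu, lower_nu):
    """Eq. (10): lower_mu and lower_nu are (r, n) arrays holding, for each of
    the r decision classes, the lower-approximation membership and
    non-membership of the n samples. Returns the intuitionistic pair
    (max_i mu, min_i nu) per sample."""
    return lower_mu.max(axis=0), lower_nu.min(axis=0)

lower_mu = np.array([[0.8, 0.2], [0.3, 0.7]])   # two classes, two samples
lower_nu = np.array([[0.1, 0.6], [0.5, 0.2]])
pos_mu, pos_nu = positive_region(lower_mu, lower_nu)
print(pos_mu, pos_nu)  # [0.8 0.7] [0.1 0.2]
```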
Proposition 4. For all $A \in F(U)$, the following statements hold:
$\underline{R}^d A = \eta\left( \overline{R}^d (\eta(A)) \right)$ and $\overline{R}^d A = \eta\left( \underline{R}^d (\eta(A)) \right)$.

Proof. For all $x_i \in U$,

$\eta\big(\overline{R}^d(\eta(A))\big)(x_i) = \eta\Big( \sup_{x_j \in U} T\big(R_d(x_j, x_i),\ \widetilde{DCR}(x_j)\big) \Big) = \inf_{x_j \in U} \eta\Big( T\big(R_d(x_j, x_i),\ \widetilde{DCR}(x_j)\big) \Big) = \inf_{x_j \in U} S\big(\eta(R_d(x_j, x_i)),\ 1 - \widetilde{DCR}(x_j)\big) = \underline{R}^d A(x_i).$

$\eta\big(\underline{R}^d(\eta(A))\big)(x_i) = \eta\Big( \inf_{x_j \in U} S\big(\eta(R_d(x_j, x_i)),\ 1 - \widetilde{DCR}(x_j)\big) \Big) = \sup_{x_j \in U} \eta\Big( S\big(\eta(R_d(x_j, x_i)),\ 1 - \widetilde{DCR}(x_j)\big) \Big) = \sup_{x_j \in U} T\big(R_d(x_j, x_i),\ \widetilde{DCR}(x_j)\big) = \overline{R}^d A(x_i).$

Proposition 5. $\forall A \in F(U)$, if $x_i$ is a normal sample, the following statement holds: $\underline{R}^d A \subseteq A \subseteq \overline{R}^d A$.

Proof. The inclusion $\underline{R}^d A \subseteq A$ follows by an argument dual to the one given below for the upper approximation; thus, $\underline{R}^d A \subseteq A$. For the upper approximation,

$\overline{R}^d A(x_i) = \sup_{x_j \in U} T\big( R_d(x_i, x_j),\ \widetilde{DCR}(x_j) \big) \ge T\big( R_d(x_i, x_i),\ \widetilde{DCR}(x_i) \big).$

As $x_i$ is a normal sample, $x_i$ can be used to compute $\overline{R}^d A(x_i)$, and

$T\big( R_d(x_i, x_i),\ \widetilde{DCR}(x_i) \big) = T(1, A(x_i)) = A(x_i).$

Thus, $A \subseteq \overline{R}^d A$. Accordingly, $\underline{R}^d A \subseteq A \subseteq \overline{R}^d A$ holds.

Proposition 6. Suppose $A \subseteq U$ is a set; then the following statements hold:
$\underline{R}^d A \supseteq \underline{R}^d_S A$ and $\overline{R}^d A \subseteq \overline{R}^d_T A$,
where $\underline{R}^d_S$ and $\overline{R}^d_T$ denote the approximations of Definition 4.

Proof. For all $x_i \in U$,

$\underline{R}^d A(x_i) = \inf_{\widetilde{DCR}(x_j) \le \xi} S\big( \eta(R_d(x_i, x_j)),\ 1 - \widetilde{DCR}(x_j) \big) \ge \inf_{x_j \in U} S\big( \eta(R_d(x_i, x_j)),\ \eta(F(x_j)) \big) = \underline{R}^d_S A(x_i).$

Thus, $\underline{R}^d A \supseteq \underline{R}^d_S A$. Similarly,

$\overline{R}^d A(x_i) = \sup_{\widetilde{DCR}(x_j) \le \xi} T\big( R_d(x_i, x_j),\ \widetilde{DCR}(x_j) \big) \le \sup_{\widetilde{DCR}(x_j) \le \xi} T\big( R_d(x_i, x_j),\ F(x_j) \big) = \overline{R}^d_T A(x_i).$

Thus, $\overline{R}^d A \subseteq \overline{R}^d_T A$.

The intuitionistic fuzzy relation for a given subset $P \subseteq \mathcal{A}$ can be defined as:

$R^d_P = \bigcap_{a_g \in P} R^d_{\{a_g\}}$  (11)

Definition 5. The binary relation $DIS^k(R^d_{\{a_g\}})$ can be defined as the relative discernibility relation of a conditional attribute $a_g$ with respect to a given decision attribute $D$ in the IFRS model, which can be calculated by:

$DIS^k(R^d_{\{a_g\}}) = \left\{ (x_i, x_j) \in U \times U \mid 1 - R^d_{\{a_g\}}(x_i, x_j) \ge \lambda_i,\ x_j \notin [x_i]_D \right\}$  (12)

where $\lambda_i = \underline{R}^d [x_i]_D (x_i)$.

Let $DIS^k(R^d) = \bigcup_{a_g \in \mathcal{A}} DIS^k(R^d_{\{a_g\}})$, and it is obvious that
Algorithm 4 Intuitionistic fuzzy rough quick reduct based on kernelized fuzzy c-means and different classes' ratio
Input: set of all conditional features; set of decision features; reduct set R; collection of samples U
Output: R
 1: Begin
 2: Compute intuitionistic fuzzy data for the real-valued data using kernelized fuzzy c-means followed by the Sugeno negation function
 3: Calculate the intuitionistic fuzzy relation R^d and the different classes' ratio DCR~
 4: Calculate lambda_i for all x_i in U
 5: R <- {}
 6: T <- R
 7: Compute DIS^k(R^d_{a_g}) and DIS^k(R^d)
 8: Sort (x_i, x_j) in DIS^k(R^d_{a_g}) according to Y_ij
 9: Select the (x_i, x_j) in DIS^k(R^d)
10: Select an attribute a_g such that (x_i, x_j) in DIS^k(R^d_{a_g}) and add it to T
11: Compute DIS^k(R^d) = DIS^k(R^d) - DIS^k(R^d_{a_g})
12: R <- T
13: Until DCR~ != empty
14: return R
15: End
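Conceptually, steps 7-13 of Algorithm 4 amount to a greedy cover of the discernibility pairs: at each pass, an attribute that discerns remaining pairs is added to the reduct until no pairs are left. A compact, hedged sketch of that loop is given below; the data structures are illustrative, not the authors' implementation.

```python
def greedy_reduct(dis_pairs_per_attr):
    """dis_pairs_per_attr: dict attribute -> set of discernibility pairs
    (the DIS^k(R^d_{a_g}) sets of Definition 5). Greedily pick attributes
    until every pair in DIS^k(R^d) is covered."""
    remaining = set().union(*dis_pairs_per_attr.values())  # DIS^k(R^d)
    reduct = []
    while remaining:
        # attribute covering the most still-uncovered pairs
        best = max(dis_pairs_per_attr,
                   key=lambda a: len(dis_pairs_per_attr[a] & remaining))
        if not dis_pairs_per_attr[best] & remaining:
            break                     # no attribute discerns the rest
        reduct.append(best)
        remaining -= dis_pairs_per_attr[best]
    return reduct

dis = {"a1": {(0, 1), (0, 2)}, "a2": {(1, 2)}, "a3": {(0, 2), (1, 2)}}
print(greedy_reduct(dis))  # -> ['a1', 'a2']
```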
4.2. Intuitionistic fuzzy rough instance selection

In this section, we introduce a new method to discard irrelevant instances available in both the minority and majority classes of a given real-valued imbalanced dataset. We start by examining our dataset to check whether it is imbalanced. Then, SMOTE is applied to populate the minority class with synthetic data instances. Further, the above positive region of the intuitionistic fuzzy rough set is extended to incorporate two thresholds by considering the cardinality of the intuitionistic fuzzy set, so as to eradicate problematic minority and majority examples. This method is described as synthetic minority oversampling with intuitionistic fuzzy rough sets based on two thresholds (SMOTE-IFRS-2T). SMOTE-IFRS-2T consists of the following steps:

• Convert the real-valued dataset into an intuitionistic fuzzy information system by using kernelized fuzzy c-means followed by the Sugeno negation function.
• Compute the cardinality of every intuitionistic fuzzy value.
• Populate the minority class of the dataset with artificial examples.
• Insert the synthesized minority instances whose cardinality value with respect to the intuitionistic fuzzy-rough positive region POS^d(x_s) of our dataset is more than the threshold γ_s.
• Remove the synthesized minority instances and the majority instances whose cardinality values with respect to the intuitionistic fuzzy-rough positive region POS^d(x_m) of our dataset fall below the respective thresholds γ_s and γ_m.

γ_s is used to deal with the insertion of artificial minority class instances during data balancing: artificial samples are included only while the cardinality of the intuitionistic fuzzy rough positive region is more than γ_s, to minimize the inclusion of redundant samples during data balancing. γ_m is used to handle the exclusion of redundant instances during the instance selection process for the balanced dataset. Here, only those samples from both the minority and majority classes are eliminated for which the cardinality of the intuitionistic fuzzy rough positive region is less than γ_m. For highly imbalanced datasets, a smaller value of γ_m ensures that only a few of the majority samples, whose membership grade is weak and whose non-membership grade is stronger with respect to the intuitionistic fuzzy positive region, are removed. T represents the number of iterations, which can be either a predetermined or a generalized value: for a highly imbalanced dataset, it is a predetermined number of steps, whilst for less imbalanced datasets, it is the value at which the dataset reaches an optimally balanced level.

Algorithm 5 SMOTE-IFRST-2T algorithm
Input: threshold for artificial samples γ_s; threshold for majority class instances γ_m; maximum number of possible iterations T; array of majority instances maj[]; array of minority instances min[]; result set rSet; execution number exnNum
Output: rSet
 1: Begin
 2: rSet = min[]
 3: exnNum = 0
 4: n_synt = n_maj = 0
 5: Apply kernelized fuzzy C-means to compute the membership grades
 6: Apply the Sugeno negation function to compute the non-membership grades
 7: Calculate the cardinality of each intuitionistic fuzzy value
 8: isbalance = false
 9: while (exnNum <= T) && !(isbalance) do
10:     Apply SMOTE to generate an array synminlist[] of artificial minority samples
11:     for k <- 1 to synminlist[].len do
12:         Calculate POS(synminlist[k])
13:         if POS(synminlist[k]) >= γ_s
14:             rSet = rSet ∪ synminlist[k]
15:             n_synt++
16:         end if
17:     end for
18:     for l <- 1 to maj[].len do
19:         Calculate POS(maj[l])
20:         if POS(maj[l]) >= γ_m
21:             rSet = rSet ∪ maj[l]
22:             n_maj++
23:         end if
24:     end for
25:     isbalance = (n_maj == (n_synt + min[].len))
26:     exnNum++
27: end while
28: return rSet
29: End

The entire instance selection is described in Algorithm 5. It comprises the computation of the positive region followed by instance selection. During positive region computation, the time complexity is O(|U|² × |A|), and the time complexity for instance selection is O(|U|). Hence, the overall time complexity is O(|U|³ × |A|). The worst-case space complexity is observed to be O(|U| × |A|).

In the current work, the entire data reduction process (both instance and feature selection) is performed by using the 10-fold cross validation technique. Here, the overall dataset is distributed into ten equal chunks. Nine chunks of the dataset are employed for training, whilst the remaining part is employed for testing. The entire process proceeds until each set has been employed for testing, and the final result is computed by taking the average performance over all ten sets, as sketched below.
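A hedged sketch of that protocol with scikit-learn (an assumption for illustration; the authors report using Weka): the reduction is fitted on the nine training folds only, and the held-out fold is used for testing.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

def evaluate_10fold(X, y, reduce_fn, fit_predict_fn):
    """reduce_fn(X_tr, y_tr) -> (X_red, y_red, feat_idx) is a stand-in for the
    proposed instance/feature selection; fit_predict_fn trains a classifier on
    the reduced data and predicts the held-out fold. Returns mean accuracy."""
    accs = []
    for tr, te in StratifiedKFold(n_splits=10, shuffle=True,
                                  random_state=0).split(X, y):
        X_red, y_red, feat_idx = reduce_fn(X[tr], y[tr])  # reduce training folds only
        y_pred = fit_predict_fn(X_red, y_red, X[te][:, feat_idx])
        accs.append(np.mean(y_pred == y[te]))
    return float(np.mean(accs))
```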
In this segment, we present a thorough experimental study, which illustrates the robustness, generalization, and practical verification of our suggested methodology.

5.1. Performance evaluation parameters

In the current study, the evaluation of the machine learning algorithms has been done on the basis of both threshold dependent and threshold independent performance evaluation parameters. These parameters can be computed from the different quadrants of the confusion matrix, which is illustrated as follows:

True Positive (TP): positive class instances which are predicted as positive class instances.
True Negative (TN): negative class instances which are predicted as negative class instances.
False Positive (FP): negative class instances which are predicted as positive class instances.
False Negative (FN): positive class instances which are predicted as negative class instances.

Sensitivity: this parameter represents the percentage of correctly predicted animal toxin peptides, and is computed by:

$Sensitivity = \dfrac{TP}{TP + FN} \times 100$  (13)

Specificity: this parameter gives the percentage of correctly predicted animal non-toxin peptides, and is given by:

$Specificity = \dfrac{TN}{TN + FP} \times 100$  (14)

Accuracy: this metric provides the percentage of correctly predicted animal toxin and animal non-toxin peptides, and is calculated as:

$Accuracy = \dfrac{TP + TN}{TP + TN + FP + FN} \times 100$  (15)

AUC: it demonstrates the value of the area under the receiver operating characteristic (ROC) curve. The closer the value of AUC is to 1, the better the predictor.

MCC: Matthews correlation coefficient is an extensively applied performance parameter for binary classification. An MCC value of 1 indicates that the predictor is the best one, and it is expressed by the following equation:

$MCC = \dfrac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$  (16)

Precision: it is a measure of how many selected items are relevant out of all the items that were selected; it tells us how often the model is correct when it predicts something as positive.

$Precision = \dfrac{TP}{TP + FP}$  (17)

Recall: it is a measure of completeness; it tells us how many of the relevant items the model managed to find from the total pool of relevant items, i.e., how well the model captures all the positive instances.

$Recall = \dfrac{TP}{TP + FN}$  (18)

F-measure: it represents the harmonic mean of precision and recall, a single score that incorporates both. The F-measure takes values between 0 and 1: 0 in the worst-case situation and 1 in the ideal one. The following formula is used to compute it:

$F\text{-}measure = 2 \times \dfrac{Precision \times Recall}{Precision + Recall}$  (19)
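Since all of Eqs. (13)-(19) derive from the confusion-matrix counts, they can be computed in a few lines; a minimal sketch (the function name is illustrative):

```python
import math

def confusion_metrics(tp, tn, fp, fn):
    """Compute Eqs. (13)-(19) from the confusion-matrix counts."""
    sensitivity = 100.0 * tp / (tp + fn)                       # Eq. (13)
    specificity = 100.0 * tn / (tn + fp)                       # Eq. (14)
    accuracy = 100.0 * (tp + tn) / (tp + tn + fp + fn)         # Eq. (15)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))         # Eq. (16)
    precision = tp / (tp + fp)                                 # Eq. (17)
    recall = tp / (tp + fn)                                    # Eq. (18)
    f_measure = 2 * precision * recall / (precision + recall)  # Eq. (19)
    return sensitivity, specificity, accuracy, mcc, precision, recall, f_measure

print(confusion_metrics(tp=90, tn=85, fp=15, fn=10))
```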
All the experiments were conducted in Weka 3.8 (Hall et al., 2009) on hardware with an Intel(R) Core(TM) i5-8265U CPU @ 1.60 GHz (1.80 GHz) and 8.00 GB RAM. We conducted our initial experiments with six imbalanced benchmark datasets from the UCI repository (Blake, 1998); their descriptions are given in Table 1. In this paper, experiments are conducted by using both 10-fold cross validation and 80:20 percentage split methods. We started by applying fuzzy c-means for the fuzzification of the real-valued datasets to implement SMOTE-FRST-2T followed by fuzzy rough feature selection. Further, kernelized fuzzy c-means was applied to elicit the membership grades, and non-membership was evoked by using the Sugeno negation function for the intuitionistic fuzzification of the real-valued benchmark datasets. This process is carried out by using two thresholds, $\epsilon$ and $\alpha$, whose values are taken as 0.7 and 2.8 respectively. Then, we used SMOTE-IFRST-2T followed by intuitionistic fuzzy rough feature selection over this intuitionistic fuzzy information system to construct datasets free of the issues discussed in Section 4. The intuitionistic fuzzy similarity relation incorporates two threshold parameters, $w_i$ and $w_j$; the values of these parameters are taken in the interval [0.8, 0.9] for all the experiments. Feature selection is performed based on the threshold $\xi$, with the best results obtained at 0.65. Then, instance selection is executed by using the two parameters $\gamma_s$ and $\gamma_m$, with the results computed at values of 0.8 and 0.85 respectively. These results are recorded in Table 1, where the numbers of instances and features produced by the intuitionistic fuzzy set based approach are smaller than those of the fuzzy set assisted technique, except for the Arcene, German, and Heart_disease_processed_hungarian datasets. Here, our proposed approach produced more instances and fewer features for the Arcene and German datasets. For the Heart_disease_processed_hungarian dataset, we recorded more features than the SMOTE-FRST-2T model after synthetic data point generation with SMOTE. Moreover, the SMOTE-IFBR and SMOTE-FRST-2T-FS techniques produced relatively inferior results in terms of overall dimension and size when compared to the proposed approach, as recorded in Table 1. Further, we implemented two more recently presented data reduction techniques, CFBR and FBR, to perform a comparative study. These methods were found to be deficient performers: both techniques eliminated a proportionately smaller number of attributes and samples than the proposed technique, as recorded in Table 1.

Next, the performances of the widely used IBK and random forest classifiers are investigated in terms of overall accuracies with their standard deviations over the reduced datasets produced by the fuzzy and intuitionistic fuzzy assisted techniques, and the results are included in Table 2. From the results, it can be observed that our proposed methodology reports better results when compared to the previous approaches. Moreover, the CFBR and FBR based data reductions also reported poor results, both in terms of average accuracies and standard deviation values, except for the Heart_disease_processed_hungarian and Ionosphere datasets. For Heart_disease_processed_hungarian, IBK reported better results with SMOTE-FRFS-2T-FS and FBR; further, SMOTE-FRFS-2T-FS reported a better result with random forest. The best result was recorded for the Arcene dataset, with an average accuracy of 97.42% and a standard deviation of 3.93. Thereafter, visualization is given by using Receiver Operating Characteristic (ROC) curves, as depicted in Figs. 2 and 3, for a better understanding of the recorded results.

In addition to all the experimental tasks, a statistical test is applied to determine the significance of the proposed technique. In this investigation, we employ two notable testing mechanisms, viz. the Friedman (Friedman, 1940) and Dunn (Dunn, 1961) tests. Multiple hypothesis testing was used to compute the F-statistic, which is carried out by the Friedman test as follows:

$F' = \dfrac{(M' - 1)\,\chi'^2}{M'(N' - 1) - \chi'^2}$
Table 1
Benchmark dataset characteristics and their reduction (I = instances, A = attributes).

Dataset                             | Original (I, A) | Balanced (I, A) | SMOTE-IFBR (I, A) | SMOTE-FRST-2T-FS (I, A) | CFBR (I, A) | FBR (I, A) | SMOTE-IFRST-2T-FS (I, A)
Appendicitis                        | 106, 7          | 155, 7          | 153, 6            | 150, 6                  | 151, 6      | 149, 7     | 147, 5
Arcene                              | 200, 10000      | 244, 10000      | 205, 177          | 198, 105                | 221, 233    | 229, 209   | 209, 91
German                              | 1000, 20        | 1423, 20        | 1339, 11          | 1289, 5                 | 1398, 15    | 1401, 12   | 1389, 5
Heart_disease_processed_hungarian   | 294, 13         | 404, 13         | 389, 8            | 370, 4                  | 371, 11     | 388, 14    | 355, 5
Colon                               | 62, 32          | 88, 32          | 82, 15            | 79, 12                  | 82, 21      | 76, 23     | 75, 10
Ionosphere                          | 351, 33         | 489, 33         | 455, 18           | 443, 11                 | 465, 25     | 471, 27    | 437, 6

Table 2
Comparison of overall accuracies (± standard deviation) for the datasets produced by SMOTE-IFBR, SMOTE-FRFS-2T-FS, SMOTE-IFRFS-2T-FS, CFBR, and FBR on the basis of 10-fold cross validation. Parenthesized numbers are the per-row Friedman ranks of the first three methods; the rank-1 entry is the best.

Dataset                             | Classifier     | SMOTE-IFBR        | SMOTE-FRFS-2T-FS  | SMOTE-IFRFS-2T-FS | CFBR          | FBR
Appendicitis                        | IBK            | 84.75 ± 5.99 (3)  | 91.79 ± 5.95 (2)  | 93.01 ± 3.16 (1)  | 83.42 ± 7.21  | 86.29 ± 6.36
                                    | Random forest  | 88.68 ± 5.89 (3)  | 92.68 ± 3.92 (2)  | 96.69 ± 4.31 (1)  | 86.31 ± 6.98  | 88.47 ± 6.19
Arcene                              | IBK            | 88.46 ± 4.47 (3)  | 92.24 ± 3.37 (2)  | 96.86 ± 3.27 (1)  | 86.53 ± 5.87  | 89.24 ± 4.89
                                    | Random forest  | 89.34 ± 4.12 (3)  | 93.34 ± 3.02 (2)  | 97.42 ± 3.93 (1)  | 88.89 ± 5.28  | 90.21 ± 5.18
Colon                               | IBK            | 86.26 ± 6.27 (3)  | 91.86 ± 5.25 (2)  | 95.89 ± 4.23 (1)  | 84.73 ± 7.43  | 87.65 ± 6.86
                                    | Random forest  | 88.25 ± 9.75 (3)  | 93.19 ± 8.64 (2)  | 97.27 ± 7.33 (1)  | 87.31 ± 8.92  | 89.54 ± 9.31
Heart_disease_processed_hungarian   | IBK            | 83.47 ± 6.25 (3)  | 86.72 ± 5.01 (1)  | 86.09 ± 5.12 (2)  | 85.38 ± 5.81  | 87.39 ± 6.24
                                    | Random forest  | 85.69 ± 5.65 (3)  | 88.69 ± 4.45 (2)  | 91.55 ± 4.43 (1)  | 86.91 ± 4.89  | 88.21 ± 5.12
German                              | IBK            | 77.88 ± 6.08 (3)  | 86.32 ± 6.31 (2)  | 89.06 ± 5.13 (1)  | 75.56 ± 5.98  | 79.25 ± 5.17
                                    | Random forest  | 79.41 ± 2.77 (3)  | 87.41 ± 2.19 (2)  | 88.47 ± 2.23 (1)  | 78.64 ± 4.53  | 82.69 ± 5.21
Ionosphere                          | IBK            | 86.29 ± 4.17 (3)  | 93.66 ± 2.29 (2)  | 95.29 ± 2.11 (1)  | 83.89 ± 5.32  | 86.71 ± 4.42
                                    | Random forest  | 89.77 ± 3.79 (3)  | 96.88 ± 3.23 (1)  | 95.69 ± 3.98 (2)  | 85.65 ± 4.81  | 88.32 ± 5.21
Average rank                        | IBK            | 3                 | 1.8               | 1.2               |               |
                                    | Random forest  | 3                 | 1.8               | 1.2               |               |
F-statistic                         | IBK            | 15                |                   |                   |               |
                                    | Random forest  | 15                |                   |                   |               |
Fig. 2. AUC of four learning algorithms for the (a) Appendicitis, (b) Arcene, and (c) Colon datasets produced by our proposed methodology.

Fig. 3. AUC of four learning algorithms for the (a) Heart_disease_processed_hungarian, (b) German, and (c) Ionosphere datasets produced by our proposed methodology.

are listed in Tables 4 and 5 respectively. Moreover, the results of these algorithms for the original dataset, based on both the 80:20 percentage split and 10-fold cross validation, are listed in Tables 6 and 7 respectively. From these results, it can be highlighted that our framework has greatly improved the performances of the different learning algorithms. Moreover, we obtained much better results than the previously reported values for the different evaluation parameters. Based on 10-fold cross validation, our approach records the best results, with sensitivity, specificity, overall accuracy, MCC, and AUC of 95.6%, 93.6%, 96.9%, 0.891, and 0.969. The previous best results for the discrimination of animal toxin and non-toxin peptides were reported by Jain, Tiwari, and Som (2021a), with sensitivity, specificity, average accuracy, MCC, and AUC of 87.6%, 90.7%, 89.20%, 0.783, and 0.959 respectively. From the above mentioned results, it can be observed that our proposed methodology is superior when compared to all the existing approaches. Finally, a visualization of the average accuracies for the original dataset and the dataset pre-processed by our proposed technique is shown by bar graphs for SMO, Rotation Forest, PART, J48, Vote, and Random Forest, on the basis of both the 80:20 percentage split and 10-fold cross validation (Figs. 4 and 5).
Table 3
Original bioinformatics dataset (Pan et al., 2019) characteristics and their reduction based on different approaches.

Method                 | Instances | Attributes | Reduced instances | Reduced attributes
Jain et al. (2021a)    | 952       | 459        | 952               | 143
Proposed methodology   | 952       | 802        | 912               | 73

Table 4
Performance evaluation metrics of different machine learning algorithms on the animal toxin dataset consisting of animal toxin and non-toxin peptides, produced by our proposed methodology, using a percentage split of 80:20.

Learning algorithm | Sensitivity | Specificity | Accuracy | MCC   | AUC
SMO                | 89.6        | 83.2        | 86.9     | 0.833 | 0.881
Rotation forest    | 83.7        | 84.0        | 80.9     | 0.773 | 0.811
PART               | 82.6        | 79.8        | 82.4     | 0.721 | 0.833
J48                | 86.3        | 81.8        | 82.4     | 0.698 | 0.789
Vote               | 91.5        | 89.8        | 90.5     | 0.887 | 0.891
Random forest      | 92.9        | 91.9        | 91.5     | 0.889 | 0.925

Table 5
Performance evaluation metrics of different machine learning algorithms on the dataset consisting of animal toxin and non-toxin peptides, produced by our proposed methodology, using the 10-fold cross validation technique.

Learning algorithm | Sensitivity | Specificity | Accuracy | MCC   | AUC
SMO                | 90.6        | 86.8        | 89.2     | 0.844 | 0.887
Rotation forest    | 84.7        | 83.6        | 83.8     | 0.789 | 0.821
PART               | 84.8        | 81.9        | 83.2     | 0.811 | 0.822
J48                | 88.7        | 84.6        | 84.8     | 0.711 | 0.802
Vote               | 92.5        | 90.2        | 91.1     | 0.891 | 0.902
Random forest      | 95.6        | 93.6        | 96.9     | 0.891 | 0.969

Table 6
Performance evaluation metrics of different machine learning algorithms on the original dataset consisting of animal toxin and non-toxin peptides, using a percentage split of 80:20.

Learning algorithm | Sensitivity | Specificity | Accuracy | MCC   | AUC
SMO                | 62.7        | 85.5        | 78.4     | 0.489 | 0.741
Rotation forest    | 69.5        | 93.1        | 85.8     | 0.658 | 0.920
PART               | 64.4        | 84.0        | 77.9     | 0.484 | 0.758
J48                | 61.0        | 82.4        | 75.8     | 0.435 | 0.727
Vote               | 66.1        | 93.9        | 85.3     | 0.643 | 0.930
Random forest      | 64.4        | 93.9        | 84.7     | 0.630 | 0.916

Table 7
Performance evaluation metrics of different machine learning algorithms on the original dataset consisting of animal toxin and non-toxin peptides, using the 10-fold cross validation technique.

Learning algorithm | Sensitivity | Specificity | Accuracy | MCC   | AUC
SMO                | 69.5        | 82.3        | 78.3     | 0.507 | 0.759
Rotation forest    | 66.8        | 89.9        | 82.7     | 0.587 | 0.903
PART               | 65.8        | 82.0        | 76.9     | 0.471 | 0.721
J48                | 65.1        | 82.9        | 77.3     | 0.476 | 0.720
Vote               | 65.8        | 93.4        | 84.8     | 0.633 | 0.920
Random forest      | 66.4        | 93.0        | 84.7     | 0.631 | 0.193

Fig. 4. Graphical visualization showing the overall accuracies of six classifiers for discriminating animal toxin and non-toxin peptides by using a percentage split of 80:20.

Fig. 5. Graphical visualization showing the overall accuracies of six classifiers for discriminating animal toxin and non-toxin peptides by using 10-fold cross validation.

Table 8
Performance evaluation metrics/parameters of learning methods for the original Arrhythmia dataset using 10-fold cross validation.

Learning algorithm | Precision | Recall | F-Measure | Accuracy | AUC
SMO                | 67.8      | 70.8   | 66.7      | 70.8     | 0.787
Rotation forest    | 67.8      | 73.5   | 70.2      | 73.5     | 0.865
PART               | 64.7      | 64.2   | 64.3      | 64.2     | 0.751
J48                | 62.6      | 64.2   | 63.3      | 64.2     | 0.719
Vote               | 67.8      | 69.9   | 64.4      | 69.9     | 0.839
Random forest      | 65.0      | 67.3   | 60.7      | 67.3     | 0.826

We further conducted more experiments using datasets from the domains of signal processing and bioinformatics to widen the experimental study. Moreover, these experiments are evaluated on the basis of a few more evaluation metrics, namely precision, recall, and F-measure, to validate the performance of the proposed models efficiently. In the signal processing dataset (Guvenir, Acar, Demiroz, & Cekin, 1997), the objective is to identify the various forms and manifestations of cardiac arrhythmia and assign each record to one of sixteen groups. There are currently 279 feature values that characterize 452 patient records. Class 02 denotes ischemic alterations (coronary artery disease), class 03 denotes an old anterior myocardial infarction, class 04 denotes an old inferior myocardial infarction, and class 05 denotes sinus tachycardia. Class 01 pertains to normal ECG readings, class 09 to left bundle branch block, class 10 to right bundle branch block, class 06 to sinus bradycardia, and class 07 to ventricular premature contraction (PVC). Class 11 corresponds to the first degree of AV block, class 12 to the second degree, class 13 to the third degree, class 14 to left ventricular hypertrophy, class 15 to atrial fibrillation or flutter, and class 16 to the remaining cases. In Tables 8 and 9, the performance evaluation metrics of the various machine learning algorithms are tabulated for the original Arrhythmia dataset and the dataset reduced by the current approach, respectively. A rise in all the performance evaluation metrics can be observed for all the machine learning algorithms on the reduced dataset. The best performance is achieved by random forest, reaching an accuracy of 86% on the reduced dataset (a rise of 16 points in comparison to the original dataset). The best values achieved are a precision, recall, F-measure, accuracy, and AUC of 87.2%, 86.0%, 85.1%, 86.0%, and 0.940 respectively.

Next, we chose a bioinformatics dataset (Shao et al., 2018) consisting of anti-oxidant and non-anti-oxidant peptides to conduct the experiment. As antioxidant proteins have the capacity to remove an
As antioxidant proteins have the capacity to remove an excess of free radicals, they have been found to be closely associated with disease control, and the field of antioxidant protein identification is expanding due to its potential medical applications. Initially, raw data were collected from the UniProt database. To enhance the quality of the dataset, the following procedure was put in place while taking redundant sequences into account: (1) retain the sequences that have been shown to produce antioxidant action in biological experiments; (2) remove sequences with nonstandard letters, i.e., anything outside the alphabets of the twenty most common amino acids, due to their unclear interpretation. By analyzing characteristics of the domain, 1567 protein sequences were determined as negative data points and 710 antioxidant protein sequences as positive data points. Moreover, research has demonstrated that training outcomes produced by redundant samples are untrustworthy (Chou, 2011). The CD-HIT algorithm (Fu, Niu, Zhu, Wu, & Li, 2012) was therefore used to eliminate sequences sharing more than 60% similarity with any other sequence in the positive and negative samples, in order to prevent homology bias and redundancy. Additionally, because of their unclear significance, proteins with nonstandard letters such as 'B', 'X', or 'Z' were disqualified. In summary, the dataset comprises 1551 non-antioxidant protein sequences (negative samples) and 250 antioxidant protein sequences (positive samples). A sketch of these curation steps is given below.
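For illustration only, the curation steps above can be sketched as follows. This is not the authors' pipeline; it assumes Biopython and a local CD-HIT installation, with hypothetical file names. Note that CD-HIT requires the word size -n 4 for identity thresholds in the 0.6–0.7 range.

    import subprocess
    from Bio import SeqIO

    STANDARD = set("ACDEFGHIKLMNPQRSTVWY")  # the twenty common amino acid letters

    # Step 1: drop sequences containing nonstandard letters such as 'B', 'X', or 'Z'.
    kept = [rec for rec in SeqIO.parse("raw_peptides.fasta", "fasta")
            if set(str(rec.seq).upper()) <= STANDARD]
    SeqIO.write(kept, "standard_only.fasta", "fasta")

    # Step 2: remove sequences sharing more than 60% similarity with any other.
    subprocess.run(["cd-hit", "-i", "standard_only.fasta",
                    "-o", "nonredundant60.fasta", "-c", "0.6", "-n", "4"],
                   check=True)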
Table 10
Performance evaluation metrics/parameters of learning methods for the original dataset consisting of anti-oxidant and non-anti-oxidant peptides using 10-fold cross validation.
Learning algorithm Precision Recall F-Measure Accuracy AUC
SMO 84.2 86.3 66.7 70.8 0.787
Rotation forest 87.5 88.0 84.5 88.0 0.782
PART 80.0 80.6 80.3 80.6 0.592
J48 82.0 81.7 81.8 81.7 0.601
Vote 88.2 86.3 80.2 86.3 0.742
Random forest 84.4 86.4 80.7 86.4 0.696

Table 11
Performance evaluation metrics/parameters of learning methods for the reduced dataset consisting of anti-oxidant and non-anti-oxidant peptides using 10-fold cross validation.
Learning algorithm Precision Recall F-Measure Accuracy AUC
SMO 72.8 72.4 72.3 72.4 0.725
Rotation forest 92.7 92.7 92.7 92.7 0.980
PART 88.3 88.3 88.3 88.3 0.896
J48 87.3 87.3 87.3 87.3 0.876
Vote 94.3 94.3 94.3 94.3 0.985
Random forest 91.8 91.7 91.7 91.7 0.972

In Tables 10 and 11, the performance evaluation metrics of the various machine learning algorithms are tabulated for the original and reduced anti-oxidant/non-anti-oxidant dataset, respectively, using the current approach. A rise in all the performance evaluation metrics can be observed for all the machine learning algorithms on the reduced dataset. The best performance is achieved by the vote method, reaching an overall accuracy of 94.3% on the reduced dataset (a rise of 8 points in comparison to the original dataset). An effective improvement in precision, recall, F-measure, and AUC is also reported for our proposed method.
6. Conclusion

In this paper, we proposed a new method to handle vagueness, class imbalance, uncertainty, noise, and redundant and irrelevant features. Here, we combined kernelized fuzzy c-means with different-class-ratio-assisted intuitionistic fuzzy rough attribute reduction and SMOTE-IFRS-2T instance selection. Membership grades for the real-valued data were generated by a new kernelized fuzzy c-means based approach. Further, the Sugeno negation function was applied to produce non-membership grades, followed by hesitancy degrees.
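For concreteness, this fuzzification step can be written as follows; the display is a sketch assuming the standard Sugeno negation with parameter \(\lambda \ge 0\), which may differ from the exact parameterization used in the model:

\[
\nu(x) = \frac{1-\mu(x)}{1+\lambda\,\mu(x)}, \qquad
\pi(x) = 1-\mu(x)-\nu(x), \qquad \lambda \ge 0,
\]

where \(\mu(x)\) is the kernelized fuzzy c-means membership grade, \(\nu(x)\) the induced non-membership grade, and \(\pi(x)\) the hesitancy degree. For \(\lambda \ge 0\) one has \(\nu(x) \le 1-\mu(x)\), so \(\pi(x) \ge 0\) and \(\mu(x)+\nu(x)+\pi(x)=1\); at \(\lambda = 0\) the relation reduces to the ordinary fuzzy complement \(\nu(x)=1-\mu(x)\).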
Moreover, new similarity-relation-based lower and upper approximations were discussed to initiate an intuitionistic fuzzy rough set model. Next, the different classes' ratio was employed in a discernibility-matrix-based concept for a novel attribute reduction technique, using the similarity relation along with the different classes' ratio to eliminate redundant and/or irrelevant features. Thereafter, the positive region was presented by using the lower approximation. Furthermore, SMOTE-IFRS-2T was used to deal with imbalanced data by creating artificial samples, followed by elimination of redundant minority and majority class samples, as sketched below.
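To make the resampling stage concrete, the following minimal sketch uses imbalanced-learn for the SMOTE step; the positive_region function and the two thresholds t_min and t_maj are hypothetical stand-ins for the IFRS lower-approximation filter, not the actual model.

    import numpy as np
    from imblearn.over_sampling import SMOTE
    from sklearn.datasets import make_classification

    # Hypothetical imbalanced data (90% majority, 10% minority).
    X, y = make_classification(n_samples=200, weights=[0.9, 0.1], random_state=0)

    # Step 1: balance the classes with synthetic minority samples.
    X_bal, y_bal = SMOTE(random_state=0).fit_resample(X, y)

    # Step 2: keep only instances whose positive-region membership clears a
    # class-specific threshold (placeholder scores stand in for the IFRS model).
    def positive_region(X, y):
        return np.random.default_rng(0).uniform(size=len(y))

    gamma = positive_region(X_bal, y_bal)
    t_min, t_maj = 0.5, 0.5   # hypothetical thresholds for minority/majority classes
    keep = np.where(y_bal == 1, gamma >= t_min, gamma >= t_maj)
    X_sel, y_sel = X_bal[keep], y_bal[keep]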
We also conducted an extensive experimental study to explore the performance of the different machine learning algorithms after reduction of the datasets. Our experimental results demonstrated the effectiveness of the proposed methods. Finally, a framework consisting of the above-mentioned approaches was presented to improve the prediction of animal toxin peptides.
The advantages of the entire study can be highlighted as follows:

• A novel intuitionistic fuzzy similarity relation has been employed to present the intuitionistic fuzzy rough set model. This model can be effectively utilized to build data reduction techniques for both balanced and imbalanced datasets.
• This work has introduced a novel method for intuitionistic fuzzification of real-valued datasets, which can motivate researchers to present different pre-processing and processing approaches to handle large volumes of information systems.
• A new method has been established to cope with imbalanced datasets in the intuitionistic fuzzy rough framework.
• An efficient approach has been discussed to handle the size and features of imbalanced datasets, where the uncertainty due to both judgment and identification is tackled precisely.
• This method has incorporated the different classes' ratio to avoid noise, which prevents the misclassification of large datasets.

Moreover, a few limitations can be outlined as below:

• This method involves a few parameters, and the same parameter values may not produce effective results for all information systems.
• The time and space complexity can be further improved with an incremental algorithm.

In the future, we will give a more suitable method to generalize the values of the thresholds. Time and space complexity will be improved with incremental bireduct methods. Moreover, intuitionistic fuzzy c-means assisted techniques can be made more efficient. Probabilistic approaches can be combined with the intuitionistic fuzzy rough concept to handle the reduction in a more effective fashion. We want to extend this method to handle imbalanced datasets without using SMOTE, to avoid the generation of artificial samples. New similarity measures can also be introduced to improve the entire methodology.

CRediT authorship contribution statement

Anoop Kumar Tiwari: Conceptualization, Problem formulation, Methodology, Original draft preparation, Writing – review & editing, Final drafting. Abhigyan Nath: Data curation, Programming, Simulation, Validation, Numerical analysis, Visualization, System set-up. Rakesh Kumar Pandey: Mathematical Modelling, Visualization, Investigation. Priti Maratha: Supervision, Problem formulation, Programming, Validation, Writing – review & editing.
Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Data availability

Data will be made available on request.

Acknowledgment

This research work is not funded by any organization.

Ethical and informed consent for data used

No humans or animals were harmed in the course of this study.

References

Anaraki, J. R., Samet, S., Lee, J.-H., & Ahn, C.-W. (2015). SUFFUSE: Simultaneous fuzzy-rough feature-sample selection. Journal of Advances in Information Technology, 6, 103–110.
Ashraf, M., Zaman, M., & Ahmed, M. (2019). To ameliorate classification accuracy using ensemble vote approach and base classifiers. In Emerging technologies in data mining and information security: Proceedings of IEMIS 2018, Volume 2 (pp. 321–334). Springer.
Atanassov, K. T. (1999). Intuitionistic fuzzy sets: Vol. 35. Springer.
Bezdek, J. C., Ehrlich, R., & Full, W. (1984). FCM: The fuzzy c-means clustering algorithm. Computers & Geosciences, 10, 191–203.
Blake, C. L. (1998). UCI repository of machine learning databases. http://www.ics.uci.edu/~mlearn/MLRepository.html.
Breiman, L. (2001). Random forests. Machine Learning, 45, 5–32.
Chawla, N. V., Bowyer, K. W., Hall, L. O., & Kegelmeyer, W. P. (2002). SMOTE: Synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 16, 321–357.
Chen, J., Mi, J., & Lin, Y. (2020). A graph approach for fuzzy-rough feature selection. Fuzzy Sets and Systems, 391, 96–116.
Chen, Z., Zhao, P., Li, F., Leier, A., Marquez-Lago, T. T., Wang, Y., et al. (2018). iFeature: A python package and web server for features extraction and selection from protein and peptide sequences. Bioinformatics, 34, 2499–2502.
Chou, K.-C. (2011). Some remarks on protein attribute prediction and pseudo amino acid composition. Journal of Theoretical Biology, 273, 236–247.
Cornelis, C., De Cock, M., & Kerre, E. E. (2003). Intuitionistic fuzzy rough sets: At the crossroads of imperfect knowledge. Expert Systems, 20, 260–270.
Dai, J., Hu, H., Wu, W.-Z., Qian, Y., & Huang, D. (2017). Maximal-discernibility-pair-based approach to attribute reduction in fuzzy rough sets. IEEE Transactions on Fuzzy Systems, 26, 2174–2187.
Derrac, J., Cornelis, C., García, S., & Herrera, F. (2012). Enhancing evolutionary instance selection algorithms by means of fuzzy rough set based feature selection. Information Sciences, 186, 73–92.
Ding, Y., & Fu, X. (2016). Kernel-based fuzzy c-means clustering algorithm based on genetic algorithm. Neurocomputing, 188, 233–238.
Dubois, D., & Prade, H. (1992). Putting rough sets and fuzzy sets together. In Intelligent decision support: Handbook of applications and advances of the rough sets theory (pp. 203–232). Springer.
Dunn, O. J. (1961). Multiple comparisons among means. Journal of the American Statistical Association, 56, 52–64.
Frank, E., & Witten, I. H. (1998). Generating accurate rule sets without global optimization.
Friedman, M. (1940). A comparison of alternative tests of significance for the problem of m rankings. The Annals of Mathematical Statistics, 11, 86–92.
Fu, L., Niu, B., Zhu, Z., Wu, S., & Li, W. (2012). CD-HIT: Accelerated for clustering the next-generation sequencing data. Bioinformatics, 28, 3150–3152.
Guvenir, H. A., Acar, B., Demiroz, G., & Cekin, A. (1997). A supervised machine learning algorithm for arrhythmia analysis. In Computers in cardiology 1997 (pp. 433–436). IEEE.
Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., & Witten, I. H. (2009). The WEKA data mining software: An update. ACM SIGKDD Explorations Newsletter, 11, 10–18.
Huang, B., Zhuang, Y.-L., Li, H.-X., & Wei, D.-K. (2013). A dominance intuitionistic fuzzy-rough set approach and its applications. Applied Mathematical Modelling, 37, 7128–7141.
Jain, P., Tiwari, A. K., & Som, T. (2020). A fitting model based intuitionistic fuzzy rough feature selection. Engineering Applications of Artificial Intelligence, 89, Article 103421.
Jain, P., Tiwari, A. K., & Som, T. (2021a). Enhanced prediction of animal toxins using intuitionistic fuzzy rough feature selection technique followed by SMOTE. In 2021 25th international conference on information technology (pp. 1–4). IEEE.
Jain, P., Tiwari, A. K., & Som, T. (2021b). Enhanced prediction of anti-tubercular peptides from sequence information using divergence measure-based intuitionistic fuzzy-rough feature selection. Soft Computing, 25, 3065–3086.
Jain, P., Tiwari, A. K., & Som, T. (2022). An intuitionistic fuzzy bireduct model and its application to cancer treatment. Computers & Industrial Engineering, 168, Article 108124.
Jensen, R. (2008). Rough set-based feature selection: A review. In Rough computing: Theories, technologies and applications (pp. 70–107). IGI Global.
Jensen, R., & Cornelis, C. (2010). Fuzzy-rough instance selection. In International conference on fuzzy systems (pp. 1–7). IEEE.
Jensen, R., & Shen, Q. (2004). Fuzzy–rough attribute reduction with application to web categorization. Fuzzy Sets and Systems, 141, 469–485.
Jensen, R., & Shen, Q. (2008). New approaches to fuzzy-rough feature selection. IEEE Transactions on Fuzzy Systems, 17, 824–838.
Ji, W., Pang, Y., Jia, X., Wang, Z., Hou, F., Song, B., et al. (2021). Fuzzy rough sets and fuzzy rough neural networks for feature selection: A review. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 11, Article e1402.
Kumar, D., Verma, H., Mehra, A., & Agrawal, R. (2019). A modified intuitionistic fuzzy c-means clustering approach to segment human brain MRI image. Multimedia Tools and Applications, 78, 12663–12687.
Li, J., Cheng, K., Wang, S., Morstatter, F., Trevino, R. P., Tang, J., et al. (2017). Feature selection: A data perspective. ACM Computing Surveys, 50, 1–45.
Liu, J., & Xu, M. (2008). Kernelized fuzzy attribute C-means clustering algorithm. Fuzzy Sets and Systems, 159, 2428–2445.
Mac Parthaláin, N., & Jensen, R. (2013). Simultaneous feature and instance selection using fuzzy-rough bireducts. In 2013 IEEE international conference on fuzzy systems (pp. 1–8). IEEE.
Mac Parthaláin, N., Jensen, R., & Diao, R. (2019). Fuzzy-rough set bireducts for data reduction. IEEE Transactions on Fuzzy Systems, 28, 1840–1850.
Manavalan, B., Basith, S., Shin, T. H., Wei, L., & Lee, G. (2019). AtbPpred: A robust sequence-based prediction of anti-tubercular peptides using extremely randomized trees. Computational and Structural Biotechnology Journal, 17, 972–981.
Menchetti, S., Costa, F., Frasconi, P., & Pontil, M. (2005). Wide coverage natural language processing using kernel methods and neural networks for structured data. Pattern Recognition Letters, 26, 1896–1906.
Mukherjee, M., & Khushi, M. (2021). SMOTE-ENC: A novel SMOTE-based method to generate synthetic data for nominal and continuous features. Applied System Innovation, 4, 18.
Murofushi, T., Sugeno, M., et al. (2000). Fuzzy measures and fuzzy integrals. In Fuzzy measures and integrals: Theory and applications: Vol. 2000 (pp. 3–41). Physica-Verlag Heidelberg.
Nath, A. (2021). Prediction for understanding the effectiveness of antiviral peptides. Computational Biology and Chemistry, 95, Article 107588.
Nath, A., & Sahu, G. K. (2019). Exploiting ensemble learning to improve prediction of phospholipidosis inducing potential. Journal of Theoretical Biology, 479, 37–47.
Nath, A., & Subbiah, K. (2015). Maximizing lipocalin prediction through balanced and diversified training set and decision fusion. Computational Biology and Chemistry, 59, 101–110.
Olvera-López, J. A., Carrasco-Ochoa, J. A., Martínez-Trinidad, J. F., & Kittler, J. (2010). A review of instance selection methods. Artificial Intelligence Review, 34, 133–143.
Pan, Y., Wang, S., Zhang, Q., Lu, Q., Su, D., Zuo, Y., et al. (2019). Analysis and prediction of animal toxins by various Chou's pseudo components and reduced amino acid compositions. Journal of Theoretical Biology, 462, 221–229.
Pawlak, Z. (1982). Rough sets. International Journal of Computer and Information Sciences, 11, 341–356.
Platt, J. (1998). Sequential minimal optimization: A fast algorithm for training support vector machines.
Qiu, Z., & Zhao, H. (2022). A fuzzy rough set approach to hierarchical feature selection based on Hausdorff distance. Applied Intelligence, 52, 11089–11102.
Quinlan, J. R. (2014). C4.5: Programs for machine learning. Elsevier.
Ramentol, E., Gondres, I., Lajes, S., Bello, R., Caballero, Y., Cornelis, C., et al. (2016). Fuzzy-rough imbalanced learning for the diagnosis of High Voltage Circuit Breaker maintenance: The SMOTE-FRST-2T algorithm. Engineering Applications of Artificial Intelligence, 48, 134–139.
Ramentol, E., Verbiest, N., Bello, R., Caballero, Y., Cornelis, C., & Herrera, F. (2012). SMOTE-FRST: A new resampling method using fuzzy rough set theory. In Uncertainty modeling in knowledge engineering and decision making (pp. 800–805). World Scientific.
Rodriguez, J. J., Kuncheva, L. I., & Alonso, C. J. (2006). Rotation forest: A new classifier ensemble method. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28, 1619–1630.
Shao, L., Gao, H., Liu, Z., Feng, J., Tang, L., & Lin, H. (2018). Identification of antioxidant proteins with deep learning from sequence information. Frontiers in Pharmacology, 9, 10–36.
Ślęzak, D., & Janusz, A. (2011). Ensembles of bireducts: Towards robust classification and simple representation. In International conference on future generation information technology (pp. 64–77). Springer.
Suganya, R., & Shanthi, R. (2012). Fuzzy c-means algorithm: A review. International Journal of Scientific and Research Publications, 2, 1.
Tan, A., Wu, W.-Z., Qian, Y., Liang, J., Chen, J., & Li, J. (2018). Intuitionistic fuzzy rough set-based granular structures and attribute subset selection. IEEE Transactions on Fuzzy Systems, 27, 527–539.
Thangavel, K., & Pethalakshmi, A. (2009). Dimensionality reduction based on rough set theory: A review. Applied Soft Computing, 9, 1–12.
Tiwari, A. K., Shreevastava, S., Som, T., & Shukla, K. K. (2018). Tolerance-based intuitionistic fuzzy-rough set approach for attribute reduction. Expert Systems with Applications, 101, 205–212.
Tsang, E. C., Hu, Q., & Chen, D. (2016). Feature and instance reduction for PNN classifiers based on fuzzy rough sets. International Journal of Machine Learning and Cybernetics, 7, 1–11.
Verbiest, N., Cornelis, C., & Herrera, F. (2013). FRPS: A fuzzy rough prototype selection method. Pattern Recognition, 46, 2770–2782.
Wang, C., Qian, Y., Ding, W., & Fan, X. (2021). Feature selection with fuzzy-rough minimum classification error criterion. IEEE Transactions on Fuzzy Systems, 30, 2930–2942.
Yang, X., Chen, H., Li, T., & Luo, C. (2022). A noise-aware fuzzy rough set approach for feature selection. Knowledge-Based Systems, 250, Article 109092.
Yang, X., Chen, H., Li, T., Zhang, P., & Luo, C. (2022). Student-t kernelized fuzzy rough set model with fuzzy divergence for feature selection. Information Sciences, 610, 52–72.
Zadeh, L. A. (1965). Fuzzy sets. Information and Control, 8, 338–353.
Zhang, D.-Q., & Chen, S.-C. (2004). A novel kernelized fuzzy c-means algorithm with application in medical image segmentation. Artificial Intelligence in Medicine, 32, 37–50.
Zhang, X., Mei, C., Chen, D., & Li, J. (2016). Feature selection in mixed data: A method using a novel fuzzy rough set-based information entropy. Pattern Recognition, 56, 1–15.
Zhang, X., Mei, C., Chen, D., & Yang, Y. (2018). A fuzzy rough set-based feature selection method using representative instances. Knowledge-Based Systems, 151, 216–229.
Zhang, X., Mei, C., Chen, D., Yang, Y., & Li, J. (2019). Active incremental feature selection using a fuzzy-rough-set-based information entropy. IEEE Transactions on Fuzzy Systems, 28, 901–915.
Zhang, X., Mei, C., Li, J., Yang, Y., & Qian, T. (2022). Instance and feature selection using fuzzy rough sets: A bi-selection approach for data reduction. IEEE Transactions on Fuzzy Systems.