A Three-Way Approach For Protein Function Classification (Deep Learning Based 3WC)
functions is also essential for understanding how biological activities are performed at the
molecular level. This is useful in developing personalized medicine, more effective therapeutic
interventions as well as understanding biological entities as engineered systems [2–7]. On the
other hand, as the number of sequenced genomes grows, new proteins with unknown
functions continue to emerge at an exponential rate. Under
these conditions, it is not feasible to manually identify and assign functions to proteins. Intelli-
gent mechanisms are generally relied upon to automatically predict and assign functions to
proteins [8–10].
Several methods have been proposed for characterization of protein functions. The early
and conventional techniques were generally based on the most fundamental type of informa-
tion about proteins i.e., their amino acid sequence, utilizing tools such as Basic Local Align-
ment Search Tool (BLAST) [11]. The sequence of a protein determines its different characteristics
such as its sub-cellular localization, possible structural conformations as well as its functions
[3]. Some of the prominent approaches in this category can be found in [12–14]. With the
availability of data from massive high-throughput experiments, features based on different
data such as genomic contextual data and Protein-Protein Interactions (PPIs) data, have also
emerged. Recent and advanced computational methods utilize these and similar types of information
in designing approaches for the prediction task. For example, features based on genomic contex-
tual data were utilized by [12, 15, 16], features based on protein-protein interaction data were
used by [17–20], and features exploiting function structure relationship were reported in [21–
23]. As we gain access to richer information, we may expect more effective models
and approaches for precise prediction of protein functions.
Due to technological advancements, our understanding of biological processes is improving
and new features describing proteins are emerging on a regular basis [3]. These anticipated
features cannot be ignored in designing more effective and efficient prediction techniques.
An important issue that needs to be addressed in this context is how to develop effective
models by incorporating and taking advantage of the ever evolving biological
information that leads to new features and characteristics of proteins. This however has gener-
ally been overlooked and received little or no attention in the existing literature. A general
assumption, although not explicitly stated, is that the information is fixed (i.e., not dynamic
and evolving) while developing classification approaches. This assumption may not always
hold; for instance, consider the classification of proteins whose functions may not be precisely
identified due to a lack of associated biological information (although we may anticipate it in
the future), thereby leading to compromised results. To address this issue, i.e., incorpo-
rating the anticipated future information into the predictive task, we propose a three-way
decision making approach that includes a decision option of deferment. This option is exer-
cised whenever we have inconclusive and insufficient evidence to reach confirmed or certain
decisions. The deferred decision option provides provisions for incorporating future informa-
tion which may be used in deciding the deferred cases. In particular, three types of decisions
are used, i.e., acceptance, rejection and deferment, in order to classify the functions of proteins.
There are different models for inducing three-way decisions. In this article, we investigate
and examine probabilistic rough sets based three-way decision making approaches for protein
functions classification [24]. The probabilistic rough sets can be used to induce three regions
corresponding to a concept (represented in terms of a set), namely, positive, negative and
boundary regions. The three regions lead to three-way decisions in the form of acceptance,
rejection and deferment, respectively. The three regions and their respective decisions are
defined and controlled by a pair of thresholds. There are different forms and models of proba-
bilistic rough sets based on how these thresholds are obtained and interpreted. We consider
two such models, i.e., Game-Theoretic Rough Sets (GTRS) [25–27] and Information-Theoretic
Rough Sets (ITRS) [28]. Moreover, we examine and define five three-way approaches based on
the GTRS and ITRS by employing different measures and iterative methods. To incorporate
and benefit from these three-way approaches in real applications, we propose an architec-
ture of protein functions classification. Lastly, we evaluated the three-way approaches on a
dataset of Saccharomyces cerevisiae species proteins obtained from the UniProt database
[29], with the corresponding functional classes extracted from the well-known Gene Ontology
(GO) database [30]. The experimental results indicate that by increasing the level of biological
information associated with proteins, the number of deferred cases can be reduced while
maintaining the same level of accuracy. We comprehensively benchmark our approaches
under these settings and conclude that the classification becomes more crisp as the knowledge
of associated biological information matures.
The code (Python/Bash/Matlab) and data files used in this work are available as a zip file
(“Protein_Functions_TWD_data_code.zip”) from http://tinyurl.com/jdpwkkq.
Background
Protein function classification
An important factor that impacts the performance of function prediction models is the type of
biological information used to infer functional association among proteins. In recent years,
many high-throughput techniques have been developed to devise mechanisms leading to precise
prediction of protein functions. These techniques utilize information derived from
sequence similarity, protein 3D structure, phylogenetic profiles, protein complexes, PPIs and gene
expression profiles [31–33]. The most prominent techniques utilize proteome-scale PPI net-
works that have been retrieved for several organisms including yeast and human. Protein-pro-
tein networks are graphs where each node represents a protein and edges between nodes
represent an interaction. An interaction in the network is either a direct physical association
between the proteins (typically retrieved via two-hybrid analysis [34]) or, alternatively, if
two interacting proteins are part of the same multi-protein complex, they are also considered
interacting proteins [35]. Thus, from an informatics point of view, an interaction is not necessarily
a direct physical association of proteins but sometimes a mutual presence in the same
protein complex, depending on the experiment which reveals the interaction.
The most recent as well as renowned approaches in the field of protein function prediction
use protein-protein interaction data in different ways [31–33]. A large majority of these techniques
are based on the fact that interacting proteins are likely to share common functions as
they interact for an associated biological activity. Methods in this category assign annotations
to the protein in question based on the functions of its neighboring proteins. The methods
vary in the extent to which they employ global features of the interactome in annotating pro-
teins, or the way they exploit the topological features of the interactome [17, 18]. In addition to
that, the methods are based on quite varied underlying formulations and use well understood
concepts from the fields of graph theory, graphical models, stochastic processes, probabilistic
graphs or clustering [18, 19].
Another class of approaches are based on utilizing the GO structure into computational
models by incorporating the semantic similarity offered by the Directed Acyclic Graph (DAG)
architecture of gene ontology. The integration of multi-level gene ontology terms exploiting
their relationships for protein function prediction was investigated in [2, 8, 9, 36]. These meth-
ods calculate different similarity measures by operating on GO term dependencies to define
functional associations among proteins. A similar technique, outlined in [8], exploits the
Markov Random Field (MRF) properties of protein-protein networks and integrates
inter-species protein homolog information to construct MRF-based graphs using gene
ontology terms. The authors report high precision when tested for a limited set of functional
terms [8].
Another type of biological information that is frequently used for uncharacterized proteins
is the number of motifs conserved in those proteins [9, 36]. Several functionally conserved pro-
teins are found to have motifs that associate them to a particular molecular activity. For exam-
ple, hypothetical protein YIL169C is conserved with Chemotaxis_Transduce_2 and T_SNARE
motifs, and similar motifs in known proteins can be used to link functional information with
the protein under investigation. Integrating heterogeneous information conserved across proteins
of unknown function with state-of-the-art classification schemes may help to increase protein
function prediction accuracy.
The existing computational approaches have significantly contributed in understanding
and characterizing protein functions by investigating and utilizing different types of features.
However, there is still a need for approaches to incorporate and integrate the ever evolving fea-
tures of proteins for precise prediction of their functions. These new features, once known and
available, will give better insight into biological activities and are thereby expected to provide a more
precise characterization of proteins. In the later sections of this article, we present a three-way
approach to address these issues.
Three-way decisions
In many real life decision making scenarios involving vague and uncertain information, the
three-way decision making strategy including a delay, deferment or non-commitment deci-
sion option is a better and more useful approach [37–39]. To explain this, consider the follow-
ing examples: 1) How do we make a purchase decision based on information gathered from
blogs, reviews, friend suggestions and experiences? 2) How do doctors make diagnosis deci-
sions based on the presence of some symptoms and tests? 3) How do military commanders
decide to carry out military actions based on intelligence information? In all these and similar
decision scenarios, the decision makers are faced with two types of situations. Either they have
sufficient and convincing information necessary to make a decision or they are faced with
vague and incomplete information which is insufficient to make a useful decision. In the for-
mer case, the decision makers can exercise immediate and certain decisions in the form of
acceptance/rejection, yes/no or true/false. In the latter case, the decision makers may not be
able to make certain decisions. For instance, the diagnosis tests are inconclusive or the intelli-
gence information is vague or incomplete. A better and more useful choice in such uncertain
and doubtful situations is to delay the decisions, assuming that future information will evolve
which will make the decision making more obvious and evident. Three-way decisions is essen-
tially the same approach to decision making. We make immediate accept or reject decisions if
we have convincing and sufficient evidence based on the available information. On the other
hand, we make a deferment decision whenever we lack sufficient evidence.
In fact, three-way decisions have been practiced over the years across different domains,
including medical decision making, psychology, social judgment theory, management sciences
and machine learning [40–44]. These application domains suggest that three-way decisions
enjoy a long history from a usage and application perspective; it is surprising to note, however,
that from a theoretical perspective they long lacked a unified formal description [45].
This theoretical gap was first recognized in the rough sets community. In particular, Yao intro-
duced a general theory of three-way decisions, motivated by the rough sets based three regions
[45]. The essential notion in the theory adopted from rough sets is the division of the universe
into three pair-wise disjoint regions. The theory however is not restricted to rough sets and
goes beyond it by considering rough set theory as one of many possible ways to construct and
induce the three regions [39, 46]. Three-way decisions may be formulated based on the theo-
ries such as rough sets, interval sets, shadowed sets, approximations of fuzzy sets, a threshold
approach in medical decision making, and orthopairs [47–54].
An important consideration in formulating three-way decisions is the division of the uni-
versal set into three pair-wise disjoint regions. It has recently been argued that an equally important
consideration is the design of effective strategies for processing the three regions [39]. The
realization of these two essential components, i.e., division and processing, leads to the trisecting
and acting framework of three-way decisions [39].
The trisecting and acting framework explains and presents three-way decisions as a two
step process. In the first step, i.e., trisecting, the universe is divided into three pair-wise disjoint
regions. This means that we seek tripartition of the universe. In the second step, i.e., acting,
strategies are designed for processing the three regions to obtain three-way decisions. This
framework aims at introducing three-way decisions at a more generic level. Generally, the
division of the universe is carried out based on an evaluation function and a pair of thresholds.
The evaluation function assigns an evaluation value to each object by employing some criteria.
The objects whose evaluation values are at or above a certain threshold of acceptance make
up the POS region. The objects whose evaluation values are at or below a certain threshold of
rejection make up the NEG region. The objects whose evaluation values are above the rejection
threshold but below the acceptance threshold make up the BND region. A specific definition
of evaluation-based three-way decisions based on a single evaluation function (used for evaluating
both acceptance and rejection) and a totally ordered set is given in [39].
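To make the trisecting step concrete, the following Python sketch (an illustration only, with hypothetical evaluation values and thresholds; not the authors' implementation) divides a set of objects into the POS, NEG and BND regions using a single evaluation function and a pair of thresholds.

```python
def trisect(objects, evaluate, alpha, beta):
    """Divide objects into POS, NEG and BND regions based on an
    evaluation function and a pair of thresholds (alpha, beta)."""
    assert 0 <= beta < alpha <= 1.0, "thresholds must keep the regions disjoint"
    pos, neg, bnd = [], [], []
    for obj in objects:
        value = evaluate(obj)
        if value >= alpha:        # sufficient evidence: accept
            pos.append(obj)
        elif value <= beta:       # sufficient evidence: reject
            neg.append(obj)
        else:                     # inconclusive evidence: defer
            bnd.append(obj)
    return pos, neg, bnd

# Hypothetical objects paired with their evaluation values.
objects = [("o1", 0.9), ("o2", 0.4), ("o3", 0.1)]
pos, neg, bnd = trisect(objects, evaluate=lambda o: o[1], alpha=0.8, beta=0.2)
print(pos, neg, bnd)  # [('o1', 0.9)] [('o3', 0.1)] [('o2', 0.4)]
```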
There are many issues and challenges for building and using three-way decision models.
Some of these issues include the definition and construction of evaluation functions, the defi-
nition of the domains for the evaluation functions, the determination and interpretation of
acceptance and rejection levels, the measurement of the quality of the three regions, generation
of predictive rules from the three regions for making decisions on new objects, descriptive
rules for describing the three regions and design of strategies and actions corresponding to the
three regions [39, 45]. Based on how these issues are handled and interpreted, we may have dif-
ferent three-way decision making models and approaches. We focus on three-way decisions
with probabilistic rough sets.
Fig 1. Logical view of the architecture with three-way decisions for protein function classification.
doi:10.1371/journal.pone.0171702.g001
the systems. In some cases, it may also provide an auto-correction facility. Besides the interface,
there are various other components at the core of the architecture, namely Knowledge Discovery,
Feature Extractor, Learning Module, Knowledge Base and Control Functions. We now
explain each of them briefly with their intended functionalities.
Knowledge Discovery/Information Retrieval: The Knowledge Discovery module interacts
with both the worldwide biological databases and feature extractor module. As new features
are evolving, the feature extractor module may require different types of biological information
to compute feature values. On one hand, the knowledge discovery module provides querying and
searching facilities for extracting information from relevant biological databases and, on the other
hand, it is responsible for passing the results to the feature extractor module.
Feature Extractor: The features describing proteins are computed based on relevant data
extracted from biological databases which are spread around the world. The feature extractor
module requests or queries the information retrieval component for the relevant information
necessary for computing a feature value. The information retrieval component extracts
the required information from the world wide biological databases. For example, for getting
one of the features namely, protein interaction networks (PIN), this module will ask for rele-
vant information, i.e., number of interactors corresponding to a protein. The information
retrieval identifies and searches the relevant databases such as STRING and IntAct databases
and will pass the respective information to the feature extractor. The feature extractor module
is then responsible for further processing in order to calculate the feature value, e.g. the num-
ber of interactors present in both the databases. As new features become available due to tech-
nological advancements, the feature extractor module will ask for new types of information
from the information retrieval component and perform the relevant computation and processing to
calculate the feature values. In section Data Preparation, we describe different types of features
that have evolved over time and explain the types of data that are required to compute
them.
Learning Module: The learning module interacts with the feature extractor and knowledge
base. This module will incorporate intelligent techniques to make useful inferences from the
data to reach effective classification decisions. In this article, we suggest a classification mechanism
based on three-way decisions as one possibility. An important output of the learning
module will be the set of functions performed by a protein.
Knowledge Base: The knowledge base contains necessary information that is learned and
made available by the learning module. Information such as decision thresholds and
rules for classifying proteins may be stored in the knowledge base for future use.
Control Functions: This module is included to ensure security and protect the system from
attacks and unauthorized usage. It should provide functionalities such as access rights and
permissions.
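As an illustration of how these modules could fit together, the following Python skeleton sketches their interactions; all class and method names here are hypothetical and are not part of the described system.

```python
class InformationRetrieval:
    """Knowledge Discovery module: queries worldwide biological databases."""
    def fetch(self, protein_id, data_type):
        # A real implementation would query databases such as UniProt,
        # STRING or IntAct and return the requested records.
        raise NotImplementedError

class FeatureExtractor:
    """Computes feature values from data supplied by InformationRetrieval."""
    def __init__(self, retrieval):
        self.retrieval = retrieval

    def extract(self, protein_id):
        interactors = self.retrieval.fetch(protein_id, "interactors")
        return {"num_interactors": len(interactors)}

class LearningModule:
    """Makes three-way classification decisions and stores learned knowledge."""
    def __init__(self, knowledge_base):
        self.kb = knowledge_base

    def classify(self, features):
        alpha, beta = self.kb.get("thresholds", (0.8, 0.2))
        # ... three-way decision logic based on (alpha, beta) goes here ...
        return "defer"

knowledge_base = {}  # Knowledge Base: thresholds, rules, etc. learned so far
```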
In rough set theory, information about objects is commonly represented in an information table, defined as

$$S = (U, At, \{V_a \mid a \in At\}, \{I_a \mid a \in At\}), \quad (1)$$

where U is a finite set of objects also known as the universe, At is a finite set of attributes, Va is the domain of an attribute a ∈ At and Ia is an information function which provides a mapping from U → Va. In particular, the information function Ia assigns to each object x ∈ U a value in Va, i.e., Ia(x) ∈ Va. A major concern in rough set theory is how to discern objects. The equivalence relation defined on U is used for this purpose. For a set of attributes A ⊆ At, the equivalence relation, namely EA, is defined as,

$$E_A = \{(x, y) \in U \times U \mid \forall a \in A,\ I_a(x) = I_a(y)\}. \quad (2)$$

This means that any two objects x and y in U are equivalent, or in other words indiscernible, based on an attribute set A ⊆ At if they share the same values on all attributes in A.
The equivalence relation may be used to create equivalence classes, which induce a partition of U denoted by U/E. The equivalence class containing an object x is given by [x] = {y ∈ U | xEy}. The fundamental notions of rough set theory, i.e., the approximations and the three regions, are defined using equivalence classes as follows.
$$\underline{apr}(C) = \{x \in U \mid [x] \subseteq C\}, \quad (3)$$

$$\overline{apr}(C) = \{x \in U \mid [x] \cap C \neq \emptyset\}, \quad (4)$$

where $\underline{apr}(C)$ and $\overline{apr}(C)$ are called the lower and upper approximations of the concept C, respectively.
The lower and upper approximations are used to define the positive, negative and boundary
regions (which leads to three-way decisions, already discussed in the section Three-way Deci-
sions) given by [24, 55],
$$POS(C) = \underline{apr}(C) = \{x \in U \mid [x] \subseteq C\}, \quad (5)$$

$$NEG(C) = (\overline{apr}(C))^c = \{x \in U \mid [x] \cap C = \emptyset\}, \quad (6)$$

$$BND(C) = \overline{apr}(C) - \underline{apr}(C) = \{x \in U \mid [x] \cap C \neq \emptyset \ \wedge\ [x] \not\subseteq C\}. \quad (7)$$
The three regions have a simple but very meaningful interpretation. We accept an object as
belonging to the concept if it is in the positive region. We reject an object as belonging to the
concept if it is in the negative region. We defer the decision for an object as belonging to the
concept if it is in the boundary region. The three-region representation of rough sets defined
in Eqs (5)–(7) has led to the introduction of the theory of three-way decisions [39]. In fact,
the major notion of three-way decisions, i.e., the division of universal set into three regions is
borrowed from rough sets.
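For concreteness, a small Python sketch of this construction is given below (the information table and concept are hypothetical; this is not the authors' code). It builds the equivalence classes from an information table and derives the three regions of Eqs (5)–(7).

```python
from collections import defaultdict

def equivalence_classes(table, attributes):
    """Group objects that share the same values on all given attributes."""
    classes = defaultdict(set)
    for obj, values in table.items():
        classes[tuple(values[a] for a in attributes)].add(obj)
    return list(classes.values())

def three_regions(table, attributes, concept):
    """Pawlak positive, negative and boundary regions of a concept."""
    pos, neg, bnd = set(), set(), set()
    for eq in equivalence_classes(table, attributes):
        if eq <= concept:            # [x] contained in C -> accept
            pos |= eq
        elif not (eq & concept):     # [x] disjoint from C -> reject
            neg |= eq
        else:                        # partial overlap     -> defer
            bnd |= eq
    return pos, neg, bnd

# Hypothetical information table: object -> {attribute: value}.
table = {"O1": {"seq_len": "short"}, "O2": {"seq_len": "short"},
         "O3": {"seq_len": "long"},  "O4": {"seq_len": "long"}}
concept = {"O1", "O3", "O4"}
print(three_regions(table, ["seq_len"], concept))
```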
The deferment decision option which is exercised based on the boundary region is useful in
at least two aspects. Firstly, it provides hints for seeking and incorporating anticipated future
information in the decision making model for making decisions on the deferred cases. It is
hoped that as information matures, the number of deferred cases will reduce thereby leading
to more precise decisions. Secondly, for the deferred cases, which are typically associated with high
levels of uncertainty and therefore have no obvious immediate decisions, the deferment option
may help avoid some false decisions. The former aspect is of particular interest from the
protein functions classification perspective. We further elaborate on this in the next section.
Using the above equivalence classes, we can compute the positive, negative and boundary
regions using Eqs (5)–(7). The three regions are given by,
$$POS(C) = \emptyset, \quad NEG(C) = \emptyset, \quad BND(C) = \{O_1, O_2, O_3, O_4, O_5, O_6, O_7, O_8\}. \quad (9)$$
In the same way we can compute the three regions at time instances t1 and t2, when we have
additional information in the form of “interacting proteins” and “No. of Domains”. Table 2
summarizes the three regions corresponding to the information available at the three instances
of time. Looking at the three regions for the different time instances, we may note that objects
in the boundary region are decreasing and are becoming part of the positive or negative
regions as more information is available at time instances of t1 and t2. In other words, the addi-
tional information about the “interacting proteins” and “No. of Domains” of the objects has
increased the size of the positive and negative regions. This means that we can make more
decisions in the form of acceptance or rejection when the level of available information
increases. In this article, we argue that this property of three-way decision making can be quite
useful for making decisions on protein functions classification.
In order to see the same phenomenon visually, we include Fig 3. In each sub-figure, the
green, red and orange colours represent the positive, negative and boundary regions, respectively.
The circle represents a certain concept and the small rectangles depict the equivalence classes.
Fig 3(a) contains the least information, Fig 3(c) the most, and Fig 3(b) a moderate level of
information. We may note that as the information matures, we have finer-level details leading
to refined partitions. This is shown by the smaller-sized boxes in Fig 3(b) and 3(c). The
finer-level details due to additional information enable us to move some of the equivalence
classes from the boundary to either the positive or the negative region, thereby increasing
their respective sizes. This leads to fine tuning of the positive and negative regions, and we
gradually converge to the concept, i.e., the circle (Fig 3(c)).
Fig 3. The three regions with evolving information. The sub-figures from left to right should be read as a, b and c, respectively.
doi:10.1371/journal.pone.0171702.g003
where P(C|[x]) denotes the conditional probability of a concept C with an equivalence class
[x]. Given that an object x ∈ [x], the conditional probability highlights the evaluation of an
object x to be in C. The three rough set regions based on lower and upper approximations are
defined as,
$$POS_{(\alpha,\beta)}(C) = \{x \in U \mid P(C \mid [x]) \geq \alpha\}, \quad (12)$$

$$NEG_{(\alpha,\beta)}(C) = \{x \in U \mid P(C \mid [x]) \leq \beta\}, \quad (13)$$

$$BND_{(\alpha,\beta)}(C) = \{x \in U \mid \beta < P(C \mid [x]) < \alpha\}. \quad (14)$$
The POS(α,β)(C), NEG(α,β)(C) and BND(α,β)(C) in Eqs (12)–(14) are referred to as positive, neg-
ative and boundary regions, respectively. Based on how these thresholds are determined and
interpreted we have different probabilistic rough set models.
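A minimal sketch of Eqs (12)–(14) in Python follows, assuming that the conditional probabilities P(C|[x]) have already been estimated; the values below are hypothetical.

```python
def probabilistic_regions(cond_prob, alpha, beta):
    """Tripartition of the universe from P(C|[x]) and thresholds (alpha, beta)."""
    assert 0 <= beta < alpha <= 1.0
    pos = {x for x, p in cond_prob.items() if p >= alpha}
    neg = {x for x, p in cond_prob.items() if p <= beta}
    bnd = {x for x, p in cond_prob.items() if beta < p < alpha}
    return pos, neg, bnd

# Hypothetical P(C|[x]) values for five equivalence classes.
cond_prob = {"E1": 0.95, "E2": 0.70, "E3": 0.50, "E4": 0.20, "E5": 0.05}
print(probabilistic_regions(cond_prob, alpha=0.8, beta=0.1))
```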
To demonstrate the use of three-way decisions for protein functions classification, we
focus on two probabilistic rough set models, namely, GTRS [25–27] and ITRS [59]. These two
models have at least two advantages over other models.
• Firstly, compared to some of the earlier probabilistic models, such as, 0.5-probabilistic rough
set model and (0.5, β) model, where due to restricted pairs of thresholds, the determination
and interpretation of thresholds are ignored, the GTRS and ITRS allow for investigation
and examination of thresholds based on different aspects.
• Secondly, unlike other models that require user intervention to set the thresholds, such as,
decision-theoretic rough sets and variable precision rough sets, the GTRS and ITRS can be
used to learn and set the thresholds automatically when combined with some typical search
mechanism [25].
For the sake of completeness, we briefly explain and discuss the GTRS and ITRS models.
Three-way decisions using game-theoretic rough sets. The game-theoretic rough sets or
GTRS utilizes game-theoretic formulation to determine thresholds of probabilistic rough sets
[26, 27]. In particular, the thresholds are interpreted based on a trade-off solution between
multiple criteria employed in a game setting for analyzing rough sets [25–27]. A typical game
in GTRS has three essential components, i.e., game players, strategies and payoff or utility
functions. These components are generally represented as a tuple {P, S, u}. We now explain
each of them.
Game players: The players in the game are denoted by a set P. Generally, there can be n
players in a game. However, for the sake of simplicity, a two player game is commonly consid-
ered in GTRS. Based on the overall game objective and goals, we may have different types of
game players. For instance, in a previous game for analyzing region uncertainty, the players
were defined as the uncertainties of the immediate and deferred decision regions, and in another
game that sought a balanced rough set model, the players of accuracy and generality were
used [25, 60]. In general, the players in the game are selected to reflect the overall purpose of
the game. In GTRS, the players are defined as different aspects and properties of rough sets
based classification and decision making such as accuracy, generality, precision and
uncertainty.
Strategies: Each player in the game participates by playing different strategies. The set of
strategies available to player i is denoted by Si. The Cartesian product of all possible strategy
sets is denoted by S = S1 × S2 × . . . × Sn, where S contains ordered tuples of the form (s1, s2, . . .,
sn) such that s1 ∈ S1, s2 ∈ S2, . . ., sn ∈ Sn. Each ordered tuple in S is called a strategy profile and
represents a certain situation encountered in a game.
The strategies in GTRS are realized as different changes and modifications in the (α, β)
thresholds. Depending on the initial values of thresholds, we may have different types of strate-
gies. For instance, if the initial values of (α, β) are set to (1, 0.5), then the strategies may be for-
mulated as decreasing levels of α and β. Alternatively, when the initial values of (α, β) are set to
(1, 0), then the strategies may be formulated as decreasing levels of α and increasing levels of β.
Please note that in order to keep the regions disjoint, it is assumed that 0 ≤ β < α ≤ 1.0. The
strategies of the players in a game lead to effective modification of the thresholds which ulti-
mately determines the final configuration of the thresholds.
Payoff functions: The payoff functions for the players are represented by a set u = (u1, . . .,
un). Each ui is a real-valued utility function for player i and it maps the strategy profiles to real
values, i.e., ui: S → ℝ. In particular, the payoffs reflect the utilities of performing or selecting a
certain strategy. Recall that the game players in GTRS represent different aspects or
properties of rough sets; the payoff function for a certain player is based on the particular measure
employed for evaluating its respective property.
In a game setting, every player wants to perform a strategy that will maximize its payoff.
The selected strategies of the players, however, affect their opponents' payoffs. A game solution
is used to choose a balanced trade-off point based on the utilities of all the players.
The game solution of Nash equilibrium is commonly used in GTRS for this purpose.
Consider a strategy profile s−i = (s1, s2, . . ., si−1, si+1, . . ., sn), i.e., a strategy profile without
player i's strategy. Moreover, the strategy profile (s1, s2, . . ., sn) may be denoted in
revised notation as (si, s−i). The strategy profile (s1, s2, . . ., sn) = (si, s−i) is a Nash equilibrium if
[61],

$$\forall i,\ \forall s_i' \in S_i:\ u_i(s_i, s_{-i}) \geq u_i(s_i', s_{-i}), \quad \text{where } s_i' \neq s_i. \quad (15)$$

This means that for every player i, its strategy si is the best response to s−i. In
other words, a strategy profile constitutes a Nash equilibrium when no player benefits from
changing his strategy alone.
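The following sketch illustrates the idea behind Eq (15) for a game with finite strategy sets by checking unilateral deviations; the strategies and payoff values are hypothetical and do not correspond to any particular GTRS formulation.

```python
import itertools

def nash_equilibria(strategies, payoff):
    """Return pure-strategy Nash equilibria of an n-player game.

    strategies: list of strategy lists, one per player.
    payoff: function (player_index, strategy_profile) -> real-valued payoff.
    """
    equilibria = []
    for profile in itertools.product(*strategies):
        stable = True
        for i, s_i in enumerate(profile):
            for alt in strategies[i]:
                if alt == s_i:
                    continue
                deviation = profile[:i] + (alt,) + profile[i + 1:]
                if payoff(i, deviation) > payoff(i, profile):
                    stable = False  # player i gains by deviating unilaterally
                    break
            if not stable:
                break
        if stable:
            equilibria.append(profile)
    return equilibria

# Hypothetical two-player game: strategies are changes to the thresholds.
strategies = [["alpha-0.05", "alpha-0.10"], ["beta+0.05", "beta+0.10"]]
payoffs = {  # (s1, s2) -> (u1, u2), made-up utility values
    ("alpha-0.05", "beta+0.05"): (0.70, 0.60),
    ("alpha-0.05", "beta+0.10"): (0.65, 0.72),
    ("alpha-0.10", "beta+0.05"): (0.80, 0.55),
    ("alpha-0.10", "beta+0.10"): (0.75, 0.68),
}
print(nash_equilibria(strategies, lambda i, prof: payoffs[prof][i]))
# [('alpha-0.10', 'beta+0.10')]
```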
The above game description is used in GTRS to formulate a game. However, with a single,
non-repeated game, we may not be able to reach effective thresholds that fulfill the demands of
the underlying application, so the game needs to be repeated. The essential idea is to repeatedly
modify and refine the thresholds until certain performance criteria are achieved. By formulating
a game and utilizing notions such as the game solution and repetitive games, GTRS seeks an
effective configuration of the threshold levels that are employed in the probabilistic rough sets
framework to induce three-way decisions.
Three-way decisions using information-theoretic rough sets. The Information-theo-
retic rough sets (or ITRS) approach the threshold determination issue from the viewpoint of
minimizing the information uncertainty of the probabilistic rough set regions [59]. Let ΔP(α,
β), ΔN(α, β) and ΔB(α, β) denote the overall uncertainties of the probabilistic positive, negative
and boundary regions, respectively. The ITRS is based on the minimization of the following problem,

$$(\alpha, \beta) = \arg\min_{(\alpha, \beta)} \big( \Delta_P(\alpha, \beta) + \Delta_N(\alpha, \beta) + \Delta_B(\alpha, \beta) \big). \quad (16)$$
Please note that we use slightly modified notations compared to those reported in [25]. Eq (16)
suggests that we seek thresholds (α, β) that minimize the uncertainty of the three regions.
The overall uncertainty in Eq (16) is typically considered as an average uncertainty of the
three regions [59],

$$\Delta_P(\alpha, \beta) = P(POS_{(\alpha,\beta)}(C))\,\delta_P(\alpha, \beta), \quad \Delta_N(\alpha, \beta) = P(NEG_{(\alpha,\beta)}(C))\,\delta_N(\alpha, \beta), \quad \Delta_B(\alpha, \beta) = P(BND_{(\alpha,\beta)}(C))\,\delta_B(\alpha, \beta), \quad (17)$$

where δP(α, β), δN(α, β) and δB(α, β) are the uncertainties of the three regions, which may
be computed and interpreted using different measures of uncertainty. Moreover,
P(POS(α,β)(C)), P(NEG(α,β)(C)) and P(BND(α,β)(C)) are the probabilities of the three regions. Two
measures, i.e., Shannon entropy and the Gini coefficient, have previously been employed for
interpreting and measuring the uncertainties of the three regions, i.e., δP(α, β), δN(α, β) and
δB(α, β). We now define each of them.
Consider a partition based on a concept C, given by, πC = {C, Cc} and another partition with
respect to the thresholds (α, β), given by, π(α,β) = {POS(α,β)(C), NEG(α,β)(C), BND(α,β)(C)}. The
uncertainty in πC with respect to the three probabilistic regions based on Shannon entropy is given, for a region R ∈ π(α,β), by

$$H(\pi_C \mid R) = -P(C \mid R)\log P(C \mid R) - P(C^c \mid R)\log P(C^c \mid R),$$

with δP(α, β), δN(α, β) and δB(α, β) obtained by taking R as the positive, negative and boundary region, respectively. Alternatively, when the Gini coefficient is used as the uncertainty measure, the uncertainties of the negative and boundary regions are given by

$$\delta_N(\alpha, \beta) = G(\pi_C \mid NEG_{(\alpha,\beta)}(C)) = 1 - P(C \mid NEG_{(\alpha,\beta)}(C))^2 - P(C^c \mid NEG_{(\alpha,\beta)}(C))^2, \quad (24)$$

$$\delta_B(\alpha, \beta) = G(\pi_C \mid BND_{(\alpha,\beta)}(C)) = 1 - P(C \mid BND_{(\alpha,\beta)}(C))^2 - P(C^c \mid BND_{(\alpha,\beta)}(C))^2, \quad (25)$$

and δP(α, β) = G(πC | POS(α,β)(C)) is defined analogously for the positive region.
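To illustrate how these quantities can be computed in practice, the sketch below (not the authors' code) derives the region probabilities and region uncertainties from estimated values of P(C|[x]) and P([x]), using either Shannon entropy or the Gini coefficient, and combines them into the overall uncertainty of Eq (16); all probability values in the example are hypothetical.

```python
import math

def overall_uncertainty(cond_prob, weights, alpha, beta, measure="entropy"):
    """Probability-weighted uncertainty of the three regions (Eq 16).

    cond_prob: {eq_class: P(C|[x])}; weights: {eq_class: P([x])}.
    """
    regions = {"POS": [], "NEG": [], "BND": []}
    for x, p in cond_prob.items():
        if p >= alpha:
            regions["POS"].append(x)
        elif p <= beta:
            regions["NEG"].append(x)
        else:
            regions["BND"].append(x)

    def uncertainty(p_c):
        q = 1.0 - p_c
        if measure == "gini":
            return 1.0 - p_c ** 2 - q ** 2                            # Gini
        return -sum(v * math.log2(v) for v in (p_c, q) if v > 0)      # Shannon

    total = 0.0
    for members in regions.values():
        p_region = sum(weights[x] for x in members)
        if p_region == 0:
            continue
        # P(C|region) as a probability-weighted average of P(C|[x]).
        p_c_region = sum(weights[x] * cond_prob[x] for x in members) / p_region
        total += p_region * uncertainty(p_c_region)
    return total

# Hypothetical equivalence classes with their P(C|[x]) and P([x]) values.
cond_prob = {"E1": 0.9, "E2": 0.6, "E3": 0.3, "E4": 0.1}
weights = {"E1": 0.4, "E2": 0.2, "E3": 0.2, "E4": 0.2}
print(overall_uncertainty(cond_prob, weights, alpha=0.8, beta=0.2, measure="gini"))
```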
5: BND(α′,β′)(C) = {x ∈ BND(α,β)(C) | β′ < P(C|[x]) < α′}
6: POS(α,β)(C) = POS(α′,β′)(C) ∪ POS(α,β)(C)
7: NEG(α,β)(C) = NEG(α′,β′)(C) ∪ NEG(α,β)(C)
8: BND(α,β)(C) = BND(α′,β′)(C)
9: else
10: Determine thresholds (α, β) using GTRS or ITRS.
11: POS(α,β)(C) = {x ∈ U | P(C|[x]) ≥ α}
12: NEG(α,β)(C) = {x ∈ U | P(C|[x]) ≤ β}
13: BND(α,β)(C) = {x ∈ U | β < P(C|[x]) < α}
14: end if
15: return POS(α,β)(C), NEG(α,β)(C), BND(α,β)(C)
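A rough Python sketch of the update step in lines 5–8 is given below, assuming that new thresholds (α′, β′) have already been obtained with GTRS or ITRS and that P(C|[x]) has been re-estimated on the reduced table containing only the previously deferred objects; this is an illustration of the idea rather than the authors' implementation.

```python
def update_regions(pos, neg, bnd, cond_prob_new, alpha_new, beta_new):
    """Re-examine only the deferred objects using the new feature information."""
    new_pos = {x for x in bnd if cond_prob_new[x] >= alpha_new}
    new_neg = {x for x in bnd if cond_prob_new[x] <= beta_new}
    new_bnd = {x for x in bnd if beta_new < cond_prob_new[x] < alpha_new}
    # Objects already accepted or rejected keep their decisions (lines 6-8).
    return pos | new_pos, neg | new_neg, new_bnd

# Hypothetical previous regions and re-estimated P(C|[x]) for deferred objects.
pos, neg, bnd = {"O1"}, {"O5"}, {"O2", "O3", "O4"}
cond_prob_new = {"O2": 0.9, "O3": 0.5, "O4": 0.15}
print(update_regions(pos, neg, bnd, cond_prob_new, alpha_new=0.8, beta_new=0.2))
```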
The algorithm accepts an information table containing information about a new feature and
the three regions based on the previous features, i.e., positive, negative and boundary regions
denoted as POS(α,β)(C), NEG(α,β)(C), and BND(α,β)(C), respectively. In line 1, the algorithm
evaluates the positive and negative regions by employing some quality criteria denoted as
QPOS(α,β) and QNEG(α,β) (representing some quality related aspect of the positive and negative
regions, respectively). These notations are introduced to represent the general notion of any
criterion that is employed for evaluating the three regions; the quality of the regions may be
measured based on notions such as cost, risk, uncertainty, accuracy or precision. As discussed in
the previous subsection titled Three-way Decisions and Evolving Information, when the fea-
tures evolve, the positive region gradually converges to the concept C (i.e., more precisely
reflects the region representing the concept) and the negative region gradually converges to the
complement of the concept Cc (i.e., more precisely reflects the region not in the concept),
respectively. As a result, the quality of the two regions improves. As improvement in quality is
a gradual process in this case, at the current level of information, the quality of the positive and
negative regions may or may not be effective (please note that the term effective here may
have different interpretations based on the underlying application). We deal with these two
cases separately.
If the quality of the regions is above some acceptable levels c1 and c2, we will only examine
the objects in the boundary region and will not further investigate the positive and negative
regions. The boundary is expected to shrink further as we have access to new features. In any
other case, we will examine the full information table to obtain the three regions. In other
words, we are not satisfied with the quality of the positive and negative regions (they are below
the levels c1 and c2) and we expect that additional information may improve their respective
quality levels. We first deal with the former case. In line 3, we determine new thresholds (α′, β′)
based on the reduced information table with U = BND(α,β)(C). As new information becomes available in
the form of a new feature, we may be able to confidently classify further objects in the bound-
ary. This is shown in lines 4–7, where we further divide the objects in the boundary region. In
lines 6–8, we update the three regions based on this further examination of the boundary. In lines
10–13, we examine the case when the positive and negative regions based on the previous
knowledge were not of acceptable quality. We therefore examine the full information table and
update the three regions accordingly. Fig 4 represents the essential ideas of Algorithm 1 in
diagrammatic form.
It may be noted that the constants c1 and c2 may be defined in different ways depending on
the application needs and requirements. For instance, if we want to reduce the processing
overhead, we may define them moderately. On the other hand, if processing overhead is not an
issue and we are more concerned about accuracy, then we may define them more strictly.
Other ways in which they may be defined are by making comparison with the quality of the
regions obtained with the standard Pawlak models or other known models in the domains or
by considering the improvement in quality based on the new features.
Experiment setup
Data preparation
To evaluate the use of the three-way approach, we examine the application of three-way decisions
on the well-studied Saccharomyces cerevisiae species proteins [63, 64], obtained from the widely
used UniProt database [29]. Among the various classification schemes developed to standardize the
descriptions of protein functions, we chose the state-of-the-art Gene Ontology (GO) [30]
classification scheme. The gene ontology is a structured, controlled vocabulary of protein functions
also called terms. GO terms provide consistency in annotating protein roles in the cellular con-
text. It is arranged in a DAG (please refer to Section Background) structure in which each
node of the graph represents a unique functional term and each term is arranged in a parent
child relationship with other terms. The child term either is a special case of the parent or is a
part of the parent process i.e., a sub-process or component. For the evaluation of our method-
ology we operate on molecular function category of gene ontology. To reveal the evolving
nature of biological information, we present features in the order in which they are evolved
over the time i.e., most basic type of information is presented first and so on [3]. For classifying
a protein into one or more molecular function terms of gene ontology, we retrieve ten different
types of features from varied biological databases. Each feature is helpful in characterizing one
or more functional categories and is represented by the symbol Fi.
Protein Sequence Length (F1): In every cell, genes are converted into proteins via the pro-
cesses of transcription and translation also called the central dogma of molecular biology. The
end product of these processes is a sequence built from twenty amino acids, and is commonly
known as the primary structure of a protein. The amino acid sequence is the most basic type of
information available about a protein, as it can provide concrete evidence about different char-
acteristics of a protein such as its binding sites, sub-cellular localization, structure and
function. To quantify these biological aspects of a protein, we use the length of the protein
sequence, extracted from the UniProt database [29], as a feature (namely F1).
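As a simple illustration, F1 can be computed directly from a FASTA export of the protein sequences; the sketch below uses Biopython and a hypothetical local file name.

```python
from Bio import SeqIO

def sequence_lengths(fasta_path):
    """F1: length of the amino acid sequence for each protein in a FASTA file."""
    return {record.id: len(record.seq)
            for record in SeqIO.parse(fasta_path, "fasta")}

# Hypothetical FASTA file exported from UniProt for S. cerevisiae proteins.
lengths = sequence_lengths("yeast_proteins.fasta")
```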
Protein Localizations (F2): The location of a protein in the cell can also be associated with
its function. Co-localized proteins are more likely to be part of the same molecular activity. Like-
wise, proteins localizing in many different locations can be part of diverse activities. To capture
this aspect, we calculate feature F2 as the number of locations a protein can localize to. The pro-
tein localization data is retrieved from the Uniprot database [29].
Biological Processes (F3): A biological process refers to the series of events performed by
one or more assemblies of molecular functions with a defined beginning and end. A protein
participating in many biological processes is more likely to have many molecular level roles.
Thus the number of biological processes of a protein can also be used to capture the molecular
level activities of a protein. As a third feature (F3), we count the number of biological processes
of a protein. It is obtained by retrieving counts of Biological Process ontology terms from the
Gene Ontology database [30].
Number of Interacting Proteins (F4): For calculating the fourth feature of our method,
namely F4, we use genome-wide protein-protein interaction (PPI) data to predict protein
function. In a living cell, protein-protein interactions are amongst the most ubiquitous types
of interactions and their precise knowledge helps in understanding the activities performed by
a protein as well as the processes it is part of. A protein having many different interacting part-
ners can be said to be part of many different functions. Thus the number of interactions (F4) of
a protein can be linked to the wide variety of activities it performs. We obtain PPI data from
most widely used PPI databases, namely, IntAct [65] and STRING [66]. Since protein-protein
interactions databases are noisy, we only consider interactions that are experimentally verified
and are supported by at least two experiments.
Number of Domains (F5): Protein domains are the sequential units that fold in a particular
shape, making independent structures in different proteins. Several classification schemes
have been proposed, e.g., [67], to define and demarcate different domains, of which some are based
on clustering conserved subsequences into related domain families and others on known distinct
structural classes [68]. One of the most famous and widely used domain classification schemes
is the InterPro database [69]. The InterPro database contains diagnostic signatures of protein
sequences consisting of models, e.g., regular expression models, Hidden Markov Models, etc.,
which describe protein domains found within a sequence. Domains are among the most important
sequence features of a protein that associate it with a particular kind of function. To integrate
domain relevance, we also use as a feature (namely F5) the number of InterPro conserved domains
within a query protein sequence.
Number of Conserved Motifs (F6): A motif is a conserved amino acid sequence pattern in
a protein sequence that may be associated with a specific function. These subsequences may
often contain small “gaps” of fixed or variable lengths among amino acids of the subsequence.
The knowledge of exact patterns of motifs and their functions is helpful in the understanding
of structure and function of related proteins in which such motifs may appear. For example, if
a motif of a certain family is present in a protein sequence, it becomes highly probable that the
protein can be functionally associated with the functions of that motif, i.e., we can associate
proteins with functions by merely checking the presence of certain motifs. Thus, in our technique,
as the sixth feature (F6) we count the number of conserved motifs in a protein sequence using the
PROSITE motif database [70].
Number of Protein Structures (F7): A protein’s primary structure consists of a sequence of
amino acids. These amino acids, due to their varied physical and chemical properties as well as
the presence of different participating cellular forces, assume a unique configuration in three-
dimensional space. This stable configuration of a protein is also called the tertiary structure of
proteins. This final configuration or structure of a protein is strongly correlated to its function,
because in many biological processes, the interacting proteins have to come into physical con-
tact in order to accomplish the desired function. The structure of a protein also determines
many of its functional characteristics, for example its inter-facial binding sites, the specific
ligands it binds to, cellular localizations, as well as other proteins it can interact with. Among
all the structural databases PDB (Protein Data Bank) [71] is by far the most reliable, wide-rang-
ing as well as popular repository for experimentally derived protein 3D structures. We query
the PDB database to obtain the number of experimentally determined structures associated with
a protein under investigation and use this information as a feature (namely F7) to characterize
its function.
Molecular Weight of Protein (F8): Although the weight of a protein is not strongly related to
its function, in some cases it can be used to group proteins into broader functional categories.
We retrieve the molecular weight rounded to the nearest mass unit (Dalton) from the UniProt
database [29] and use it as a feature (namely F8) for our 3-way classifier.
Number of Interfacing Residues in Protein Structure (F9): Many proteins bind together
and form multi-protein complexes. Different proteins in the complex perform different func-
tions. These functions are associated with the number of residues on a protein’s interface that
enables it to stabilize, bind and form complexes. Owing to the significance of interfacing residues,
we utilize a structural feature, i.e., the number of residues on the protein’s interface, to
characterize the function of a protein. The interfacing residues can vary for various functional
activities. To capture this aspect, we use the PDBe PISA server [72] to retrieve the number of
predicted interfacing residues and use it as a feature (namely F9) for our 3-way classifier.
Binding sites in the Predicted Interface (F10): A protein’s physical interaction with other
molecules determines its biological activities. For example, antibody proteins selectively bind
to viruses or bacteria to mark them for destruction, the hexokinase protein binds to an ATP
molecule as well as a glucose molecule in order to catalyze their chemical reaction, and so
on. Without any doubt, almost all proteins stick, or bind, to other molecules in order to per-
form their activities at molecular level. Some proteins bind very tightly while others bind for a
short period of time depending on their specificity as well as the molecular task they have to
perform. Each protein can usually bind to one or few other molecules determined by the
nature of the binding residues (also called binding sites) at its surface. To determine the specificity
of a protein for binding and performing a wide variety of functions, we calculate a feature
(namely F10), which is the number of binding sites on its surface, predicted using the
PDBeFold server [73].
The above features, namely F1 to F10, are extracted using the Feature Extractor module
(already described in Section An Architecture of Protein Function Classification with
Three-way Decisions), from the world wide biological databases using the knowledge discov-
ery module. The Feature Extractor module also has the capability to incorporate any new fea-
ture, say F11, in the predictive task. To imitate the ever evolving nature of biological
information, we selected and ranked the features from the most basic type to the latest type, i.e.,
F1, the protein sequence length, is the most basic type of feature, and F10, the number of binding
sites on a protein interface, is a specific feature that became known after the information evolved.
The three approaches with GTRS are based on different games that are formulated based
on the description in Section Three-way Decisions using GTRS. The essential difference between
these games is the consideration of different types of game players. Two of these games are based
on examining a balance between the uncertainties of probabilistic rough set regions. These
games are based on two players, namely, immediate decision region, denoted as I and deferred
decision region, denoted as D. The player I reflects the collective uncertainty in probabilistic
positive and negative regions and the player D denotes the uncertainty in the probabilistic
boundary region. By realizing changes in thresholds as game strategies, the players compete by
selecting appropriate changes in the thresholds, which are used in determining the final settings
of the thresholds. Two games are constructed with these game players, i.e., players I and D, by
realizing different interpretations and computations of uncertainty. In one game, the uncertainty
is measured with the Shannon entropy and in the other game it is measured with the Gini
coefficient. These two games will be referred to as GTRSE and GTRSG, respectively. These games
were previously examined in the context of text categorization and medical
decision making [25, 62, 74]. The third game in GTRS is based on determining a trade-off
between two aspects of rough sets based classification, namely, accuracy and generality. This
game was previously examined in the context of recommender systems in [60]. We will refer
to this game as GTRS(A,G).
Two approaches are considered with the ITRS. These two approaches are ITRS based on
Shannon entropy and ITRS based on Gini coefficient as discussed in Section Three-way Deci-
sions using ITRS. We denote these approaches as ITRSE and ITRSG, respectively. Both of
these measures interpret the uncertainty in a different way and therefore will lead to different
thresholds.
In all experiments, we considered the top five most frequent protein functions in the data-
base. For each protein function (recall that each protein function is considered as a category),
we learned the probabilistic thresholds (α, β) and performed three-way decisions using the five
approaches discussed above in this section. We considered four feature sets. In each feature
set, we consider the features whose relevant information was previously available or which
emerged roughly at the same time. In particular, the first feature set comprise of F1, F2 and F3
(please refer to Section Data Preparation for their details). We denote the first feature set as
FS1. The second feature set denoted as FS2, is given by FS1 [ F4. The third and fourth feature
sets, denoted as FS3 and FS4 are given by FS3 = FS2 [ {F5, F6} and FS4 = FS3 [ {F7, F8, F9, F10},
respectively. Please be noted that the FS1 contains the oldest available information about pro-
teins while FS4 is the represents the most recent information comprising the previous knowl-
edge and newly evolved information. Finally, all the results are based on 10 folds cross
validation.
$$Generality(\alpha, \beta) = \frac{|POS_{(\alpha,\beta)}(C) \cup NEG_{(\alpha,\beta)}(C)|}{|U|}. \quad (27)$$
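A small helper sketch for the two measures is shown below; it assumes, as is common in the three-way setting, that accuracy is computed only over the objects that receive an accept or reject decision, and the region sets in the example are hypothetical.

```python
def generality(pos, neg, universe):
    """Fraction of objects for which an immediate decision is made (Eq 27)."""
    return len(pos | neg) / len(universe)

def accuracy(pos, neg, concept):
    """Fraction of correct decisions among the accepted/rejected objects."""
    correct = len(pos & concept) + len(neg - concept)
    decided = len(pos | neg)
    return correct / decided if decided else 0.0

# Hypothetical regions and concept.
universe = {"P1", "P2", "P3", "P4", "P5"}
pos, neg = {"P1", "P2"}, {"P4"}
concept = {"P1", "P2", "P3"}
print(generality(pos, neg, universe), accuracy(pos, neg, concept))
```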
The accuracy highlights the relative number of correct classification decisions for the objects
in the universal set and the generality reflects the relative number of objects for which classifi-
cation decisions can be made. Table 3 shows the results obtained with the GTRS based
approaches. The rows of the table correspond to the results obtained with a particular set of
features and the columns correspond to results of accuracy and generality for different
approaches. The best results for accuracy and generality for each approach are presented in
bold. We may note that the best results for generality for the three approaches are obtained with
the largest feature set. Moreover, the generality of the three approaches improves as the
feature set size is increased. In particular, the generality of GTRS(A,G) with the smallest feature set
is 23.77% and with the largest feature set is 68.75%. This represents a total increase of 44.98% in
generality. For the other two approaches, i.e., GTRSE and GTRSG, similar increases in generality,
of 45.75% and 39.78%, are noted between the smallest and the largest feature sets. Since the
features represent the available level of information for predicting protein functions, we may
conclude from these results that as the level of information
improves (i.e., as we include more features), we are able to make classification decisions for
more proteins.
Let us now look at the results of accuracy in Table 3. We may observe that in general, the
values of accuracy decrease slightly as we move from lower to higher feature set sizes. How-
ever, compared to the generality, we do not see a significant difference between these values. For
the three approaches, i.e., GTRS(A,G), GTRSE and GTRSG, the differences between the values of
accuracy for the lowest and highest feature set sizes are 2.16%, 1.41% and 1.62%, respectively.
From the results of accuracy and generality, we may notice that by increasing the number of
features or the level of information, we are able to make more decisions while maintaining the
same or similar level of accuracy.
Table 4 shows the results obtained with the ITRS based approaches. The increase in general-
ity for the two approaches, i.e., ITRSE and ITRSG between the lowest feature set size and high-
est feature set sizes are 14.36% and 18.29%, respectively. The accuracy values for the ITRSE and
ITRSG approaches decrease by a small 3.37% and 2.74%, respectively, as we increase the
feature set size. Comparing these results with the GTRS based approaches, we may note that
the generality values of the ITRS approaches are significantly better than those obtained with
the GTRS based approaches. However, for accuracy there is no significant difference between the
ITRS and GTRS based approaches, as both of them are around 80%. Despite some differences
in the results with the two approaches, the key observation noted earlier in the discussion of
GTRS based results holds for the results in the case of ITRS as well. We may notice again that
increasing the number of features or the level of information leads to better generality (which
implies more classification decisions) while maintaining the same or similar level of accuracy.
In order to highlight this observation, we constructed two figures.
Fig 5 shows the results of the positive, negative and boundary regions based on the GTRS
and ITRS based approaches. Each bar in the figure is split into three parts, representing the
positive, negative and boundary regions respectively. Each set of four bars corresponds to a
particular approach and is separated by a large space. The four bars are placed in increasing
order of feature set sizes. In each set of four bars, the leftmost bar corresponds to the least fea-
ture set size and the rightmost bar corresponds to the highest feature set size. We may note in
Fig 5, that as we increase the feature set sizes, the positive and negative regions grow in size
while the boundary region shrinks. According to the definition of generality in Eq (27), the
union of the positive and negative regions represents the generality. This figure highlights the
same fact, noted earlier in the previous discussion, i.e., we are able to make more classification
decisions for proteins as the level of information increases (or number of features increases).
Please note that in probabilistic rough sets, it is not always necessary that the addition of
features will increase the positive and negative regions. However, we want to emphasize the
fact that it will result in improvement in the quality of the regions.
Figs 6 and 7 summarize the results of accuracy and generality for the considered
approaches. The green colour in these figures represents the accuracy and the red colour
indicates the generality. The values of accuracy and generality are reported for all four feature sets
described in the previous section. It may be noted that for all the approaches, the generality
improves as we use a higher number of features, while the accuracy is not affected significantly.
This means that by increasing the features, we are able to improve the generality while
maintaining a similar level of accuracy.
Table 5. Comparison of the proposed three-way classification method with top-performing methods
in the field. The target classes comprise broader gene ontology terms for Saccharomyces cerevisiae
species proteins.
Method’s Name | Generality | Accuracy
Three-way decision using GTRS | 68% | 78.40%
Three-way decision using ITRS | 74% | 79.2%
INGA (Interaction Network GO Annotator) tool [75] | 60% | 57%
Jones-UCL [76] | 62% | 59.5%
Argot [76] | 61% | 59.4%
BLAST Annotation Transfer (baseline method) [76] | 78% | 38%
doi:10.1371/journal.pone.0171702.t005
method outperforms INGA in both aspects. The strength of our method mainly comes from
the fact that it is able to defer instances (i.e., proteins) for which there is less characterization
evidence at present, thus improving prediction accuracy.
In order to get further insight into the relative performance, we also consider some results
that were reported on a similar problem. The most recent and well-known schemes proposed for
the prediction of protein functions are evaluated in the CAFA (Critical Assessment of protein
Function Annotation) challenge [76]. The CAFA challenge is conducted every two years for a
comparative evaluation of the top schemes for the prediction of protein functions. One of
the best sequence alignment algorithms (i.e., BLAST), which is also used as a baseline scheme
for annotation transfer, achieved an accuracy of 38% during the CAFA challenge when tried
on the molecular function category of GO [76]. Likewise, the top schemes of the challenge were
reported to have accuracies of 59.5% and 59.4%, respectively [76], when tested against
heterogeneous ontology classes. On the other hand, our method achieved an overall accuracy of
80% when tried on the same target classes, depicting a significant gain in terms of prediction
accuracy.
Another, more recently proposed method, by Mitrofanova et al. [8], combines
inter-species homology data for protein function prediction and reported an accuracy of
97.7% when tried on Saccharomyces cerevisiae proteins. However, the results of this method
cannot be directly compared with our approach for a number of reasons. Firstly, this method
operates on fixed ontology sizes thus giving results for only 16 GO terms (target classes) out of
more than 30,000 GO terms. Secondly, the fixed GO terms chosen by the authors limit the
pertinence of their method to proteins directly annotated with those GO terms, restricting the
applicability of their algorithm to only a small number of proteins (hence, a significant reduction
in generality). On the other hand, our algorithm has much wider GO coverage and results pre-
sented include all the yeast proteins.
As a final remark, it is pertinent to mention that although three-way classification achieved
far better results than the earlier proposed schemes, the main purpose of this study was not to
optimize their performance (in terms of precision or accuracy). Rather, the study should be
viewed as an examination of the feasibility and appropriateness of three-way classification
schemes, based on evolving biological information, for the task of protein function prediction.
In conclusion, the three-way approaches considered in this study achieve an average accuracy
of 80%. The incorporation of future information is useful as it improves the generality, or
applicability, of the models while maintaining a similar level of accuracy. In particular,
generality increases by more than 40% for the GTRS based approaches and by more than 14%
for the ITRS based approaches. The general trend in the results suggests that as more
information becomes available, the generality may improve further. These results advocate
for the use of three-way approaches for protein function classification.
Conclusion
Proteins are involved in almost every biological phenomenon, and precise knowledge of their
functions plays an essential role in understanding biological processes. Intelligent mechanisms
are generally employed to assign and predict protein functions. Technological advancements
continuously yield new information and features describing protein functions, which in turn
can be used to improve the quality of protein function predictions. An important issue in this
context is to develop effective classification schemes and models that classify protein
functions by incorporating this evolving information. We propose a three-way decision making
approach to address this issue. The approach includes a deferment decision option which is
exercised in situations characterized by insufficient and incomplete information. In particular,
we considered probabilistic rough set based models, i.e., game-theoretic rough sets and
information-theoretic rough sets, for inducing and making three-way decisions. An architecture
for protein function classification with three-way decisions is also proposed and explained.
Experimental results on a dataset from the UniProt database indicate that as the level of
biological information increases, the number of deferred cases is reduced while a similar level
of accuracy is maintained. In particular, an average accuracy of 80% (±2%) was reported for
the considered approaches, with an average generality improvement of 33% (±5%) as the
number of features increases.
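The deferment option itself reduces to a simple thresholded rule. The sketch below is an assumed illustration, not the exact implementation used in the experiments: it accepts a function for a protein when the conditional probability reaches a threshold alpha, rejects it when the probability falls to beta, and defers otherwise; the values 0.75 and 0.25 and the function name three_way_decision are illustrative. GTRS and ITRS differ only in how the (alpha, beta) pair is determined.

# Minimal sketch of the probabilistic three-way decision rule (illustrative
# thresholds; in GTRS/ITRS alpha and beta are computed from the data, e.g.,
# by a game-theoretic equilibrium or by minimizing uncertainty).
def three_way_decision(prob, alpha=0.75, beta=0.25):
    """Classify a protein-function pair from its conditional probability Pr(C|x)."""
    if prob >= alpha:
        return "accept"   # positive region: assign the function
    if prob <= beta:
        return "reject"   # negative region: rule the function out
    return "defer"        # boundary region: wait for more biological evidence

for p in (0.9, 0.5, 0.1):
    print(p, "->", three_way_decision(p))   # accept, defer, reject

As more features become available, the estimated probabilities for deferred proteins move toward the thresholds, which is why the boundary region shrinks while accuracy over the definite decisions remains comparable.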
We investigated probabilistic rough sets, which are one possible way of inducing three-way
decisions. Other approaches, such as shadowed sets, statistical testing, interval sets and
ortho-pairs, may also be examined to investigate the potential benefits of the three-way
approach to protein function classification. Moreover, the three-way approach to protein
function classification may be further evaluated and extended by incorporating new features
from next generation sequencing data or other high-throughput experiments.
Supporting information
S1 File. “S1_File.zip”. The code (Python/Bash/Matlab) and data files along with instructions
are provided as a zip file.
(ZIP)
Acknowledgments
The third author was partially supported by a Discovery Grant from NSERC Canada. We are
also thankful to the Higher Education Commission of Pakistan for providing a student grant
under the SRGP (Startup Research Grant Program) initiative. There was no additional external
funding received for this study.
Author Contributions
Conceptualization: HU NA JY AB.
Formal analysis: HU NA JY AB.
Investigation: HU NA JY AB.
Methodology: HU NA JY AB.
References
1. Alberts B, Johnson A, Lewis J, Raff M, Roberts K, Walter P. Molecular Biology of the Cell, 5th edition.
Anderson M, Granum S, editors. Garland Science; 2008.
2. Benso A, Di Carlo S, Ur Rehman H, Politano G, Savino A, Suravajhala P. A Combined Approach for
Genome Wide Protein Function Annotation/Prediction. PROTEOME SCIENCE. 2013;Vol. 11(No.
S1):1–12. doi: 10.1186/1477-5956-11-S1-S1
3. Pandey G, Kumar V, Steinbach M. Computational Approaches for Protein Function Prediction: A Sur-
vey; 2006.
4. Jiang Y, Ronnen Oron T, Ur Rehman H, et al. An expanded evaluation of protein function prediction
methods shows an improvement in accuracy. Genome Biology. 2016; 17(184). doi: 10.1186/s13059-
016-1037-6 PMID: 27604469
5. Ur Rehman H, Zafar U, Benso A, Islam N. A Structure based Approach for Accurate Prediction of Pro-
tein Interactions Networks. In: Proceedings of the 9th International Joint Conference on Biomedical
Engineering Systems and Technologies. vol. 3. BIOINFORMATICS 2016. ScitePress; 2016. p. 237–
244.
6. Benso A, Di Carlo S, Ur Rehman H, Politano G, Savino A, Vasciaveo A. Using Boolean networks to
model post-transcriptional regulation in gene regulatory networks. Journal of Computational Science.
2014; 5(3):332–344. doi: 10.1016/j.jocs.2013.10.005
7. Benso A, Di Carlo S, Ur Rehman H, Politano G, Savino A, Vasciaveo A. Accounting for post-transcrip-
tional regulation in boolean networks based regulatory models. International Work-Conference on Bio-
informatics and Biomedical Engineering, IWBBIO 2013. Copicentro Editorial; 2013. p. 397–404.
8. Mitrofanova A, Pavlovic V, Mishra B. Prediction of Protein Functions with Gene Ontology and Interspe-
cies Protein Homology Data. IEEE/ACM Transactions on Computational Biology and Bioinformatics.
2011; 8 no. 3:775–784. doi: 10.1109/TCBB.2010.15 PMID: 21393654
9. Ur Rehman H, Benso A, Di Carlo S, Politano G, Savino A, Suravajhala P. Combining Homolog and
Motif Similarity Data with Gene Ontology Relationships for Protein Function Prediction. In: Bioinformat-
ics and Biomedicine (BIBM), 2012 IEEE International Conference; 2012. p. 1–4.
10. Zhang H, Chen Z, Liu Z, Zhu Y, Wu C. Location Prediction Based on Transition Probability Matrices
Constructing from Sequential Rules for Spatial-Temporal K-Anonymity Dataset. PloS one. 2016; 11(8):
e0160629. doi: 10.1371/journal.pone.0160629 PMID: 27508502
11. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic local alignment search tool (BLAST). Journal
of Molecular Biology. 1990; 215:403–410. doi: 10.1016/S0022-2836(05)80360-2
12. Marcotte EM, Pellegrini M, Ng HL, Rice DW, Yeates TO, Eisenberg D. Detecting protein function and
protein-protein interactions from genome sequences. Science. 1999; 285:751–753. doi: 10.1126/
science.285.5428.751 PMID: 10427000
13. Pearson WR, Lipman DJ. Improved tools for biological sequence comparison. Proceedings of the
National Academy of Sciences of the United States of America. 1988; 85(8):2444–8. doi: 10.1073/pnas.
85.8.2444 PMID: 3162770
14. Watson J, Laskowski R, Thornton J. Predicting protein function from sequence and structural data. Cur-
rent Opinion in Structural Biology. 2005; 15:275–284. doi: 10.1016/j.sbi.2005.04.003 PMID: 15963890
15. Gaudet P, Livstone MS, Lewis SE, Thomas PD. Phylogenetic-based propagation of functional annota-
tions within the Gene Ontology consortium. Briefings in Bioinformatics. 2011; 12:449–462. doi: 10.
1093/bib/bbr042 PMID: 21873635
16. Pellegrini M, Marcotte EM, Thompson MJ, Eisenberg D, Yeates TO. Assigning protein functions by comparative
genome analysis: protein phylogenetic profiles. Proceedings of the National Academy of Sciences of the United
States of America. 1999; 96(8):4285–4288.
17. Deng M, Zhang K, Mehta S, Chen T, Sun F. Prediction of protein function using protein-protein interac-
tion data. Journal of Computational Biology. 2003; 10:947–960. doi: 10.1089/106652703322756168
PMID: 14980019
18. Letovsky S, Kasif S. Predicting protein function from protein-protein interaction data: a probabilistic
approach. Bioinformatics. 2003; 19(suppl. 1):i197–i204. doi: 10.1093/bioinformatics/btg1026 PMID:
12855458
19. Nabieva E, Jim K, Agarwal A, Chazelle B, Singh M. Whole-proteome prediction of protein function via
graph-theoretic analysis of interaction maps. Bioinformatics. 2005; 21((suppl. 1)):i302–i310. doi: 10.
1093/bioinformatics/bti1054 PMID: 15961472
20. Vazquez A, Flammini A, Maritan A, Vespignani A. Global protein function prediction from protein-protein
interaction networks. Nature Biotechnology. 2003; 21:697–700. doi: 10.1038/nbt825 PMID: 12740586
21. Laskowski RA, Watson JD, Thornton JM. Protein function prediction using local 3D templates. Journal
of Molecular Biology. 2005; 351:614–626. doi: 10.1016/j.jmb.2005.05.067 PMID: 16019027
22. Pal D, Eisenberg D. Inference of protein function from protein structure. Structure. 2005; 13:121–130.
doi: 10.1016/j.str.2004.10.015 PMID: 15642267
23. Pazos F, Sternberg MJ. Automated prediction of protein function and detection of functional sites from
structure. Proceedings of the National Academy of Sciences of the United States of America. 2004;
101:14754–14759. doi: 10.1073/pnas.0404569101 PMID: 15456910
24. Yao YY. Probabilistic rough set approximations. International Journal of Approximate Reasoning. 2008;
49(2):255–271. doi: 10.1016/j.ijar.2007.05.019
25. Azam N, Yao JT. Analyzing Uncertainties of Probabilistic Rough Set Regions with Game-theoretic
Rough Sets. International journal of approximate reasoning. 2014; 55(1):142–155. doi: 10.1016/j.ijar.
2013.03.015
26. Herbert JP, Yao JT. Game-theoretic Rough Sets. Fundamenta Informaticae. 2011; 108(3-4):267–286.
27. Yao JT, Herbert JP. A Game-theoretic Perspective on Rough Set Analysis. Journal of Chongqing Uni-
versity of Posts and Telecommunications (Natural Science Edition). 2008; 20(3):291–298.
28. Deng XF, Yao YY. A Multifaceted Analysis of Probabilistic Three-way Decisions. Fundamenta Informa-
ticae. 2014; 132:291–313.
29. The UniProt Consortium. UniProt: a hub for protein information. Nucleic Acids Research. 2015; 43:D204–
D212. doi: 10.1093/nar/gku989 PMID: 25348405
30. The Gene Ontology Consortium. Gene Ontology Consortium: going forward. Nucleic Acids
Research. 2015; 43(Database Issue):D1049–D1056. doi: 10.1093/nar/gku1179 PMID: 25428369
31. Enault F, Suhre K, Claverie JM. Phydbac “Gene Function Predictor”: a gene annotation tool based on
genomic context analysis. BMC Bioinformatics. 2005; 6(1):247. doi: 10.1186/1471-2105-6-247
32. Pellegrini M, Marcotte E, Thompson M, Eisenberg D, Yeates T. Assigning protein functions by compar-
ative genome analysis: Protein phylogenetic profiles. Proceedings of the National Academy of Sci-
ences. 1999; 96(8):4285–4288. doi: 10.1073/pnas.96.8.4285
33. The NCBI handbook [Internet] Bethesda (MD): National Library of Medicine (US), National Center for
Biotechnology Information. Chapter 18, The Reference Sequence (RefSeq) Project.; 2002.
34. Fields S, Song O. A Novel Genetic System to Detect Protein-Protein Interactions. Nature. 1989;
340:245–246. doi: 10.1038/340245a0 PMID: 2547163
35. Bauer A, Kuster B. Affinity Purification-Mass Spectrometry. European Journal of Biochemistry. 2003;
270:570–578. doi: 10.1046/j.1432-1033.2003.03428.x PMID: 12581197
36. Ur Rehman H, Benso A, Di Carlo S, Politano G, Savino A, Suravajhala P. Using genome wide data for
protein function prediction by exploiting gene ontology relationships. IEEE Press; 2012. p. 497–502.
37. Liang DC, Liu D, Kobina A. Three-way group decisions with decision-theoretic rough sets. Information
Sciences. 2016; 345:46–64. doi: 10.1016/j.ins.2016.01.065
38. Peters J, Ramanna S. Proximal three-way decisions: theory and applications in social networks. Knowl-
edge-Based Systems. 2016; 91:4–15. doi: 10.1016/j.knosys.2015.07.021
39. Yao YY. Rough Sets and Three-Way Decisions. In: Proceedings of 10th International Conference on
Rough Sets and Knowledge Technology (RSKT’15), Lecture Notes in Computer Science 9436; 2015.
p. 62–73.
40. Baram Y. Partial Classification: The Benefit of Deferred Decision. IEEE Transactions on Pattern Analy-
sis and Machine Intelligence. 1998; 20(8):769–776. doi: 10.1109/34.709564
41. Goudey R. Do statistical inferences allowing three alternative decisions give better feedback for
environmentally precautionary decision-making? Journal of Environmental Management. 2007; 85
(2):338–344. doi: 10.1016/j.jenvman.2006.10.012 PMID: 17129664
42. Pauker SG, Kassirer JP. The threshold approach to clinical decision making. The New England Journal
of Medicine. 1980; 302(20):1109–1117. doi: 10.1056/NEJM198005153022003 PMID: 7366635
43. Sherif M, Hovland CI. Social judgment: Assimilation and contrast effects in communication and attitude
change. 1961.
44. Tversky A, Shafir E. Choice under conflict: The dynamics of deferred decision. Psychological science.
1992; 3(6):358–361. doi: 10.1111/j.1467-9280.1992.tb00047.x
45. Yao YY. An Outline of a Theory of Three-way Decisions. In: Proceedings of Rough Sets and Current
Trends in Computing (RSCTC’12), Lecture Notes in Computer Science 7413; 2012. p. 1–17.
46. Yao JT, Zhang Y. A scientometrics study of rough sets in three decades. In: Proceedings of 8th Interna-
tional Conference on Rough Sets and Knowledge Technology (RSKT’13), Lecture Notes in Computer
Science 8171; 2013. p. 28–40.
47. Deng XF, Yao YY. Decision-theoretic three-way approximations of fuzzy sets. Information Sciences.
2014; 279:702–715. doi: 10.1016/j.ins.2014.04.022
48. Liang DC, Liu D, Pedrycz W, Hu P. Triangular fuzzy decision-theoretic rough sets. International Journal
of Approximate Reasoning. 2013; 54(8):1087–1106. doi: 10.1016/j.ijar.2013.03.014
49. Liang DC, Liu D. Systematic studies on three-way decisions with interval-valued decision-theoretic
rough sets. Information Sciences. 2014; 276:186–203. doi: 10.1016/j.ins.2014.02.054
50. Liang DC, Xu Z, Liu D. Three-way decisions with intuitionistic fuzzy decision-theoretic rough sets based
on point operators. Information Sciences. 2017; 375:183–201. doi: 10.1016/j.ins.2016.09.039
51. Liu D, Li TR, Li HX. Interval-valued decision-theoretic rough sets. Computer Science. 2012; 39(7):178–
181.
52. Pedrycz W. Shadowed sets: representing and processing fuzzy sets. IEEE Transactions on Systems,
Man, and Cybernetics, Part B: Cybernetics. 1998; 28(1):103–109. doi: 10.1109/3477.658584
53. Wald A. Sequential tests of statistical hypotheses. The Annals of Mathematical Statistics. 1945; 16
(2):117–186. doi: 10.1214/aoms/1177731118
54. Pawlak Z. Rough Sets. International Journal of Computer and Information Sciences. 1982; 11:241–
256. doi: 10.1007/BF01001956
55. Pawlak Z. Rough sets: theoretical aspects of reasoning about data. Kluwer Academic; 1991.
56. Yao YY, Greco S, Slowinski R. Probabilistic rough sets. In: Handbook of Computational Intelligence,
Projektorganisation und Management im Software Engineering; 2015. p. 315–339.
57. Yao YY, Wong SKM, Lingras P. A decision-theoretic rough set model. Methodologies for Intelligent
Systems. 1990; 35:17–24.
58. Yao YY. Two Semantic Issues in a Probabilistic Rough Set Model. Fundamenta Informaticae. 2011;
108(3-4):249–265.
59. Deng XF, Yao YY. An Information-theoretic interpretation of thresholds in probabilistic rough sets. In:
Proceedings of Rough Sets and Current Trends in Computing (RSCTC’12), Lecture Notes in Computer
Science 7413; 2012. p. 232–241.
60. Azam N, Yao JT. Game-theoretic rough sets for recommender systems. Knowledge-Based Systems.
2014; 72:96–107. doi: 10.1016/j.knosys.2014.08.030
61. Leyton-Brown K, Shoham Y. Essentials of Game Theory: A Concise Multidisciplinary Introduction. Mor-
gan & Claypool Publishers; 2008.
62. Zhang Y. Optimizing Gini coefficient of probabilistic rough set regions using Game-Theoretic Rough
Sets. In: Proceedings of 26th IEEE Canadian Conference on Electrical and Computer Engineering
(CCECE’13); 2013. p. 699–702.
63. Giaever G, Chu AM, Ni L, Connelly C, Riles L, Veronneau S, et al. Functional profiling of the Saccharo-
myces cerevisiae genome. Nature. 2002; 418(6896):387–391. doi: 10.1038/nature00935 PMID:
12140549
64. DiCarlo JE, Norville JE, Mali P, Rios X, Aach J, Church GM. Genome engineering in Saccharomyces
cerevisiae using CRISPR-Cas systems. Nucleic acids research. 2013; p. gkt135.
65. Kerrien S, et al. The IntAct molecular interaction database in 2012. Nucleic Acids Research. 2012; 40
(Database issue):D841–D846. doi: 10.1093/nar/gkr1088 PMID: 22121220
66. Jensen L, Kuhn M, Stark M, et al. STRING 8–a global view on proteins and their functional interactions
in 630 organisms. Nucleic Acids Research. 2009; 37(Database Issue):D412–6. doi: 10.1093/nar/
gkn760 PMID: 18940858
67. Mulder NJ, et al. New developments in the InterPro database. Nucleic Acids Research. 2007; 35:D224–
D228. doi: 10.1093/nar/gkl841 PMID: 17202162
68. Murzin AG, et al. SCOP: a structural classification of proteins database for the investigation of
sequences and structures. Journal of Molecular Biology. 1995; 247:536–540. doi: 10.1016/S0022-2836
(05)80134-2 PMID: 7723011
69. Mitchell A, Chang HY, Daugherty L, et al. InterPro protein families database: the classification resource
after 15 years. Nucleic Acids Research. 2015; 43(Database Issue):D213–21. doi: 10.1093/nar/gku1243
PMID: 25428371
70. Hulo N, Bairoch A, Bulliard V, Cerutti L, et al. The PROSITE Database. Nucleic Acids Research. 2006;
34:D227–230. doi: 10.1093/nar/gkj063 PMID: 16381852
71. Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, et al. The Protein Databank.
Nucleic Acids Research. 2000; 28:235–242. doi: 10.1093/nar/28.1.235 PMID: 10592235
72. Krissinel E, Henrick K. Inference of macromolecular assemblies from crystalline state. Journal of Molec-
ular Biology. 2007; 372:774–797. doi: 10.1016/j.jmb.2007.05.022 PMID: 17681537
73. Krissinel E, Henrick K. Secondary-structure matching (SSM), a new tool for fast protein structure align-
ment in three dimensions. Acta Crystallographica Section D. 2004; D60:2256–2268. doi: 10.1107/
S0907444904026460 PMID: 15572779
74. Yao JT, Azam N. Three-way Decision Making in Web-based Medical Decision Support Systems with
Game-theoretic Rough Sets. IEEE Transactions on Fuzzy Systems. 2014; 23(1):3–15. doi: 10.1109/
TFUZZ.2014.2360548
75. Piovesan D, Giollo M, Leonardi E, Ferrari C, Tosatto SCE. INGA: protein function prediction combining
interaction networks, domain assignments and sequence similarity. Nucleic Acids Research. 2015; 43
(W1):W134–40. doi: 10.1093/nar/gkv523 PMID: 26019177
76. Radivojac P, Clark WT, Oron TR, Schnoes AM, et al. A large-scale evaluation of computational protein func-
tion prediction. Nature Methods. 2013; 10(3):221–227. doi: 10.1038/nmeth.2340 PMID: 23353650