
Halder and Dias Soeiro Cordeiro. J Cheminform (2021) 13:29. Journal of Cheminformatics
https://doi.org/10.1186/s13321-021-00508-0

SOFTWARE (Open Access)

QSAR-Co-X: an open source toolkit for multitarget QSAR modelling

Amit Kumar Halder* and M. Natália Dias Soeiro Cordeiro*

Abstract
Quantitative structure-activity relationships (QSAR) modelling is a well-known computational tool, often used in a wide variety of applications. Yet one of the major drawbacks of conventional QSAR modelling is that models are set up based on a limited number of experimental and/or theoretical conditions. To overcome this, the so-called multitasking or multitarget QSAR (mt-QSAR) approaches have emerged as new computational tools able to integrate diverse chemical and biological data into a single model equation, thus extending and improving the reliability of this type of modelling. We have developed QSAR-Co-X, an open source Python-based toolkit (available to download at https://github.com/ncordeirfcup/QSAR-Co-X) for supporting mt-QSAR modelling following the Box-Jenkins moving average approach. The new toolkit embodies several functionalities for dataset selection and curation plus computation of descriptors, for setting up linear and non-linear models, as well as for a comprehensive analysis of results. The workflow within this toolkit is guided by a cohort of multiple statistical parameters and graphical outputs for assessing both the predictivity and the robustness of the derived mt-QSAR models. To demonstrate the functionalities of the designed toolkit, four case studies pertaining to previously reported datasets are examined here. We believe that this new toolkit, along with our previously launched QSAR-Co code, will significantly contribute to making mt-QSAR modelling widely and routinely applicable.

Keywords: QSAR, Multitarget models, Software tools, Feature selection, Machine learning

Introduction

Quantitative Structure-Activity Relationships (QSAR) modelling is one of the most frequently employed in silico techniques for chemical data mining and analysis. Though QSAR was introduced more than 50 years ago, it remains an efficient technique for building mathematical models to find out the crucial structural requirements for targeting specific response variables (e.g., activity, toxicity, physicochemical properties, etc.). At the same time, QSAR provides one of the most effective strategies for predicting properties of new chemicals and also for identifying potential hits through virtual screening of chemical libraries [1, 2]. The last few decades have witnessed several transformations in the field of QSAR modelling, owing to the progress in model development strategies, data mining techniques, validation methodologies, along with machine learning and statistical analysis tools [3, 4]. Nevertheless, the quest for new modelling strategies is still ongoing to further improve the overall efficacy of QSAR modelling [1, 5, 6]. For example, one of the major limitations of conventional QSAR is that models are developed for the response variable(s), regardless of the experimental (or theoretical) conditions followed to obtain such response variable(s). In reality, however, researchers come across data-points pertaining to various experimental and/or theoretical conditions, the inclusion of which may significantly improve the scope of QSAR modelling. This has paved the way to unconventional computational modelling approaches, so-called multitasking, or multitarget QSAR (mt-QSAR), which

*Correspondence: [email protected]; [email protected]
LAQV@REQUIMTE/Faculty of Sciences, University of Porto, 4169-007 Porto, Portugal

© The Author(s) 2021. This article is licensed under a Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/).

are able to integrate data under different conditions into a single model equation for simultaneous prediction of the targeted response variable(s) [7–9]. Therefore, the interest of QSAR practitioners in such mt-modelling has been growing steadily [1, 5]. In particular, mt-QSAR modelling techniques based on the Box-Jenkins moving average approach have already proved to be highly efficient in dealing with datasets pertaining to multiple conditions [10–14]. Our group has recently developed an open source standalone software, "QSAR-Co" (https://sites.google.com/view/qsar-co) [15], to set up classification-based QSAR models. Briefly, QSAR-Co enables users to set up linear or non-linear classification models, by resorting to the Genetic Algorithm based Linear Discriminant Analysis (GA-LDA) [16, 17] or to the Random Forests (RF) [18] classifier, respectively. In our experience so far, mt-QSAR modelling is highly sensitive to the strategies used for model development, especially because the number of starting descriptors increases depending on the number of experimental (and/or theoretical) conditions. The possibility of employing a larger range of development strategies will definitely improve the usefulness and scope of such mt-QSAR modelling. The present work moves a step forward and describes a new toolkit named QSAR-Co-X, which, apart from supporting the development of multitarget QSAR models based on the Box-Jenkins moving average approach, allows the usage of various descriptor generation schemes, along with several model development strategies, feature selection algorithms and machine learning tools, as well as model selection and validation methodologies. As will be seen, the QSAR-Co-X software implements a number of additional utilities that render a much more compact and well-designed platform for multitarget QSAR modelling, following the principles of QSAR modelling recommended by the OECD (Organisation for Economic Co-operation and Development) [19]. The major differences between these two software tools are listed and commented on in Table 1.

As can be seen, two additional feature selection techniques were included for establishing LDA models, namely fast-stepwise (FS) and sequential forward selection (SFS). Even though the GA implemented earlier in QSAR-Co has proved to be a highly efficient feature selection technique, judging from our previous analyses [11, 20], the implementation of these additional feature selection techniques in QSAR-Co-X improves the scope of LDA modelling in multiple ways. Firstly, the application of more feature selection techniques enhances the chances of obtaining more predictive models, especially for big data analysis [21]. Secondly, the GA selection involves the random generation of an initial population, which usually requires several runs to produce the most statistically significant (or optimised) model. Also, due to this randomisation step, the models generated by GA-LDA lack reproducibility. As such, both the FS and SFS techniques are more straightforward and reproducible, allowing the swift establishment of linear discriminant models. Finally, the simultaneous application of GA with the two newly implemented feature selection algorithms can help find a greater number of LDA models, thereby increasing the possibility of consensus modelling. Additionally, QSAR-Co-X provides significant modifications as far as strategies for the development of non-linear models are concerned. First of all, it comprises a toolkit for building non-linear models by resorting to six different machine learning (ML) algorithms. One of its modules assists in tuning the hyperparameters of such ML tools (a feature not included in QSAR-Co [15]) for achieving optimised models. As an alternative, a separate module is available for setting up user-specific parameters, meant for rapid development of non-linear models. Like QSAR-Co, model development in QSAR-Co-X is guided by descriptor pre-treatment, two-stage external validation, and determination of the applicability domain of linear and non-linear models. Still, the QSAR-Co-X toolkit offers additional options for calculating the modified descriptors using different types of Box-Jenkins moving average operators. It also provides a modified Y-based randomisation method [15], so-called Yc-randomisation, to check the robustness of the derived linear model. The derived model may also be used for 'condition-wise prediction', in which the user may check its predictivity for each experimental/theoretical condition. The relevance of all these new utilities implemented in the toolkit is exemplified with four case studies.

Implementation
QSAR-Co-X version 1.0.0 is an open source standalone toolkit developed using Python 3 [22]. It can be downloaded freely from https://github.com/ncordeirfcup/QSAR-Co-X. The manual provided along with the toolkit describes its operating procedures in detail. The QSAR-Co-X toolkit comprises four modules, namely: (i) LM (linear modelling); (ii) NLG (non-linear modelling with grid search); (iii) NLU (non-linear modelling with user-specific parameters); and (iv) CWP (condition-wise prediction). Details about the functionalities of each of these modules are described below.

Module 1 (LM)
This module assists in dataset division, the calculation of deviation descriptors from input descriptors using the Box-Jenkins scheme, and data pre-treatment. Along with these, the module comprises two feature selection

Table 1 Major differences between QSAR-Co and QSAR-Co-X

| No | Utility | QSAR-Co | QSAR-Co-X | Remarks |
|----|---------|---------|-----------|---------|
| 1 | Feature selection | One (GA) | Two (FS and SFS) | – |
| 2 | Reproducibility of linear modelling | Low | High | Given the same sample size and number of descriptors, GA produces different LDA models on different runs, whereas both FS and SFS always yield the same model |
| 3 | Diagnosis of intercollinearity among variables | Not available | Available and automatically performed | Very helpful for ascertaining the robustness of the derived linear models |
| 4 | Dataset division options | Random, Kennard-Stone, Euclidean-based | Random, pre-defined, k-MCA | Since only the random division option is fast, the other QSAR-Co options were replaced to reduce computational time |
| 5 | Automatic generation of the validation set | Not available | Available | Unlike QSAR-Co, QSAR-Co-X allows generating both the screening and validation sets |
| 6 | Statistical parameters for the validation set | Manual calculation required | Automatic calculation | Automatic calculation allows fast selection of the models |
| 7 | Number of Box-Jenkins operators available | One (pre-defined) | Four (three pre-defined and one user-specific) | Additional and more flexible operators were added to QSAR-Co-X |
| 8 | Yc-randomisation | Not available | Available | A modified form of the Y-randomisation technique that incorporates the influence of experimental elements |
| 9 | Machine-learning tools | One (RF only) | Six (kNN, SVM, RF, NB, GB and MLP) | QSAR-Co-X affords several non-linear modelling tools |
| 10 | Number of parameters that may be altered in RF modelling | 5 | 8 | QSAR-Co-X offers more flexibility for setting up RF models |
| 11 | Comparative analysis of multiple ML methods | Not possible | Possible | Useful to decide which ML method performs best |
| 12 | Hyperparameter tuning options for ML methods | Not available | Available | Extremely useful to find optimised non-linear models |
| 13 | User-specific parameter settings for building non-linear models | For RF only | For kNN, SVM, RF, NB, GB and MLP | – |
| 14 | Display of ROC plots (linear modelling) | For sub-training and test sets | For sub-training, test and validation sets | – |
| 15 | Condition-wise prediction | Not available | Available | Useful to understand how the developed model performs against individual experimental conditions, particularly for large datasets |
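Row 8 of Table 1 refers to Yc-randomisation, in which the experimental condition elements are scrambled together with the response variable before the models are re-derived. The sketch below illustrates only the scrambling step, assuming a pandas DataFrame with hypothetical column names; the toolkit's own ycr.py handles the full procedure, including model re-derivation.

```python
import numpy as np
import pandas as pd

def yc_randomise(df, response_col, condition_cols, n_runs=10, seed=0):
    """Sketch of Yc-randomisation: return n_runs copies of the data in which
    the response column and each experimental-condition column are shuffled,
    while the structural descriptor columns stay untouched."""
    rng = np.random.default_rng(seed)
    randomised = []
    for _ in range(n_runs):
        out = df.copy()
        for col in [response_col, *condition_cols]:
            # one independent permutation per column (an assumption; the exact
            # pairing of the response and condition shuffles is not specified here)
            out[col] = rng.permutation(df[col].to_numpy())
        randomised.append(out)
    return randomised

# toy data: one response, one condition element, one descriptor
df = pd.DataFrame({"y": [1, 0, 1, 0],
                   "cond": ["a", "a", "b", "b"],
                   "D1": [0.1, 0.2, 0.3, 0.4]})
runs = yc_randomise(df, "y", ["cond"], n_runs=3)
```

Each shuffled copy would then be passed back through the Box-Jenkins descriptor calculation and model fitting, and the averaged statistics compared against those of the original model.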

algorithms for the development and validation of the LDA models (see the screenshot in Fig. 1). The following six-step procedure is adopted for establishing the linear models.

Step 1 - Dataset division
The first step of any mt-QSAR model encompasses a division of the initial dataset into a training and a validation set. In this module, that may be performed following three schemes, namely: (a) pre-determined data distribution, (b) random division and (c) k-means cluster analysis (kMCA) based data division [20]. In the first scheme (a), the user is allowed to explicitly provide information about the training and validation set samples, i.e., the set samples are to be tagged as 'Train' and 'Test', respectively. This is extremely important when the user intends to compare the models developed using QSAR-Co-X with a model previously derived from any other in silico tool on a specific data distribution. In the second scheme (b), a random division of the dataset is obtained on the basis of a user-specified percentage of validation set data-points. At the same time, different training and validation sets may be obtained by changing the random seed values. As an alternative to random data-splitting, the user may opt for a k-Means Cluster Analysis-based rational dataset division strategy (kMCA) [20, 23]. In the latter option, the

Fig. 1 Screenshot of the Module 1 graphical user interface of the QSAR-Co-X toolkit

dataset is first divided into n (user-specified) clusters on the basis of the input descriptors. Subsequently, a specific number of validation set samples are randomly collected from each cluster. Similar to the random division scheme, the ratio between the training and validation sets may be varied and, simultaneously, different combinations of these sets obtained by changing the random seed value. The Python code KMCA.py included in the toolkit allows performing the kMCA-based dataset division.

Step 2 - Box-Jenkins moving average approach
The most important part of the current mt-QSAR modelling is the calculation of the deviation descriptors from the input descriptors, following the Box-Jenkins moving average approach. The input descriptors can be calculated using any commercial or non-commercial software package (e.g., DRAGON [24] or QuBiLS-MAS [25]), but these then have to be modified to incorporate the influence of the different experimental (and/or theoretical) elements (c_j). The mathematical details of the Box-Jenkins moving average approach have been extensively described in the past [8, 9, 26], so we will restrict ourselves to a short description highlighting only its most important aspects. There are different ways of calculating the modified descriptors by this approach, the simplest one being as follows:

\Delta(D_i)_{c_j} = D_i - \mathrm{avg}(D_i)_{c_j}    (1)

Specifically, the new descriptors \Delta(D_i)_{c_j} are calculated as the difference between the input descriptors of the active chemicals (D_i) and their averages \mathrm{avg}(D_i)_{c_j}, i.e., their arithmetic mean for a specific element of the experimental and/or theoretical conditions (ontology) c_j [8]:

\mathrm{avg}(D_i)_{c_j} = \sum_{i=1}^{n(c_j)} D_i / n(c_j)    (2)

In recent years, different forms of these modified descriptors have been suggested, depending on the conditions. For example, the descriptors may be standardised by resorting to the maximum (D_i^{max}) and minimum (D_i^{min}) values of the input descriptors [12]:

\Delta(D_i)_{c_j} = \frac{D_i - \mathrm{avg}(D_i)_{c_j}}{D_i^{max} - D_i^{min}}    (3)
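The moving-average operator above translates directly into a few lines of pandas. The sketch below assumes a single condition column and illustrative column names; it is not the toolkit's actual code.

```python
import pandas as pd

def box_jenkins_deviation(df, descriptor_cols, condition_col, active_col):
    """Eq. 1: delta(D_i)_cj = D_i - avg(D_i)_cj, where the average (Eq. 2)
    runs over the active compounds within each condition element c_j."""
    # mean of each descriptor over the actives, per condition element
    avgs = (df[df[active_col] == 1]
            .groupby(condition_col)[descriptor_cols]
            .mean())
    # subtract the condition-wise averages from every compound's descriptors
    deltas = df[descriptor_cols] - avgs.reindex(df[condition_col]).to_numpy()
    return deltas.add_prefix("delta_")

# toy example: two condition elements, two descriptors
df = pd.DataFrame({
    "cond":   ["a", "a", "b", "b"],
    "active": [1, 1, 1, 0],
    "D1":     [1.0, 3.0, 5.0, 9.0],
    "D2":     [2.0, 2.0, 4.0, 8.0],
})
out = box_jenkins_deviation(df, ["D1", "D2"], "cond", "active")
```

The standardised variants (Eqs. 3 and 4) only differ in the denominator applied to the same deviation term.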

Analogously, the elements of c_j may also be standardised, as recently proposed by Speck-Planche [27], leading to the following expression for the modified descriptors:

\Delta(D_i)_{c_j} = \frac{D_i - \mathrm{avg}(D_i)_{c_j}}{(D_i^{max} - D_i^{min})\,p(c_j)_c}    (4)

In this equation, p(c_j) represents the a priori probability of finding the data-points pertaining to particular conditions, and so p(c_j)_c may simply be obtained by dividing the number of actives in the data under a specific element of c_j, n(c_j), by the total number of data-points N (see Eq. 5). More details about this topic will be discussed within case study 3 reported in this work.

p(c_j)_c = n(c_j)/N    (5)

In the present toolkit, the user can choose one of the four methods provided (Method1-4) to compute the modified descriptors. The first three are based on Eqs. 1, 3 and 4, respectively. Note that both Method2 and Method3 do not work with invariant descriptors, which may hamper further calculations. Therefore, in these two methods, a descriptor pre-treatment is carried out to remove constant descriptors. Finally, Method4 allows the user to apply their own scheme for establishing the p(c_j) values [27, 28], and the resulting modified descriptors are then represented as follows:

\Delta(D_i)_{c_j} = \frac{D_i - \mathrm{avg}(D_i)_{c_j}}{p(c_j)_u}    (6)

where the term p(c_j)_u denotes the user-specific p(c_j), whose values should be provided as inputs. Within that context, the p(c_j) values do not always need to be calculated, since they may also be obtained from experimental and/or theoretical data. As an example, in a previous study [26], p(c_j) accounted for the degree of reliability of the experimental information, and values of 0.55, 0.75 and 1.00 were used for data-points classified as 'auto-curation', 'intermediate' and 'expert', respectively, according to the labelling of the ChEMBL database.

Similar to QSAR-Co, the current toolkit uses two stages of external validation for mt-QSAR modelling, thereby requiring two separate test sets as well. As mentioned earlier, the dataset is initially split into training and validation sets by employing pre-defined sets, random division or kMCA-based systematic division schemes. The Box-Jenkins moving average approach is then applied to calculate the modified descriptors for the training set, by selecting one of the methods described above. The training set and its corresponding modified descriptors are subsequently randomly sub-divided into a sub-training and a test set (or calibration set). Here, it is important to remark that the avg(D_i)_{c_j} values obtained from the training set are applied to calculate the modified descriptors for the validation set; thus, the latter can be regarded as the 'ideal test set', since its data-points participate neither in the model development nor in the descriptor calculation. On the other hand, the test set may be employed both as a 'calibration set' (especially for GA-LDA) and as an 'external validation set'.

Step 3 - Data pre-treatment
The user-specific data pre-treatment step of this module includes: (a) removal of highly correlated descriptors based on a user-specified correlation cut-off, and (b) removal of descriptors with low variation based on a user-specified variation cut-off. Moreover, constant descriptors fail to produce models for all feature selection procedures.

Step 4 - Linear model development
Two feature selection algorithms are used for setting up the linear discriminant analysis (LDA) models, namely: (a) fast stepwise (FS) and (b) sequential forward selection (SFS). Although many feature selection algorithms are available, the two chosen here can be highly efficient in handling mt-QSAR modelling because of their ability to generate models quickly. Both can be employed along with the GA selection available in QSAR-Co, but the latter requires many iterations to find the optimised LDA models. FS is a very popular algorithm in which the independent descriptors are included in the model stepwise, depending on a specific statistical parameter, the p-value; it has previously been employed successfully to set up mt-QSAR models [10, 26]. The usual criteria for forward selection (i.e., p-value to enter) and backward elimination (p-value to remove) are set in the present toolkit. That is, the descriptor with the lowest p-value is included first, and subsequently other descriptors are included in the model based on the lowest p-value, only if the criteria for forward selection are met. Yet, if the p-value of a descriptor already included in the model is found to be greater than the 'p-value to remove', it is eliminated from the model. The SFS algorithm adds features to an empty set until the performance of the model is no longer improved, either by the addition of another feature or because the maximum number of features is reached [29]. Similar to FS, it is a greedy search algorithm in which the best subsets of descriptors are selected stepwise, and the model performance is judged by user-specified statistical parameters, denoted as 'scoring' parameters. In the current version of QSAR-Co-X, two scoring parameters are provided, namely 'Accuracy' and 'AUROC' (see description below).

The user may develop separate models by varying these two scoring parameters in QSAR-Co-X (see case study 4 for more details). In contrast to GA, in which the generation of models is based on a randomisation process, these two feature selection algorithms for LDA are systematic and therefore faster. In this work, we resorted to the SequentialFeatureSelector tool from the mlxtend library (version 0.17.1: http://rasbt.github.io/mlxtend/) for developing the FS-/SFS-LDA models. In both, singular value decomposition (svd), recommended for data containing a large number of features, is applied within the Scikit-learn Linear Discriminant Analysis package [30, 31].

Step 5 - Model validation
The reliability and statistical significance of the models are evaluated by goodness-of-fit as well as by internal and external validation criteria. Goodness-of-fit for the sub-training set is assessed by looking at the usual p and F (Fisher's statistic) parameters along with the Wilks' lambda (λ) statistic [32]. The latter essentially measures the discriminatory power of the LDA classification models, i.e., how well they separate cases into groups. It is equal to the proportion of the total variance in the discriminant scores not explained by differences among groups, and can take values from zero (perfect discrimination) to one (no discrimination). Similar to Wilks' λ, the F-test measures how much better a complex model is, in comparison to a simpler version of the same model, in its capacity to explain the variance in the response variable [33]. All these statistical parameters are calculated with the help of the "statsmodels" ordinary least squares Python library (https://www.statsmodels.org/stable/api.html).

The overall predictivity of the models is checked by examining the confusion matrix, which includes the number of true positive (TP), true negative (TN), false positive (FP) and false negative (FN) samples. Simultaneously, based on those numbers, other statistical parameters such as the Sensitivity, Specificity, Accuracy, F1-score and the Matthews correlation coefficient (MCC) are computed for the sub-training, test and validation sets (see Eq. 7), as well as the area under the receiver operating characteristic curve (AUROC) [34–36]. Additionally, the ROC curves are automatically created for each model.

Sensitivity = TP/(TP + FN)
Specificity = TN/(TN + FP)
Accuracy = (TP + TN)/(TP + TN + FP + FN)
F1-score = 2TP/(2TP + FP + FN)
MCC = (TP·TN − FP·FN) / √[(TP + FP)(TP + FN)(TN + FP)(TN + FN)]    (7)

Apart from confirming the internal and external predictivity, the choice of the best linear model should be guided by additional criteria. For example, highly correlated descriptors in the linear model may reduce its overall significance; therefore, the degree of collinearity among its descriptors must be carefully examined. To do so, the current module automatically generates the cross-correlation matrix for the selected sub-training set descriptors. It is also important to assess the applicability domain (AD) of the derived model, i.e., the response and chemical structure space within which the model makes reliable predictions. Here, the models' AD is estimated by the standardisation approach proposed earlier by Roy et al. [37], which also allows identifying possible structural chemical outliers. The Python code for this approach is provided in the applicability.py file of the toolkit.

Step 6 - Yc-randomisation
In the previous QSAR-Co [15], the Y-randomisation scheme was implemented to judge the performance of the derived linear models. That is, following a classical scheme, the statistical quality in data description of the original linear model is compared to that of models generated upon randomly shuffling the response variable several times, based upon a user-specified 'number of runs', n. Since in Box-Jenkins based mt-QSAR modelling the experimental/theoretical condition elements participate in the determination of the modified descriptors, the Y-randomisation is slightly modified here and named Yc-randomisation, i.e., Y-randomisation with conditions. In this new scheme, along with the response variables, the experimental elements c_j are also scrambled n times, thus generating n randomised data matrices. The models are subsequently re-derived with these randomised data, and the averaged Wilks' lambda (λr) and accuracy (Accuracyr) are calculated. In a robust model, these averaged values should indicate markedly poorer performance than the original model, i.e., a considerably lower Accuracyr and a Wilks' λr far from the original λ. The Python code ycr.py implements this scheme in QSAR-Co-X.

Module 2 (NLG) - hyperparameter tuning
Module 2 assists in setting up non-linear models using a grid search based hyperparameter optimisation scheme (see Fig. 2). Six machine learning tools have so far been implemented in QSAR-Co-X, namely: (a) k-Nearest Neighbours (kNN) [38], (b) Bernoulli Naïve Bayes

(NB) classifier [39], (c) Support Vector Classifier (SVC) obtained from Module 2 can be specified in Module 3 for
[40], (d) Random Forests (RF) [18], (e) Gradient Boost- rapid obtention of the optimised models. Other utilities
ing (GB) [41], and f ) Multilayer Perceptron (MLP) neural of Module 3, such as calculation of statistics for internal
networks [42]. For all these non-linear modelling tech- and external validation, pre-treatment of data-files, and
niques, the Scikit-learn machine learning package is used making ROC curves for both the test and the validation
[30, 31]. Similarly, the data pre-treatment option may be sets, are similar to Module 2.
utilised in this module as well as in Module 3. In both
these modules, the sub-training, test and validation sets
Module 4 (CWP)−condition‑wise prediction
set up with Module 1 of QSAR-Co-X are required to be
The QSAR-Co-X toolkit includes this automated and
uploaded one after another for development of the non-
simple analysis tool that can be used for checking the
linear models.
mt-QSAR obtained results. Indeed, since the mt-QSAR
In Module 2, a range of parameters of the machines
modelling implemented in QSAR-Co-X leads to a unique
learning tools are varied to obtain the most robust and
model for datasets containing several experimental and/
predictive non-linear models, based on a n-fold (i.e.,
or theoretical conditions, one may need to assess how
user specific) cross-validation scheme using the Grid-
much the derived model is predictive to a specific con-
SearchCV of Scikit-learn [30, 31]. In this module, a
dition. Module 4 (see Fig.  2) is then to be employed to
parameter file should be provided as.csv file that includes
inspect the models’ performance against each condition,
the parameter names with their values that are required
due to different reasons. For example, if the user often
to be optimised. In https://​github.​com/​ncord​eirfc​up/​
ends up with almost equally predictive models, he/she
QSAR-​Co-X however, six such parameter files related
might select one of them on the basis of being more pre-
to the various machine learning techniques are avail-
dictive towards a particular condition of interest. Moreo-
able, namely: grid_knn.csv, grid_nb.csv, grid_svc.csv,
ver, the conditions over which the model is less predictive
grid_mlp.csv, grid_rf.csv and grid_gb.csv. The param-
may be removed to obtain more predictive and/or more
eter names and their values mentioned in these files are
significant models. Finally, experimental or theoretical
shown in Table  2 below. The files were prepared based
conditions with negligible number of cases may in addi-
upon the importance of the parameters, as well as considering our previous experience regarding the overall time requirements of the calculations. Nevertheless, the scope of this module is not limited to these parameters (and values), because users may select their own options for hyperparameter tuning by simply altering them. After selecting the best parameters, internal validation of the sub-training set is carried out by n-fold (i.e., user-specified) cross-validation, as well as external validation of both the test and validation sets. Similar to Module 1, the statistical results obtained for the non-linear models are automatically generated along with the optimised parameters, as well as ROC curves for the test and validation sets. Similar to QSAR-Co, the non-linear models' AD is determined by the confidence estimation approach [43, 44].

Module 3 (NLU)−user specific parameter settings

The functionality of Module 3 (Fig. 2) is the same as that of Module 2, i.e., development of non-linear models. However, in Module 3, the user may specify the parameter settings. Since grid search is a time-consuming (though recommended) technique, this module can be used for fast generation of non-linear models. Even after hyper-parameter tuning, the optimised parameters may be further adjusted through this module.

The less predictive experimental conditions may be identified through this analysis and, if the derived model is found less predictive towards such conditions, these may also be removed to rebuild the model. The overall workflow of this new toolkit, along with the whole of its described modules, can be seen in Fig. 3.

Results

To check as well as to demonstrate the utilities of the developed QSAR-Co-X toolkit, four case studies pertaining to previously compiled datasets [9, 11, 26, 27] are examined in this section. For all of them, both the activity cut-off values and the descriptors employed in the original publications were used here (exact details about those can be found in the original papers). The main purposes of these four chosen case studies are as follows:

Case study 1: Demonstrate how linear and non-linear mt-QSAR models may be developed with this toolkit.
Case study 2: Show how different models may be generated using the different data-splitting facilities of the toolkit.
Case study 3: Describe how models may be generated using the various available Box-Jenkins operators.
Case study 4: Perform a comparative analysis between the model development techniques of the former QSAR-Co and the new QSAR-Co-X toolkit.

Case study‑1 (CS1)

The first dataset comprises 726 inhibitors of four class I phosphoinositide 3-kinase (PI3K) enzyme isoforms (PI3K-α, -β, -γ, -δ), the activities of which have been assayed against 34 mutated or wild human cell lines [11]. The experimental conditions considered in this dataset can be expressed as an ontology of the form cj → (bt, cl, mt), i.e., corresponding to the combination of the three following elements: bt (biological enzyme target), cl (cell line) and mt (mutated or wild cell lines). Compounds with IC50/Ki/Kd values ≤ 600 nM were assigned as active, whereas the remaining data samples were considered as inactive. The dataset contained 536 active (+1) and 190 inactive (−1) compounds, and the mt-QSAR models were developed for predicting the activity of inhibitor compounds against these four isoforms of PI3K under various experimental conditions.

Linear interpretable models

The dataset was first divided into a training and validation set using a random division scheme (22% of the data taken as the validation set, seed value = 2). Subsequently, the Box-Jenkins operator (Method1, Eq. 1) was applied to produce a sub-training set (nstr = 452), a test set (nts = 114) and a validation set (nvd = 160), using a seed value of 2. The FS-LDA model was then derived with the following options: (a) correlation cut-off of 0.999, (b) variance cut-off of 0.001, (c) p-value to enter of 0.05, and (d) p-value to remove of 0.05. Meanwhile, the SFS-LDA model was built using the following: (a) correlation cut-off of 0.999, (b) variance cut-off of 0.001, (c) Floating = True, and (d) Scoring = Accuracy. For both models, a maximum of ten descriptors was allowed, the sub-training results being shown in Supplementary Information (Additional file 1: Table S1). As can be seen in Table S1, the FS-LDA model shows a higher goodness-of-fit than the SFS-LDA model.

The FS-LDA model that was developed in the first attempt depicted high inter-collinearity, with a maximum Pearson correlation coefficient (r) of 0.926 between two of its descriptors. Therefore, the maximum allowed paired-correlation coefficient was reduced to 0.90, and the rebuilt final model yielded a Wilk's λ of 0.261. Similarly, the first SFS-LDA model developed also presented a high inter-collinearity between two of its descriptors (r > 0.98). Therefore, the latter model was rebuilt by reducing the correlation cut-off to 0.95, and this revised SFS-LDA model depicted a much more satisfactory inter-collinearity

Fig. 2  Screenshots of the Module 2 (a), Module 3 (b), and Module 4 (c) graphical user interfaces from the QSAR-Co-X toolkit

among descriptors (maximum r = 0.808). The overall predictivity of the linear models is depicted in Table 3.

As can be seen, the SFS-LDA model was found to be more predictive than the FS-LDA model. The average accuracy and MCC values found for the newly developed SFS-LDA model are 94.95% and 0.873, respectively. After analysing the AD computed by the standardisation approach, in the FS-LDA model, 15 data-points of the sub-training set, 6 data-points of the test set, and 5 data-points of the validation set were found to be outliers, while in the SFS-LDA model, 43 sub-training set, 13 test set and 14 validation set samples emerged as structural outliers. Therefore, based on the results of the AD, it may be inferred that the FS-LDA model was developed with descriptors that yield a considerably smaller number of structural outliers compared to the SFS-LDA model. The ROC plots of the FS-LDA and SFS-LDA models generated with the current toolkit can be found in Supplementary Information (Additional file 1: Figure S1).

Non‑linear models

This dataset was then subjected to non-linear model development using the QSAR-Co-X toolkit. For such a purpose, the hyperparameter tuning implemented in its Module 2 was employed. Details about the corresponding optimised parameters, along with the accuracy values obtained for the sub-training, test and validation sets, can be found in Supplementary Information (Additional file 1: Table S2). It can be observed that, except for Bernoulli NB, all other machine learning tools are able to produce highly predictive mt-QSAR models. However, the RF and GB tools lead to the most significant non-linear mt-QSAR models, judging from their internal and external validation parameters (i.e., accuracy in this case; see Table 4). Although the same accuracy is obtained for the validation set, on the basis of overall predictivity, the RF model is found to be slightly superior to the GB model. Table 4 shows the overall statistical predictivity of these two models, whereas the ROC plots for the validation and test sets are depicted in Supplementary Information (Additional file 1: Figure S2). Interestingly, the external predictivity of the RF model matches exactly that of the FS-LDA model (cf. Table 3).

Finally, Module 4 of QSAR-Co-X was applied for a condition-wise prediction of the FS-LDA model, and the obtained results are listed in Table 5. Note that a similar analysis might have also been performed with any of the non-linear models. Here, it should be mentioned that the present dataset pertains to as many as 34 experimental condition elements, and from Table 5 it can be observed that not all of the latter appear in both the test and validation sets. However, owing to the high external predictivity of the model, most of these experimental elements are predicted with high accuracy values. Nevertheless, it can additionally be seen that samples pertaining to elements 18 and 24 are not only present in smaller numbers but are also poorly predicted. These samples may then be removed, or alternative models may be generated with other techniques in which the predictivities for these experimental condition elements are higher. Similarly, a 'condition-wise prediction' analysis might also be performed using the derived non-linear models with the help of the present module. The results, i.e., the output files generated for the FS-LDA, SFS-LDA, RF and GB models of CS1, are given in Additional file 2.

Table 2  Hyper-parameter tuning options available in the QSAR-Co-X toolkit

Technique     Parameters tuned (a)
RF            Bootstrap: True/False (b); Criterion: Gini, Entropy; Maximum depth: 10, 30, 50, 70, 90, 100, 200, None; Maximum features: Auto, Sqrt; Minimum samples leaf: 1, 2, 4; Minimum samples split: 2, 5, 10; Number of estimators: 50, 100, 200, 500
kNN           Number of neighbours: 1–50; Weight options: Uniform, Distance; Algorithms: Auto, Ball tree, kd_tree, brute
Bernoulli NB  Alpha: 1, 0.5, 0.1; Fit_prior: True, False
SVC           C: 0.1, 1, 10, 100, 1000; Gamma: 1, 0.1, 0.01, 0.001; Kernel: RBF, Linear, Poly, Sigmoid
MLP           Hidden layer sizes: to be specified by the user; Activation: Identity, Logistic, Tanh, Relu; Solver: SGD, Adam; Alpha: 0.0001, 0.001, 0.01, 1; Learning rate: Constant, Adaptive, Invscaling
GB            Loss: Deviance, Exponential; Learning rate: 0.01, 0.05, 0.1, 0.2; Minimum samples split: 0.1, 0.2, 0.3, 0.4, 0.5; Minimum samples leaf: 0.1, 0.2, 0.3, 0.4, 0.5; Maximum depth: 3, 5, 8; Maximum features: Log2, Sqrt; Criterion: Friedman MSE, MAE; Subsample: 0.5, 0.6, 0.8; Number of estimators: 50, 100, 200, 300

(a) For further details on these parameters, check the manual associated with the toolkit at https://github.com/ncordeirfcup/QSAR-Co-X
(b) This option is automatically selected
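To give a sense of the search space behind Table 2, the RF row can be written as a scikit-learn-style parameter dictionary; the keyword names below follow scikit-learn's RandomForestClassifier and are an assumption about how the toolkit maps the table entries, so they are for illustration only.

```python
# RF hyper-parameter grid from Table 2, expressed as a scikit-learn-style
# dictionary (keyword names assumed; see the toolkit manual for the real ones).
rf_grid = {
    "bootstrap": [True, False],
    "criterion": ["gini", "entropy"],
    "max_depth": [10, 30, 50, 70, 90, 100, 200, None],
    "max_features": ["auto", "sqrt"],
    "min_samples_leaf": [1, 2, 4],
    "min_samples_split": [2, 5, 10],
    "n_estimators": [50, 100, 200, 500],
}

# Number of candidate models a full grid search would evaluate
n_candidates = 1
for values in rf_grid.values():
    n_candidates *= len(values)
print(n_candidates)  # 2304 candidates, before multiplying by the CV folds
```

Multiplied by five cross-validation folds, this already amounts to 11,520 model fits, which illustrates why grid search is described above as time-consuming and why Module 3's user-specified settings can be the faster route.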

Fig. 3  Illustration of the overall functionalities of toolkit QSAR-Co-X

Case study‑2 (CS2)

The second case study aims at investigating the impact of data distribution during the development of mt-QSAR models. Further, it also aims to demonstrate the significance of Yc randomisation as an extra criterion for justifying the robustness of linear models. A previously collected dataset [26] is employed, which contains 46,229 datapoints describing the anti-bacterial activity against Gram-negative pathogens and in vitro safety profiles related to absorption, distribution, metabolism, elimination, and toxicity (ADMET) properties. This dataset pertains to four experimental condition elements (cj), namely: bt (biological target), me (measure of effect), ai (assay information), and tm (target mapping). Additionally, each datapoint includes a probabilistic factor pc to account for the degree of reliability of the experimental information. Each case in the data set was assigned to one out of two possible categories, namely positive (+1) or negative (−1). Cut-off values for the different measures of toxicity effects of compounds are provided in Supplementary Information (Additional file 1: Table S4).

Two different models were generated: in the first case, the probabilistic factor pc was discarded and the models were developed using 'Method1'; in the second case, the models were developed considering the influence of pc and, due to its presence, the Box-Jenkins operator based on 'Method4' (Eq. 6) was employed. For both cases, we applied the three dataset distribution methods available in QSAR-Co-X for splitting the data into the training and validation sets. In the first method (i.e., pre-defined sets), the training (75% of the data) and validation (25% of the data) sets coming from the original work were used. In the second method (i.e., random division), 25% of the data was placed in the

validation set using a random seed value of 2. In the third method (i.e., kMCA-based division), the data was divided into ten clusters and, from each of these, 25% of the data was selected as the validation set; subsequently, each training set was divided into sub-training (80%) and test (20%) sets using a random seed value of 3. For each of these data distributions, SFS-LDA models were developed using the current toolkit with the following parameters: (a) correlation cut-off of 1.0, (b) variance cut-off of 0.001, (c) maximum steps = 6, (d) Floating = True, and (e) Scoring = Accuracy. The statistical results thus gathered, as well as the ROC plots for the three derived linear models, can be found in Supplementary Information (Additional file 1: Figure S3, Tables S3 and S4). The latter plots, along with the corresponding AUROC values, allow one to infer the classification ability of the generated mt-QSAR models.

As one may observe from Additional file 1: Table S4, irrespective of the data-distribution method used, the models generated with 'Method4' display slightly better statistical parameters. This suggests that the probabilistic factor considered in the original investigation truly influences the determination of the response variable. Focusing now only on the 'Method4'-based models, the Wilk's λ values obtained for the pre-defined, random and kMCA division-based models were 0.438, 0.432 and 0.440, respectively. Such low values for the sub-training sets show that all these models display an adequate discriminatory power and a satisfactory goodness-of-fit. In addition, at first sight (Additional file 1: Table S4), there are no significant differences between these models as regards their statistical quality, indicating that, no matter which data-distribution method is considered, the quality of the linear model remains almost similar. However, after verifying the internal and external validation results, the random division-based model is seen to be the best linear mt-QSAR model. Further, the degree of collinearity among the variables of this model is not too high, the maximum correlation coefficient between two of its descriptors being 0.831. To further judge the statistical significance of this model, we applied the Yc randomisation scheme implemented in QSAR-Co-X. To do so, the response variable and experimental elements were randomised 100 times, and the resulting 100 randomised data matrices were then subjected to the same Box-Jenkins operator (i.e., 'Method4') used for generating the original model. Subsequently, 100 models were created with the randomised sub-training sets using the descriptors of the original model. The average Wilk's λ (λr) and average accuracy (Accuracyr) found for such models were 0.999 and 58.09, respectively, which, compared to those attained for the original model (i.e., 0.432 and 96.37), confirm that the latter is unique and lacks chance correlations. The results, i.e., the output files from the current toolkit, of these SFS-LDA models for CS2 are shown in Additional file 3.

Case study‑3 (CS3)

The purpose of the third case study is to disclose how the different Box-Jenkins operators may have an impact on the statistical quality of the derived models. The dataset

Table 3  Overall predictivity of the linear models produced for CS1

                    FS-LDA                      SFS-LDA
Classification (a)  Str (d)  Ts (e)  Vd (f)    Str (d)  Ts (e)  Vd (f)
TP                  332      77      110       333      77      110
TN                  102      32      36        106      33      36
FP                  9        3       8         5        2       8
FN                  9        2       6         8        2       6
Sn (%)              91.89    91.43   81.82     95.49    94.29   81.82
Sp (%)              97.36    97.47   94.83     97.65    97.47   94.83
Acc (%)             96.02    95.61   91.25     97.12    96.49   91.25
F1 score (%)        97.36    96.85   94.02     98.08    97.47   94.02
MCC (b)             0.892    0.896   0.778     0.923    0.917   0.778
AUROC (c)           0.946    0.944   0.883     0.966    0.959   0.883

The most significant results are highlighted in bold
(a) TP: True positive, TN: True negative, FP: False positive, FN: False negative, Sn: Sensitivity, Sp: Specificity, Acc: Accuracy
(b) Matthews correlation coefficient
(c) Score for the area under the receiver operating characteristic curve
(d) Sub-training set
(e) Test set
(f) Validation set

of CS3 was retrieved from a recently published work in which the toxicity of 260 pesticides was targeted by mt-QSAR modelling with artificial neural networks (ANN) [27]. The dataset comprised a total of 3610 datapoints related to four primary experimental condition elements (cj), namely: me (measure of toxicity), bs (bioindicator species), ag (assay guideline) and ep (exposure period). For detailed information about the cut-off values employed for the different measures of toxicity effects, please refer to the Supplementary Information (Additional file 1: Table S5). Further details about me, bs, ag and ep can be obtained from the original work [27]. The dataset contained 1992 toxic (+1) and 1618 nontoxic (−1) compounds. Additionally, three other experimental condition elements were taken into consideration while modelling, these being the concentration lethality (lc), target mapping (tm) and time classification (tc). The latter three may be specified as secondary experimental elements (cj2) due simply to the fact that lc, tm and tc are related to me, bs and ep, respectively. On the basis of these related primary and secondary experimental elements, three probabilistic factors were calculated in that work as follows [27]:

p(me)lc = nT(me)/NT(lc)    (8)

p(bs)tm = nT(bs)/NT(tm)    (9)

p(ep)tc = nT(ep)/NT(tc)    (10)

where nT(cj) and NT(cj2) stand for the number of training set samples, including toxic and non-toxic data points, within the primary and secondary experimental elements, respectively.

In that work, another probabilistic factor was also included, based on the following equation [27]:

p(ag) = n(ag)/NT    (11)

where NT stands for the total number of samples in the training set. Notably, this equation is just like Eq. 5, already implemented within one of the Box-Jenkins operators ('Method3') in QSAR-Co-X, because it merely corresponds to a normalisation by the total number of elements.

Each of these probabilistic factors may simply be denoted as p(cj), and so the final deviation descriptors employed in that work [27] are similar to the standardised modified descriptors presented in Eq. 4. Yet these final descriptors embody a more complex moving average operator that is not implemented in QSAR-Co-X (cf. Eqs. 3–6). Still, 'Method4' (Eq. 6) may be applied with a slight modification to obtain the same modified descriptors used in that work. To that end, the python code of 'Method4' was adapted to calculate the modified descriptors ('Method4 modified', cf. Table 6) from the starting descriptors reported in that work [27]. Then, non-linear mt-QSAR models were developed using a

Table 4  Overall predictivity of the derived RF and GB models

                    RF                                       GB
Classification (a)  Str (fivefold CV) (d)  Ts (e)  Vd (f)    Str (fivefold CV) (d)  Ts (e)  Vd (f)
TP                  330                    77      110       331                    75      108
TN                  98                     32      36        97                     32      38
FP                  13                     3       8         14                     3       6
FN                  11                     2       6         10                     4       8
Sn (%)              96.77                  91.43   81.82     97.07                  91.43   86.36
Sp (%)              88.29                  97.47   94.83     87.39                  94.94   93.10
Acc (%)             94.69                  95.61   91.25     94.69                  93.86   91.25
F1 score (%)        96.49                  96.85   94.02     96.50                  95.54   93.91
MCC (b)             –                      0.896   0.778     –                      0.857   0.784
AUROC (c)           –                      0.944   0.883     –                      0.932   0.897

(a) TP: True positive, TN: True negative, FP: False positive, FN: False negative, Sn: Sensitivity, Sp: Specificity, Acc: Accuracy
(b) Matthews correlation coefficient
(c) Score of the area under the receiver operating characteristic curve
(d) Sub-training set
(e) Test set
(f) Validation set
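All of the headline statistics reported in Tables 3 and 4 follow directly from the four confusion-matrix counts. As a sketch, the SFS-LDA sub-training column of Table 3 (TP = 333, TN = 106, FP = 5, FN = 8) can be reproduced with a few lines of plain Python:

```python
# Recompute accuracy, F1 score and MCC from confusion-matrix counts
# (here: the SFS-LDA sub-training column of Table 3).
import math

tp, tn, fp, fn = 333, 106, 5, 8

acc = (tp + tn) / (tp + tn + fp + fn)
f1 = 2 * tp / (2 * tp + fp + fn)
mcc = (tp * tn - fp * fn) / math.sqrt(
    (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))

print(f"Acc = {acc:.2%}, F1 = {f1:.2%}, MCC = {mcc:.3f}")
```

This yields Acc = 97.12% and MCC = 0.923, matching the table (small last-digit differences for the other entries come down to rounding conventions).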

Table 5  Condition-wise prediction for the FS-LDA model built in CS1


SN Experimental condition element (cj) Test set Validation set
cl mt bt #Instances %Accuracy #Instances %Accuracy

1 Normal-MCF7-neo/Her2 Non-mutant PI3K-α 2 100 3 100


2 Normal-B-cells Non-mutant PI3K-δ 8 100 12 100
3 Normal-BT20 Non-mutant PI3K-α 2 100 1 100
4 Normal-BT474 Mutant PI3K-α 7 85.71 10 90
5 Normal-BT474 Non-mutant PI3K-α 14 85.71 13 84.62
6 Normal-HCC1954 Non-mutant PI3K-β 1 100 1 100
7 Normal-HCT116 Mutant PI3K-α 4 100 2 100
8 Normal-HCT116 Non-mutant PI3K-α 1 100 3 100
9 Normal-HEK293 Non-mutant PI3K-β 1 100 na na
10 Normal-HL60 Non-mutant PI3K-α 3 66.67 6 100
11 Normal-HL60 Non-mutant PI3K-β 5 100 2 50
12 Normal-HL60 Non-mutant PI3K-γ 2 100 na na
13 Normal-HL60 Non-mutant PI3K-δ na na 6 83.33
14 Normal-HL60 Non-mutant PI3K-γ na na 6 100
15 Normal-JeKo1 Non-mutant PI3K-δ 4 100 4 100
16 Normal-MDA-MB-453 Mutant PI3K-α 4 100 5 100
17 Normal-MDA-MB-468 Non-mutant PI3K-β 3 100 10 100
18 Normal-PBMC Non-mutant PI3K-δ na na 1 0
19 Normal-PC3 Non-mutant PI3K-α 7 100 12 91.67
20 Normal-PC3 Non-mutant PI3K-β 2 100 na na
21 Normal-PC3 Non-mutant PI3K-γ 1 100 na na
22 Normal-Ramos Non-mutant PI3K-δ 1 100 na na
23 Normal-Ri-1 Non-mutant PI3K-δ na na 5 80
24 Normal-THP1 Non-mutant PI3K-β 1 0 na na
25 Normal-THP1 Non-mutant PI3K-δ 3 100 6 66.67
26 Normal-THP1 Non-mutant PI3K-γ 1 100 na na
27 Normal-U2OS Non-mutant PI3K-α 2 100 3 100
28 Normal-U87MG Non-mutant PI3K-α 7 100 15 86.67
29 Normal-U937 Non-mutant PI3K-δ 1 100 na na
30 PTEN-deficient-MDA-MB-468 Non-mutant PI3K-β 5 100 7 100
31 PTEN-deficient-PC3 Non-mutant PI3K-β 10 100 19 89.47
32 PTEN-deficient-U87MG Non-mutant PI3K-α 3 100 na na
33 PTEN-Null-MDA-MB-468 Non-mutant PI3K-β 8 100 8 100
34 PTEN-Null-PC3 Non-mutant PI3K-α 1 100 na na
The experimental condition elements not well predicted by the model are highlighted in bold
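The condition-wise breakdown of Table 5 amounts to grouping the predictions by experimental condition element and computing a per-condition accuracy. A minimal sketch of that bookkeeping (with invented records, not the actual CS1 data) could read:

```python
# Per-condition accuracy, in the spirit of Module 4's condition-wise
# prediction. The records below are invented for illustration only.
from collections import defaultdict

# (condition element, observed class, predicted class)
records = [
    ("PI3K-alpha", 1, 1), ("PI3K-alpha", 1, 1), ("PI3K-alpha", -1, 1),
    ("PI3K-delta", 1, 1), ("PI3K-delta", -1, -1),
]

hits, totals = defaultdict(int), defaultdict(int)
for condition, observed, predicted in records:
    totals[condition] += 1
    hits[condition] += int(observed == predicted)

for condition in sorted(totals):
    acc = 100.0 * hits[condition] / totals[condition]
    print(f"{condition}: {totals[condition]} instances, {acc:.2f}% accuracy")
```

Conditions represented by very few instances (such as elements 18 and 24 above) produce accuracies that swing between 0 and 100%, which is why their poor predictions should be interpreted with care.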

pre-defined data distribution, i.e., using the same training and validation sets employed in the original work [27]. Eighty percent of the training dataset was treated as the sub-training set, whereas the remainder was used as the test set for setting up RF-based non-linear models. However, instead of employing pre-selected features for developing the non-linear models, as was done in the original work, here we resort to the maximum descriptor space for model generation. In order to remove less descriptive, highly correlated features, a data pre-treatment was carried out by setting the correlation cut-off at 0.95 and the variance cut-off at 0.001. In addition, a fivefold cross-validation was used for the grid search as well as for inspecting the internal predictivity of the sub-training set. After developing the model using the adapted 'Method4', this model was also compared to models derived with the other operators (i.e., the original Methods 1–4) implemented in QSAR-Co-X. However, when calculating the descriptors using Methods 1–3, the probabilistic factors (i.e., the original p(me)lc, p(bs)tm, and p(ep)tc factors) could not be accommodated. Therefore, for these methods the influence of all secondary

experimental elements was discarded. However, these probabilistic factors were considered in the model developed by Method4. The results of the RF models developed with all five types of moving average operators and related deviation descriptors are shown in Table 6.

As seen, the models obtained here display more predictive ability than the model reported in the original investigation (MCC score of 0.524 for the test set) [27]. Nevertheless, the latter is more interpretable, since only a limited number of features was used for its development. Therefore, a direct comparison of the reported model with the current RF models is not feasible, nor is it the purpose of the current case study. Rather, our aim here is to disclose the importance of the different operators implemented in QSAR-Co-X. Even though the variations in the operators did not have a significant impact on the statistical quality of these models, the mt-QSAR model obtained from 'Method1' is found to produce the best solution relying on both internal and external predictivity. However, this outcome is based on only one data-distribution technique and one machine learning method. Therefore, no final conclusion might be drawn regarding the utility of these operators. The case study nevertheless demonstrates that the multiple operators implemented in QSAR-Co-X may be utilised to judge which option is most suitable for a specific dataset. The results, i.e., the output files from the current toolkit, obtained from the RF model by applying Method1 for CS3, are given in Additional file 4. Finally, it is important to remark here that the previously reported model was developed by resorting to commercial software.

Case study‑4 (CS4)

Case studies 1–3 were examined mainly to demonstrate some of the basic utilities of QSAR-Co-X. In this final case study, however, we attempted to compare the performances of previously reported QSAR-Co models with newly created QSAR-Co-X models. For such a purpose, we collected a previously reported dataset containing 2,123 peptides (amino acid length 4–119) with antibacterial activities against multiple Gram-negative bacterial strains and cytotoxicity against multiple cell types [9]. This dataset pertains to two experimental condition elements (cj), namely: bs (biological target) and me (measure of effect). Each peptide in the data set was assigned to one out of two possible categories, namely: positive (+1), i.e., indicating high antibacterial activity or low cytotoxicity, or negative (−1), i.e., showing low antibacterial activity or high cytotoxicity. The cut-off values to annotate a peptide as positive were: MIC ≤ 14.97 μM, or CC50 ≥ 60.91 μM, or HC50 ≥ 105.7 μM. For more details, please refer to the original investigation [9]. Mt-QSAR modelling of this dataset has already been performed using the QSAR-Co tool [15], the linear model being developed with the GA-LDA technique and the non-linear model with the RF technique. In this case study, three additional linear models were built using QSAR-Co-X, keeping the same maximum number of descriptors (i.e., four) and data distributions. Table 7 shows the statistical parameters obtained for all these models. Note that two LDA models were set up by applying SFS for feature selection with two different scoring parameters (i.e., Accuracy and AUROC).

The Wilks' lambda (λ) value obtained for the originally developed GA-LDA model is 0.422, whereas those of the FS-LDA, SFS-LDA (Scoring: Accuracy) and SFS-LDA (Scoring: AUROC) models are 0.421, 0.444 and 0.451, respectively. As seen in Table 7, among the QSAR-Co-X linear models, the SFS-LDA model generated with the AUROC scoring parameter is found to be the best one, judging from its overall predictivity results. Furthermore, the overall predictivity of this model is significantly higher than that of the GA-LDA model previously reported [15].

Similarly, in this case study we also developed two non-linear models, through the RF and GB techniques. It is important to mention here that QSAR-Co does not provide any option for hyperparameter optimisation, and therefore the earlier reported RF model was generated without it. On the other hand, the models generated by QSAR-Co-X were set up with hyperparameter optimisation by supplying the values for the parameter settings in its Module 2. Table 8 shows the attained results for these models.

By inspecting the statistical parameters given in Table 8, it is clear that the GB model affords the best predictivity and leads to a significant improvement in the external predictive accuracy when compared to that of the previously reported RF model generated with QSAR-Co. However, it is noteworthy that the significance of this GB-based model is not limited to its better performance. Since this model was developed with hyperparameter optimisation, its overall acceptability is much higher than that of the RF model generated with QSAR-Co without any tuning of hyperparameters [45, 46]. On the whole, the results shown in Tables 7 and 8 clearly suggest that the QSAR-Co-X toolkit provides some very useful strategies for setting up linear and non-linear mt-QSAR models.

The results of the SFS-LDA and GB models, i.e., the output files from the current toolkit, obtained for CS4 are given in the Supplementary Information (Additional file 5).
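Under the cut-offs just listed, the class assignment is a simple rule on whichever measure of effect a given data point reports. A hypothetical helper (not part of the toolkit; the function name and the table layout are our own) might look like:

```python
# Assign the CS4 peptide classes from the stated cut-offs. Each data point
# is assumed to carry a single measure of effect (me).
CUTOFFS = {  # measure -> (threshold, side on which the value is "positive")
    "MIC": (14.97, "le"),    # low MIC: high antibacterial activity
    "CC50": (60.91, "ge"),   # high CC50: low cytotoxicity
    "HC50": (105.7, "ge"),   # high HC50: low haemolytic toxicity (assumed reading)
}

def annotate(measure: str, value: float) -> int:
    """Return +1 (positive) or -1 (negative) for one data point."""
    threshold, direction = CUTOFFS[measure]
    ok = value <= threshold if direction == "le" else value >= threshold
    return 1 if ok else -1

print(annotate("MIC", 10.0))   # potent antibacterial -> 1
print(annotate("MIC", 50.0))   # weak antibacterial -> -1
print(annotate("CC50", 80.0))  # low cytotoxicity -> 1
```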
Table 6  Overall performance of the final RF models in CS3 (a)

Classification (b)  Method4 modified        Method1 (Eq. 1) (d)     Method2 (Eq. 3) (d)     Method3 (Eq. 4) (d)     Method4 (Eq. 6)
                    Str(e)  Ts(f)  Vd(g)    Str(e)  Ts(f)  Vd(g)    Str(e)  Ts(f)  Vd(g)    Str(e)  Ts(f)  Vd(g)    Str(e)  Ts(f)  Vd(g)
TP                  1011    252    419      1023    251    420      1015    250    412      1030    243    418      1020    249    427
TN                  744     190    303      753     193    315      746     193    313      730     187    303      731     188    302
FP                  230     56     95       221     53     83       228     53     85       244     59     95       243     58     96
FN                  191     46     73       179     47     72       187     48     80       172     55     74       182     49     65
Sn (%)              84.11   77.24  76.13    85.11   78.45  79.15    84.44   78.45  78.64    85.69   76.02  76.13    84.86   76.42  75.88
Sp (%)              76.39   84.56  85.16    77.31   84.22  85.37    76.59   83.89  83.74    74.95   81.54  84.96    75.05   83.56  86.79
Acc (%)             80.65   81.25  81.12    81.62   81.62  82.58    80.93   81.43  81.46    80.88   79.04  81.01    80.47   80.33  81.91
F1 score (%)        82.77   83.17  83.3     83.65   83.39  84.42    83.03   83.19  83.32    83.2    81     83.18    82.76   82.31  84.13
MCC (c)             0.608   0.621  0.617    0.627   0.628  0.647    0.613   0.625  0.625    0.612   0.576  0.614    0.604   0.602  0.633

(a) The most significant results are highlighted in bold. All the models were generated using random state 'None' in Module 2 of the toolkit
(b) TP: True positive, TN: True negative, FP: False positive, FN: False negative, Sn: Sensitivity, Sp: Specificity, Acc: Accuracy
(c) Matthews correlation coefficient
(d) No secondary experimental elements used
(e) Sub-training set
(f) Test set
(g) Validation set
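The operators compared in Table 6 all derive deviation descriptors from condition-wise moving averages. A minimal sketch of the plainest step, in the spirit of a 'Method1'-style Box-Jenkins operator and assuming the usual deviation from the mean descriptor value of the active training compounds under each condition, is:

```python
# Box-Jenkins-style deviation descriptors: subtract, from each compound's
# descriptor value, the average over active training compounds that share
# the same experimental condition. Simplified sketch with toy numbers.
from collections import defaultdict

# (descriptor value, condition element cj, class) for the training set
train = [(1.0, "c1", 1), (3.0, "c1", 1), (2.0, "c1", -1), (4.0, "c2", 1)]

sums, counts = defaultdict(float), defaultdict(int)
for x, cj, label in train:
    if label == 1:                 # averages taken over actives only (assumed)
        sums[cj] += x
        counts[cj] += 1
avg = {cj: sums[cj] / counts[cj] for cj in sums}

def deviation(x, cj):
    """Deviation descriptor: x minus the condition-wise moving average."""
    return x - avg[cj]

print(deviation(2.0, "c1"))  # 2.0 - mean(1.0, 3.0) = 0.0
print(deviation(5.0, "c2"))  # 5.0 - 4.0 = 1.0
```

The operators of Eqs. 3–6 refine this basic deviation with different normalisations and, for 'Method4', the probabilistic factors discussed above.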

Table 7  Overall performance of the final linear models for CS4

                    QSAR-Co (a)             QSAR-Co-X
                    GA-LDA                  FS-LDA                  SFS-LDA (Scoring: Accuracy)  SFS-LDA (Scoring: AUROC)
Classification (b)  Str(c)  Ts(d)  Vd(e)    Str(c)  Ts(d)  Vd(e)    Str(c)  Ts(d)  Vd(e)         Str(c)  Ts(d)  Vd(e)
TP                  941     418    315      940     422    323      934     413    322           930     407    328
TN                  932     389    311      925     388    302      947     393    309           956     406    317
FP                  67      33     16       74      34     25       52      29     18            43      16     10
FN                  97      33     40       98      29     32       104     38     33            108     44     27
Sn (%)              90.65   92.68  88.73    92.59   91.94  92.35    94.79   93.13  94.49         95.7    96.21  96.94
Sp (%)              93.29   92.18  95.11    90.56   93.57  90.99    89.98   91.57  90.7          89.59   90.24  92.39
Acc (%)             91.95   92.44  91.79    91.56   92.78  91.64    92.34   92.32  92.52         92.59   93.13  94.57
MCC (f)             0.839   0.849  0.838    0.831   0.855  0.833    0.848   0.847  0.851         0.853   0.864  0.893

The most significant results are highlighted in bold
(a) Model previously reported in [21]
(b) TP: True positive, TN: True negative, FP: False positive, FN: False negative, Sn: Sensitivity, Sp: Specificity, Acc: Accuracy
(c) Sub-training set
(d) Test set
(e) Validation set
(f) Matthews correlation coefficient
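The Wilks' λ values compared above measure how much within-class scatter remains relative to the total scatter (a λ close to 0 indicates strong discrimination). Under the standard definition λ = det(W)/det(T), a sketch of the computation for a toy two-class problem is:

```python
# Wilks' lambda as det(W)/det(T): pooled within-group scatter over total
# scatter. Toy two-class data; small lambda means better class separation.
import numpy as np

rng = np.random.default_rng(0)
X_pos = rng.normal(loc=+2.0, size=(40, 3))   # class +1 samples
X_neg = rng.normal(loc=-2.0, size=(40, 3))   # class -1 samples

def scatter(M):
    centred = M - M.mean(axis=0)
    return centred.T @ centred

W = scatter(X_pos) + scatter(X_neg)          # pooled within-class scatter
T = scatter(np.vstack([X_pos, X_neg]))       # total scatter
lam = np.linalg.det(W) / np.linalg.det(T)
print(round(lam, 3))                          # well below 1 for separated classes
```

For well-separated classes like these, λ falls far below 1, mirroring the low sub-training λ values (around 0.42–0.45) reported for the CS4 linear models.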

Table 8  Overall performance of the final non-linear models for case study 4 (a)

                    RF (without HPO (c), QSAR-Co) (d)   RF (with HPO, QSAR-Co-X)        GB (with HPO, QSAR-Co-X)
Classification (b)  Str (tenfold CV)(e)  Ts(f)  Vd(g)   Str (tenfold CV)(e)  Ts(f)  Vd(g)   Str (tenfold CV)(e)  Ts(f)  Vd(g)
TP                  994                  431    341     969                  433    343     996                  443    346
TN                  953                  405    317     936                  405    316     949                  406    318
FP                  46                   17     10      63                   17     11      50                   16     9
FN                  44                   20     14      69                   18     12      42                   8      9
Sn (%)              95.76                95.57  96.06   93.35                95.97  96.64   95.95                96.21  97.46
Sp (%)              95.4                 95.97  96.94   93.69                96.01  96.62   94.99                98.22  97.25
Acc (%)             95.58                91.52  96.48   93.52                95.99  96.63   95.48                97.25  97.36
MCC (h)             0.912                0.915  0.93    0.884                0.920  0.932   0.91                 0.945  0.947

(a) The most significant results are highlighted in bold. The QSAR-Co-X models were generated using random state 1 in Module 2 of the toolkit
(b) TP: True positive, TN: True negative, FP: False positive, FN: False negative, Sn: Sensitivity, Sp: Specificity, Acc: Accuracy
(c) HPO: Hyperparameter optimisation
(d) Model previously reported in [15]
(e) Sub-training set
(f) Test set
(g) Validation set
(h) Matthews correlation coefficient
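The contrast drawn here between tuned and untuned models can be reproduced in miniature with scikit-learn: fit one classifier with library defaults and one through a small tuned grid, and compare their cross-validated accuracies. This is an illustrative sketch on synthetic data with a tiny grid, not the CS4 protocol.

```python
# Default vs. hyperparameter-optimised gradient boosting on synthetic data.
# Illustrative only: toy data and a tiny grid; the CS4 setup is far larger.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score

rng = np.random.default_rng(1)
X = rng.normal(size=(250, 8))
y = (X[:, :2].sum(axis=1) + 0.3 * rng.normal(size=250) > 0).astype(int)

default_acc = cross_val_score(
    GradientBoostingClassifier(random_state=1), X, y, cv=5).mean()

tuned = GridSearchCV(
    GradientBoostingClassifier(random_state=1),
    param_grid={"learning_rate": [0.05, 0.1, 0.2], "max_depth": [2, 3]},
    cv=5)
tuned.fit(X, y)

print(f"default CV accuracy: {default_acc:.3f}")
print(f"tuned   CV accuracy: {tuned.best_score_:.3f}")
```

Whether tuning helps, and by how much, depends on the dataset, which is exactly why reporting both settings, as done for Table 8, is informative.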

Conclusions

In this work, we described the user-friendly open-source QSAR-Co-X toolkit, an extension of our previously launched Java-based tool QSAR-Co [15], which has a number of advantages over the latter in supporting mt-QSAR modelling efforts. Indeed, the current toolkit moves a step forward by including more updated and advanced strategies, namely concerning data-distribution options, schemes for the calculation of modified descriptors, feature selection algorithms, machine learning methods, validation strategies, as well as analysis techniques. The QSAR-Co-X toolkit is based on Python, undoubtedly one of the most popular and widely used programming languages, especially in the field of data science [22]. The current toolkit utilises some well-known Python-based libraries, such as NumPy [47], SciPy [48],

Pandas [49], Matplotlib [50], Tkinter (https://anzeljg.github.io/rin2/book2/2405/docs/tkinter/index.html), and Scikit-learn [30, 31]. The codes of the toolkit are made available in the public domain so that necessary modifications/updates may be easily implemented during their utilisation. Similar to QSAR-Co, this toolkit relies primarily on Box-Jenkins-based mt-QSAR modelling, which has proved to be highly efficient in handling large datasets pertaining to various experimental and/or theoretical conditions [10–15, 20, 26–28, 51]. Further, the ability to explore all of its code tools simultaneously, as well as the graphical user interface itself, provides simple and efficient solutions to the main practical challenges implicated in mt-QSAR modelling. The latter was clearly shown by testing its functionalities on four case studies. Indeed, we were able to demonstrate the basic utilities of its tools and, at the same time, also depicted how different feature selection algorithms, machine learning methods, dataset division options and different Box-Jenkins operators may play crucial roles in the development of more predictive mt-QSAR models. The toolkit allows users to save the developed models and use these for predicting the properties of new external chemicals. Clearly, future investigations using various datasets will lead to a better understanding of the utilities and shortcomings of the functionalities of the present toolkit and will naturally give rise to its upgrading. Yet, on the whole, the toolkit presented here has the potential of becoming a widely used platform for easily setting up predictive mt-QSAR models.

Authors' contributions
AKH designed and implemented the software. MNDSC tested the software. All authors read and approved the final manuscript.

Availability of data and materials
Project name: QSAR-Co-X.
Project home page: The source code of the toolkit along with its manual and reference data files are available from https://github.com/ncordeirfcup/QSAR-Co-X.
Operating system(s): Platform independent.
Programming language: Python.
Other requirements: NumPy, SciPy, Pandas, Matplotlib, Tkinter and Scikit-learn.
License: GNU GPL version 3.
Any restrictions to use by non-academics: None.

Declarations

Competing interests
The authors declare that they have no competing interests.

Received: 6 December 2020   Accepted: 31 March 2021

References
1. Muratov EN, Bajorath J, Sheridan RP, Tetko IV, Filimonov D, Poroikov V, Oprea TI, Baskin II, Varnek A, Roitberg A, Isayev O, Curtalolo S, Fourches D, Cohen Y, Aspuru-Guzik A, Winkler DA, Agrafiotis D, Cherkasov A, Tropsha A (2020) QSAR without borders. Chem Soc Rev 49:3525–3564
2. Lewis RA, Wood D (2014) Modern 2D QSAR for drug discovery. WIREs Comput Mol Sci 4:505–522
3. Neves BJ, Braga RC, Melo CC, Moreira JT, Muratov EN, Andrade CH (2018) QSAR-based virtual screening: advances and applications in drug discovery. Front Pharmacol 9:1275
4. Gramatica P (2020) Principles of QSAR modeling: comments and suggestions from personal experience. Int J Quant Struct-Prop Relat 5:61–97
5. Toropov AA, Toropova AP (2020) QSPR/QSAR: State-of-art, weirdness, the
QSAR models. future. Molecules 25:1292
6. Polanski J (2017) Big data in structure-property studies—from definitions
to models. In: Roy K (ed) Advances in QSAR Modeling. Challenges and
Supplementary Information Advances in Computational Chemistry and Physics. Springer, Cham
The online version contains supplementary material available at https://​doi.​ 7. Speck-Planche A (2018) Recent advances in fragment-based computational
org/​10.​1186/​s13321-​021-​00508-0. drug design: tackling simultaneous targets/biological effects. Future Med
Chem 10:2021–2024
8. Speck-Planche A, Cordeiro MNDS (2017) Advanced in silico approaches
Additional file 1. File containing the QSAR-Co-X generated ROC plots for drug discovery: mining information from multiple biological and
(Figures S1–3) and additional information related to the several case stud- chemical data through mtkQSBER and pt-QSPR strategies. Curr Med Chem
ies (Tables S1–5). 24:1687–1704
Additional file 2. Folder (CS_1) containing the results (i.e., the output files 9. Kleandrova VV, Ruso JM, Speck-Planche A, Cordeiro MNDS (2016) Enabling
from the current toolkit) of the FS-LDA, SFS-LDA, RF and GB models for the discovery and virtual screening of potent and safe antimicrobial pep-
case study 1. tides. Simultaneous prediction of antibacterial activity and cytotoxicity. ACS
Comb Sci 18:490–498
Additional file 3. Folder (CS_2) containing both the input files and the 10. Halder AK, Natalia M, Cordeiro MNDS (2019) Probing the environmental tox-
results (i.e., the output files from the current toolkit) of the SFS-LDA mod- icity of deep eutectic solvents and their components: An in silico modeling
els for Case study-2. approach. ACS Sust Chem Eng 7:10649–10660
Additional file 4. Folder (CS_3) containing both the input files and the 11. Halder AK, Cordeiro MNDS (2019) Development of multi-target chemomet-
results (i.e., the output files from the current toolkit) obtained from the RF ric models for the inhibition of class i pi3k enzyme isoforms: a case study
model by applying Method1 for Case study-3. using QSAR-Co tool. Int J Mol Sci 20:4191
12. Speck-Planche A (2019) Multicellular target QSAR model for simultane-
Additional file 5. Folder (CS_4) containing the input file of SFS-LDA and ous prediction and design of anti-pancreatic cancer agents. ACS Omega
GB models and the results (i.e., the output files from the current toolkit) 4:3122–3132
obtained from the SFS-LDA for Case study-4. 13. Speck-Planche A, Scotti MT (2019) BET bromodomain inhibitors: fragment-
based in silico design using multi-target QSAR models. Mol Divers
23:555–572
Acknowledgements
14. Kleandrova VV, Scotti MT, Scotti L, Nayarisseri A, Speck-Planche A (2020)
This work received financial support from FCT - Fundação para a Ciência e
Cell-based multi-target QSAR model for design of virtual versatile inhibitors
Tecnologia through funding for the project PTDC/QUI-QIN/30649/2017. The
of liver cancer cell lines. SAR QSAR Environ Res 31:815–836
authors would like to thank also the FCT support to LAQV-REQUIMTE (UID/
QUI/50006/2020).
15. Ambure P, Halder AK, Diaz HG, Cordeiro MNDS (2019) QSAR-Co: an open source software for developing robust multitasking or multitarget classification-based QSAR models. J Chem Inf Model 59:2538–2544
16. Rogers D, Hopfinger AJ (1994) Application of genetic function approximation to quantitative structure-activity relationships and quantitative structure-property relationships. J Chem Inf Comput Sci 34:854–866
17. Ambure P, Aher RB, Gajewicz A, Puzyn T, Roy K (2015) "NanoBRIDGES" software: open access tools to perform QSAR and nano-QSAR modeling. Chemometr Intell Lab Syst 147:1–13
18. Breiman L (2001) Random forests. Mach Learn 45:5–32
19. Organization for Economic Co-Operation and Development (OECD). Guidance document on the validation of (quantitative) structure-activity relationship [(Q)SAR] models. OECD Series on Testing and Assessment 69; OECD Document ENV/JM/MONO2007, pp 55–65
20. Halder AK, Giri AK, Cordeiro MNDS (2019) Multi-target chemometric modelling, fragment analysis and virtual screening with ERK inhibitors as potential anticancer agents. Molecules 24:3909
21. Khan PM, Roy K (2018) Current approaches for choosing feature selection and learning algorithms in quantitative structure-activity relationships (QSAR). Expert Opin Drug Discov 13:1075–1089
22. Van Rossum G, Drake FL (2009) Python 3 reference manual. CreateSpace, CA
23. Gore PA (2000) Cluster analysis. In: Tinsley HEA, Brown SD (eds) Handbook of applied multivariate statistics and mathematical modeling. Academic Press, San Diego, p 297
24. Mauri A, Consonni V, Pavan M, Todeschini R (2006) Dragon software: an easy approach to molecular descriptor calculations. MATCH Commun Math Comput Chem 56:237–248
25. Valdes-Martini JR, Marrero-Ponce Y, Garcia-Jacas CR, Martinez-Mayorga K, Barigye SJ, Almeida YSV, Perez-Gimenez F, Morell CA (2017) QuBiLS-MAS, open source multi-platform software for atom- and bond-based topological (2D) and chiral (2.5D) algebraic molecular descriptors computations. J Cheminform 9:35
26. Speck-Planche A, Cordeiro MNDS (2017) De novo computational design of compounds virtually displaying potent antibacterial activity and desirable in vitro ADMET profiles. Med Chem Res 26:2345–2356
27. Speck-Planche A (2020) Multi-scale QSAR approach for simultaneous modeling of ecotoxic effects of pesticides. In: Roy K (ed) Ecotoxicological QSARs. Springer, New York
28. Speck-Planche A (2018) Combining ensemble learning with a fragment-based topological approach to generate new molecular diversity in drug discovery: in silico design of Hsp90 inhibitors. ACS Omega 3:14704–14716
29. Menzies T, Kocagüneli E, Minku L, Peters F, Turhan B (2015) Complexity: using assemblies of multiple models. In: Menzies T, Kocagüneli E, Minku L, Peters F, Turhan B (eds) Sharing data and models in software engineering. Morgan Kaufmann, Boston
30. Hao JG, Ho TK (2019) Machine learning made easy: a review of scikit-learn package in Python programming language. J Educ Behav Stat 44:348–361
31. Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Weiss R, Dubourg V, Vanderplas J, Passos A, Cournapeau D, Brucher M, Perrot M, Duchesnay E (2011) Scikit-learn: machine learning in Python. J Mach Learn Res 12:2825–2830
32. Wilks SS (1932) Certain generalizations in the analysis of variance. Biometrika 24:471–494
33. Hahs-Vaughn DL, Lomax RG (2020) An introduction to statistical concepts. Routledge, NY
34. Boughorbel S, Jarray F, El-Anbari M (2017) Optimal classifier for imbalanced data using Matthews correlation coefficient metric. PLoS ONE 12:e0177678
35. Fawcett T (2006) An introduction to ROC analysis. Pattern Recognit Lett 27:861–874
36. Hanczar B, Hua JP, Sima C, Weinstein J, Bittner M, Dougherty ER (2010) Small-sample precision of ROC-related estimates. Bioinformatics 26:822–830
37. Roy K, Kar S, Ambure P (2015) On a simple approach for determining applicability domain of QSAR models. Chemometr Intell Lab Syst 145:22–29
38. Cover TM, Hart PE (1967) Nearest neighbor pattern classification. IEEE Trans Inf Theory 13:21–27
39. McCallum A, Nigam K (2001) A comparison of event models for naive Bayes text classification. Work Learn Text Categ 752:41–48
40. Boser BE, Guyon IM, Vapnik VN (1992) A training algorithm for optimal margin classifiers. In: Proceedings of the fifth annual workshop on computational learning theory. ACM, pp 144–152
41. Friedman JH (2001) Greedy function approximation: a gradient boosting machine. Ann Stat 29:1189–1232
42. Huang GB, Babri HA (1998) Upper bounds on the number of hidden neurons in feedforward networks with arbitrary bounded nonlinear activation functions. IEEE Trans Neural Netw 9:224–229
43. Ambure P, Bhat J, Puzyn T, Roy K (2019) Identifying natural compounds as multi-target-directed ligands against Alzheimer's disease: an in silico approach. J Biomol Struct Dyn 37:1282–1306
44. Mathea M, Klingspohn W, Baumann K (2016) Chemoinformatic classification methods and their applicability domain. Mol Inform 35:160–180
45. Probst P, Boulesteix AL, Bischl B (2019) Tunability: importance of hyperparameters of machine learning algorithms. J Mach Learn Res 20:1–32
46. Wu J, Chen X-Y, Zhang H, Xiong L-D, Lei H, Deng S-H (2019) Hyperparameter optimization for machine learning models based on Bayesian optimization. J Electr Sci Technol 17:26–40
47. van der Walt S, Colbert SC, Varoquaux G (2011) The NumPy array: a structure for efficient numerical computation. Comput Sci Eng 13:22–30
48. Virtanen P, Gommers R, Oliphant TE, Haberland M, Reddy T, Cournapeau D, Burovski E, Peterson P, Weckesser W, Bright J, van der Walt SJ, Brett M, Wilson J, Millman KJ, Mayorov N, Nelson ARJ, Jones E, Kern R, Larson E, Carey CJ, Polat I, Feng Y, Moore EW, VanderPlas J, Laxalde D, Perktold J, Cimrman R, Henriksen I, Quintero EA, Harris CR, Archibald AM, Ribeiro AH, Pedregosa F, van Mulbregt P, SciPy 1.0 Contributors (2020) SciPy 1.0: fundamental algorithms for scientific computing in Python. Nat Methods 17:261–272
49. McKinney W (2010) Data structures for statistical computing in Python. In: Proceedings of the 9th Python in Science Conference, Austin, Texas, 28 June–3 July 2010
50. Hunter JD (2007) Matplotlib: a 2D graphics environment. Comput Sci Eng 9:90–95
51. Halder AK, Melo A, Cordeiro MNDS (2020) A unified in silico model based on perturbation theory for assessing the genotoxicity of metal oxide nanoparticles. Chemosphere 244:125489

Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

