
Report Format Major 5 -4

The project report titled 'High-Potency Molecule Prediction Using AI-Driven Computational Model for Drug Discovery' discusses the integration of machine learning techniques in drug discovery, highlighting its ability to analyze complex datasets for efficient predictions. The report outlines the project's objectives, methodologies, and results, demonstrating how AI can enhance the drug development process. It is submitted by students of Sahyadri College of Engineering & Management as part of their Bachelor of Engineering in Artificial Intelligence and Data Science program.

VISVESVARAYA TECHNOLOGICAL UNIVERSITY

“JNANA SANGAMA”, BELAGAVI - 590 018

A PROJECT REPORT
on
“HIGH-POTENCY MOLECULE PREDICTION
USING AI-DRIVEN COMPUTATIONAL MODEL
FOR DRUG DISCOVERY”
Submitted by

ANURAG R POOJARY 4SF21AD008
B SRI SATYA SHRAVAN 4SF21AD013
RAYSON MININ FERNANDES 4SF21AD043
SHASHANK S K 4SF21AD048
In partial fulfillment of the requirements for the award of

BACHELOR OF ENGINEERING
in

ARTIFICIAL INTELLIGENCE AND DATA SCIENCE


Under the Guidance of
Mr. SHARATHCHANDRA N R
Assistant Professor, Department of CSE(AI&ML)
at

SAHYADRI
College of Engineering & Management
An Autonomous Institution
MANGALURU
2024 - 25
COMPUTER SCIENCE AND ENGINEERING
(ARTIFICIAL INTELLIGENCE AND MACHINE LEARNING)

CERTIFICATE
This is to certify that the Project entitled “High-Potency Molecule Prediction
Using AI-Driven Computational Model for Drug Discovery” has been carried
out by Anurag R Poojary (4SF21AD008), B Sri Satya Shravan
(4SF21AD013), Rayson Minin Fernandes (4SF21AD043) and Shashank S K
(4SF21AD048), the bonafide students of Sahyadri College of Engineering &
Management in partial fulfillment of the requirements for the award of Bachelor of
Engineering in Artificial Intelligence and Data Science of Visvesvaraya
Technological University, Belagavi during the year 2024 - 25. It is certified that all
corrections/suggestions indicated for Internal Assessment have been incorporated in the
report deposited in the departmental library. The project report has been approved as
it satisfies the academic requirements in respect of project work prescribed for the said
degree.

———————————– ——————————— —————————–


Project Guide: Mr. Sharathchandra N R, Assistant Professor, Dept. of CSE(AI&ML)
HoD: Dr. Pushpalatha K, Professor & Head, Dept. of CSE(AI&ML)
Principal: Dr. S S Injaganeri, SCEM, Mangaluru

External Viva-Voce

Examiner’s Name Signature with Date

1. ......................................... .........................................
2. ......................................... .........................................

Department of Computer Science and Engineering
(Artificial Intelligence and Machine Learning)

DECLARATION

We hereby declare that the entire work embodied in this Project Report titled
“High-Potency Molecule Prediction Using AI-Driven Computational Model
for Drug Discovery” has been carried out by us at Sahyadri College of Engineering
and Management, Mangaluru under the supervision of Mr. Sharathchandra N R for
the award of Bachelor of Engineering in Artificial Intelligence and Data
Science. This report has not been submitted to this or any other University.

Anurag R Poojary (4SF21AD008)
B Sri Satya Shravan (4SF21AD013)
Rayson Minin Fernandes (4SF21AD043)
Shashank S K (4SF21AD048)
Dept. of AI&ML, SCEM, Mangaluru
Abstract

The integration of machine learning (ML) techniques into drug discovery has
significantly transformed the pharmaceutical research landscape. ML methods are now
widely used to accelerate various stages of drug development, including target
identification, compound screening, and lead optimization. The ability of ML algorithms
to analyze complex datasets, such as protein-ligand interactions, chemical structures,
and biological responses, has enabled more efficient and accurate predictions in drug
design. Applications range from virtual screening and de novo drug design to drug
repurposing and toxicity prediction. Emerging areas, such as deep learning and
graph-based models, have further enhanced the predictive capabilities of ML,
facilitating the discovery of novel therapeutics. Additionally, advancements in
computational power, such as GPU-accelerated computing, have supported the
implementation of large-scale ML models, enabling the integration of diverse datasets
for a more holistic approach to drug discovery. This abstract highlights the
revolutionary role of ML in modern pharmaceutical research, emphasizing its potential
to address critical challenges in the development of effective and safe therapies.

Acknowledgement

It is with great satisfaction and euphoria that we are submitting the Project Report on
“High-Potency Molecule Prediction Using AI-Driven Computational Model
for Drug Discovery”. We have completed it as a part of the curriculum of
Visvesvaraya Technological University, Belagavi, for the award of Bachelor of
Engineering in Artificial Intelligence and Data Science.

We are profoundly indebted to our guide, Mr. Sharathchandra N R, Assistant
Professor, Department of Computer Science and Engineering(AI&ML), for innumerable
acts of timely advice and encouragement, and we sincerely express our gratitude.

We are profoundly indebted to Dr. Duddela Sai Prashanth, Associate Professor and
Project Work Coordinator, Department of Computer Science and Engineering(AI&ML)
for their invaluable support and guidance.

We express our sincere gratitude to Dr. Pushpalatha K, Professor & Head of the
Department of CSE(AI&ML) for her invaluable support and guidance.

We sincerely thank Dr. S. S. Injaganeri, Principal, Sahyadri College of Engineering &
Management, who has always been a great source of inspiration.

Finally, yet importantly, we express our heartfelt thanks to our family & friends for
their wishes and encouragement throughout the work.

Anurag R Poojary (4SF21AD008)
B Sri Satya Shravan (4SF21AD013)
Rayson Minin Fernandes (4SF21AD043)
Shashank S K (4SF21AD048)
Dept. of AI&ML, SCEM, Mangaluru

Table of Contents

Abstract
Acknowledgement
Table of Contents
List of Figures
List of Tables

1 Introduction
1.1 Overview
1.2 Purpose
1.3 Scope

2 Literature Survey

3 Problem Formulation
3.1 Problem Statement
3.2 Problem Introduction
3.3 Objectives

4 Software Requirements Specification
4.1 Introduction
4.1.1 Requirements
4.2 Purpose
4.3 User Characteristics
4.4 Interfaces
4.4.1 Hardware Interfaces
4.4.2 Software Interfaces

5 System Design
5.1 System Architecture Diagram
5.2 Class Diagram
5.3 State Diagram
5.4 Use Case Diagram

6 Implementation
6.1 Algorithm
6.1.1 Random Forest
6.2 Flow Chart
6.3 Implementation Codes
6.3.1 Bioactivity data concising
6.3.2 Polymerase basic protein2 (PB2) Exploratory Data Analysis
6.3.3 Descriptor Dataset Preparation
6.3.4 Random Forest Regressor implementation
6.3.5 Streamlit Application for Predicting Potency of the molecule

7 Results and Discussion
7.1 Outcomes
7.1.1 Home Page
7.1.2 Input File Processing
7.1.3 Result
7.1.4 Timeline of the Project Work
7.1.5 Outcomes Obtained
7.1.6 Objectives Achieved
7.1.7 Challenges Encountered

8 Conclusion and Future Work

Reference Inference

List of Figures

5.1 System Architecture Diagram
5.2 Class Diagram
5.3 State Diagram
5.4 Use Case Diagram
6.1 Flow Chart
7.1 Home Page
7.2 Calculated molecular descriptors
7.3 Result : prediction output

Chapter 1

Introduction

Bioinformatics is a powerful tool in the field of biopharmacy, especially in the process
of discovering new drugs. By combining biological data with advanced computational
techniques, bioinformatics allows scientists to analyze and interpret complex biological
information. This helps in identifying potential drug targets, such as specific genes or
proteins that play a key role in diseases, making the drug discovery process more efficient
and precise. The growing availability of large-scale biological datasets, such as genomic
and proteomic information, has further enhanced the role of bioinformatics, enabling
researchers to uncover deeper insights into disease mechanisms and drug action pathways.

In drug discovery, bioinformatics is used at various stages, from finding new drug
candidates to optimizing their properties and ensuring their safety. By using
computational methods to screen and design compounds, predict their behavior in the
body, and analyze clinical trial data, bioinformatics accelerates the development of new
medications. This integration of technology and biology is crucial for creating effective
and safe drugs more quickly and cost-effectively. Moreover, bioinformatics reduces
reliance on traditional trial-and-error approaches by offering data-driven solutions,
making the entire drug discovery pipeline more streamlined and innovative.

1.1 Overview
Bioinformatics serves as a cornerstone in biopharmacy, particularly in drug discovery
and development, by merging biological insights with computational innovation. It
enables researchers to analyze vast datasets, such as genomic, proteomic, and
metabolomic information, to identify novel drug targets and understand disease
mechanisms. Using advanced algorithms and machine learning, bioinformatics tools
facilitate the virtual screening of extensive chemical libraries, accelerating the
identification of lead compounds with high potential. Furthermore, these tools provide
predictive models to anticipate the interactions between drugs and their targets,
enhancing the accuracy and efficiency of early-stage drug discovery.

Beyond target discovery and screening, bioinformatics plays a vital role in
optimizing drug design. It helps predict how drug molecules interact with biological
systems, assessing factors like binding affinity, toxicity, and pharmacokinetics. This
predictive capability reduces the dependency on time-intensive and costly experimental
approaches. Additionally, bioinformatics contributes to the repurposing of existing
drugs by analyzing their potential applications in new therapeutic areas. It also
supports precision medicine by enabling the design of drugs tailored to individual
genetic profiles, addressing the variability in patient responses to treatments.

1.2 Purpose
Bioinformatics is essential in biopharmacy for drug discovery, as it combines biological
data with computational tools to identify and optimize new drug candidates. It helps find
potential drug targets, screen large compound libraries, design effective molecules, and
predict how these compounds will behave in the body. By analyzing clinical trial data and
predicting a drug’s safety and efficacy, bioinformatics makes the drug development process
faster, more efficient, and cost-effective. Additionally, the integration of bioinformatics
ensures a higher success rate in clinical trials by improving the accuracy of preclinical
predictions and identifying biomarkers for patient stratification.

Moreover, bioinformatics empowers researchers to address unmet medical needs by
exploring alternative therapeutic strategies, such as drug repurposing or combination
therapies. Its application extends beyond drug discovery to include vaccine design, gene
therapy development, and the study of antimicrobial resistance, making it a versatile tool
in modern biopharmacy.

1.3 Scope
The scope of this project involves using bioinformatics tools to enhance drug discovery and
development. It includes identifying potential drug targets through the analysis of genetic
and protein data, screening and designing drug candidates, and predicting their behavior
and safety in the body. The project will also focus on optimizing these compounds to
improve their effectiveness and minimize side effects. By integrating bioinformatics into
every stage of the drug development process, the project aims to accelerate the creation of
new, safe, and effective medications while reducing costs and improving patient outcomes.

Furthermore, the project will explore the use of machine learning algorithms and
advanced data visualization techniques to gain actionable insights from large datasets.
It will also evaluate the impact of structural modifications on drug efficacy and safety
profiles, providing a comprehensive understanding of compound behavior. The
application of bioinformatics in identifying biomarkers for precision medicine will also
be a critical aspect, ensuring that treatments are tailored to individual patients. This
comprehensive approach will not only enhance the efficiency of drug discovery but also
contribute to advancements in personalized medicine and global healthcare.



Chapter 2

Literature Survey

Bioinformatics plays a pivotal role in modern drug discovery, leveraging computational
methods and biological data to enhance the identification and optimization of drug
candidates. This section outlines existing research that integrates bioinformatics and
machine learning (ML) techniques to address challenges in drug development.

Machine Learning in Drug Discovery
Patel et al. (2020) reviewed machine learning methods in drug discovery, emphasizing
their application in streamlining
compound screening and predicting drug-target interactions. The study highlights how
ML models can significantly reduce the time and cost associated with drug development
by improving the accuracy of early-stage predictions. Additionally, the authors
identified challenges related to model interpretability and scalability, suggesting future
directions for enhancing ML-driven tools in drug discovery.

Machine Learning in Drug Discovery: A Review
Dara et al. (2021) discussed the
transformative impact of ML on drug discovery processes, focusing on its use in target
identification, lead optimization, and clinical trial analysis. They emphasized how ML
models handle large-scale datasets efficiently, enabling better predictions and reducing
drug development costs. The study also highlights the potential of combining ML with
genomics data to identify personalized therapeutic targets for complex diseases.

MISATO: Machine Learning Dataset of Protein–Ligand Complexes for Structure-Based Drug Discovery
Siebenmorgen et al. (2024) introduced the
MISATO dataset, which integrates protein-ligand data with advanced ML algorithms.
The dataset enables precise modeling of molecular interactions, facilitating efficient
drug design. The study also highlighted its potential in virtual screening, particularly
for challenging targets such as membrane proteins. The authors discussed the
importance of open-access datasets in driving collaborative research and innovation in
drug discovery.

Utilizing Graph Machine Learning in Drug Discovery and Development
Gaudelet et al. (2021) explored graph machine learning for modeling biomolecular
interactions and complex biological systems. Their study demonstrated the ability of
graph-based models to predict drug-target interactions and molecular properties. They
also proposed a framework for integrating graph neural networks into traditional
pipelines, improving the scalability and accuracy of predictions in large datasets. This
work underscores the importance of graph representations in modern drug discovery.

Novel Big Data-Driven Machine Learning Models for Drug Discovery Applications
Akondi et al. (2022) highlighted the role of big data and ML in
identifying potential drug candidates through virtual screening and toxicity prediction.
The study emphasized how ML models can handle diverse datasets, including genomic,
proteomic, and metabolomic data, to provide actionable insights. They also discussed
challenges in integrating multi-modal data, proposing strategies for improving model
performance and robustness.

Structure-Based Drug Discovery with Deep Learning
Çelik et al. (2023) focused
on the application of deep learning in structure-based drug discovery. They discussed how
convolutional and recurrent neural networks are used to predict protein-ligand binding
affinities with high accuracy. The study also explored the use of transfer learning for
adapting pre-trained models to new drug targets, enhancing the efficiency of virtual
screening. Their findings underscore the potential of AI in automating complex tasks in
the drug discovery pipeline.

Artificial Intelligence and Machine Learning in Drug Discovery and Development
Burbidge et al. (2001) demonstrated the utility of Support Vector
Machines (SVMs) in pharmaceutical data analysis for drug design. The study
highlighted how SVMs analyze structure-activity relationships (SAR), leading to
accurate predictions of bioactivity. This foundational work paved the way for ML
adoption in drug discovery, emphasizing the importance of feature engineering and
model validation in ensuring reliability.

Emergence of Drug Discovery in Machine Learning
Carpenter and Huang et al. (2018) examined the role of ML-based virtual screening in
Alzheimer’s drug discovery.
They utilized predictive models to identify lead compounds and optimize their
properties. The study emphasized the value of integrating ML with traditional
pharmacological methods to address the challenges of neurodegenerative diseases. Their
findings suggest that ML can accelerate the development of targeted therapies for
complex disorders.

Comparison of Conventional Statistical Methods with Machine Learning in Medicine: Diagnosis, Drug Development, and Treatment
Rodrigues and Bernardes et al. (2020) provided a comparative analysis of conventional statistical
methods and ML techniques in drug development. They highlighted how ML
algorithms outperform traditional methods in target discovery and lead optimization.
The study also discussed the challenges of interpreting ML models and proposed
solutions to enhance transparency and reproducibility in drug discovery pipelines.

Ranking Chemical Structures for Drug Discovery: A New Machine Learning Approach
Hudson et al. (2020) introduced a novel ML framework for ranking chemical
structures based on their therapeutic potential. The study utilized ensemble learning
methods to improve prediction accuracy, demonstrating their efficacy in narrowing down
large chemical libraries. The authors also emphasized the role of high-quality training
datasets in enhancing the reliability of ML-based virtual screening.

Machine Learning and Image-Based Profiling in Drug Discovery
Scheeder et al. (2018) explored how ML and high-throughput imaging can accelerate drug discovery.
Their research illustrated the use of convolutional neural networks (CNNs) in phenotypic
screening to identify bioactive compounds. The study highlighted the importance of
image-based data in understanding cellular responses to drug candidates, providing a
new dimension to preclinical studies.

Revolutionizing Pharmaceutical Research: Harnessing Machine Learning for a Paradigm Shift in Drug Discovery
Husnain et al. (2023) discussed the
transformative impact of ML on pharmaceutical research, particularly in reducing drug
development timelines and costs. They highlighted applications such as toxicity
prediction, ADMET profiling, and biomarker identification. The study also emphasized
the role of AI in democratizing drug discovery by enabling researchers from diverse
backgrounds to contribute.

Generative Machine Learning for De Novo Drug Discovery: A Systematic Review
Martinelli et al. (2022) reviewed the application of generative ML techniques
like GANs and VAEs in de novo drug discovery. The study demonstrated how these
models design novel compounds with desired properties. The authors discussed challenges
in training generative models, including the need for large, high-quality datasets. Their
findings highlight the potential of generative ML to revolutionize the creation of new
drugs.

Machine Learning for Target Discovery in Drug Development
Wang et al. (2022) focused on the application of ML techniques in identifying novel drug targets.
The study highlighted the ability of ML models to analyze large-scale omics data and
identify key molecular pathways involved in diseases. The authors emphasized the
potential of integrating ML with CRISPR-based screening methods to validate drug
targets effectively.

The Transformational Role of GPU Computing and Deep Learning in Drug Discovery
Agarwal et al. (2010) explored how GPU computing and deep learning have
transformed computational drug discovery. The study demonstrated how accelerated
computations enable the efficient screening of vast chemical libraries. The authors also
discussed the role of deep neural networks in improving the accuracy of protein-ligand
binding predictions.

Drug Design by Machine Learning: Support Vector Machines for Pharmaceutical Data Analysis
Roy et al. (2021) examined the application of SVMs
in analyzing pharmaceutical data for drug design. The study highlighted how SVMs can
predict biological activities and optimize lead compounds. The authors emphasized the
role of kernel functions in enhancing the performance of SVMs in diverse drug discovery
tasks.

Machine Learning-Based Virtual Screening and Its Applications to Alzheimer’s Drug Discovery
Shahab et al. (2023) utilized ML-based virtual
screening techniques to identify potential drugs for Alzheimer’s disease. The study
demonstrated the use of docking simulations combined with ML algorithms to prioritize
compounds for further testing. Their findings emphasize the importance of integrating
computational and experimental approaches in tackling neurodegenerative diseases.

Data Integration Using Advances in Machine Learning in Drug Discovery and Molecular Biology
Quazi and Fatima et al. (2023) analyzed the dual role of AI and
ML in drug discovery and repurposing. The study showcased how computational tools
identify new uses for existing drugs, addressing unmet medical needs. The authors also
discussed the challenges of integrating multi-omics data for more accurate predictions,
emphasizing the need for robust algorithms.

Role of Artificial Intelligence and Machine Learning in Drug Discovery and Drug Repurposing
Hassan et al. (2023) reviewed the growing role of AI and ML
in drug discovery and repurposing. The study highlighted applications such as virtual
screening, toxicity prediction, and molecular docking. The authors also discussed ethical
considerations in AI-driven drug discovery, particularly in ensuring transparency and
fairness.

Machine Learning-Based Drug Design for Identification of Thymidylate Kinase Inhibitors as Potential Anti-Mycobacterium Tuberculosis Agents
Ali et al. (2023) applied ML-based drug design techniques to identify inhibitors for
Mycobacterium tuberculosis. The study combined docking simulations with ML models
to predict binding affinities, providing a framework for developing new antimicrobial
agents. The authors emphasized the importance of interdisciplinary approaches in
addressing global health challenges.



Chapter 3

Problem Formulation

3.1 Problem Statement


In the field of drug discovery, identifying novel drug candidates with high efficacy and
safety profiles remains a significant challenge. Leveraging the ChEMBL database, which
contains a vast repository of bioactivity data for small molecules, presents an
opportunity to expedite the process of identifying potential drug targets and optimizing
lead compounds. However, extracting meaningful insights from this extensive dataset
requires advanced bioinformatics tools and methodologies. The objective of this study
is to utilize the ChEMBL database to predict and prioritize promising drug candidates
by integrating bioactivity, structural, and pharmacokinetic data. By doing so, we aim
to accelerate the discovery of new therapeutic agents that address unmet medical needs
effectively and efficiently.

3.2 Problem Introduction


Drug discovery is a complex and resource-intensive process, with one of the primary
challenges being the identification of new drug candidates that are both effective and
safe. Addressing this challenge requires leveraging advanced tools and data resources.
The ChEMBL database, a comprehensive repository of bioactivity information for small
molecules, offers significant potential to streamline the drug discovery process. This
database allows researchers to identify potential drug targets and refine lead compounds
efficiently.

However, extracting actionable insights from the vast amount of data available in
ChEMBL necessitates sophisticated bioinformatics and computational approaches. The
goal of this study is to harness the power of the ChEMBL database to predict and rank
promising drug candidates by analyzing bioactivity, structural properties, and
pharmacokinetic profiles. Through this approach, the aim is to enhance the efficiency of
the drug discovery pipeline and accelerate the development of innovative therapeutic
solutions to address unmet medical needs. This work seeks to integrate data-driven
methods with biological research to bring forth effective and impactful advancements in
drug development.
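As an illustration of how bioactivity records can be turned into a quantity suitable for ranking, the sketch below converts IC50 values to pIC50. This chapter does not state which potency measure is used, so the choice of pIC50 (the negative base-10 logarithm of the IC50 expressed in mol/L) is an assumption, and the function name, compound identifiers, and values are hypothetical.

```python
import math


def ic50_nm_to_pic50(ic50_nm: float) -> float:
    """Convert an IC50 given in nanomolar to pIC50 = -log10(IC50 in mol/L)."""
    if ic50_nm <= 0:
        raise ValueError("IC50 must be a positive concentration")
    # 1 nM = 1e-9 mol/L, so -log10(x * 1e-9) = 9 - log10(x)
    return 9.0 - math.log10(ic50_nm)


# Hypothetical records: (compound identifier, IC50 in nM)
records = [("CPD-A", 10.0), ("CPD-B", 1000.0), ("CPD-C", 250.0)]

# A lower IC50 means a more potent compound, which corresponds to a higher pIC50.
ranked = sorted(
    ((cid, ic50_nm_to_pic50(ic50)) for cid, ic50 in records),
    key=lambda pair: pair[1],
    reverse=True,
)

for cid, pic50 in ranked:
    print(f"{cid}: pIC50 = {pic50:.2f}")
```

Working on the logarithmic scale compresses the wide range of measured concentrations, which is one reason it is often preferred as a regression target.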

3.3 Objectives
• Data Mining and Bioactivity Analysis: Perform systematic data mining to identify
small molecules with significant bioactivity against selected targets associated with
specific diseases or biological processes.

• Computational Screening and Prioritization: Implement computational screening
techniques to prioritize candidate molecules based on their predicted binding
affinity, pharmacokinetic properties, and safety profiles. Develop and apply virtual
screening algorithms to filter and rank potential drug candidates from the
ChEMBL database.

• Validation and Lead Optimization: Integrate experimental data and
computational predictions to iteratively refine the list of prioritized drug
candidates, ensuring robustness and reliability in subsequent stages of preclinical
and clinical development.
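The screening and prioritization objective can be sketched as a filter followed by a ranking. The example below assumes Lipinski's rule of five as a simple stand-in for the pharmacokinetic filter (the objectives do not name a specific filter) and ranks the surviving candidates by a predicted potency value. The `Candidate` class, the field names, and all numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    mol_weight: float       # molecular weight in Da
    logp: float             # octanol-water partition coefficient
    h_donors: int           # hydrogen-bond donors
    h_acceptors: int        # hydrogen-bond acceptors
    predicted_pic50: float  # hypothetical model output


def lipinski_violations(c: Candidate) -> int:
    """Count how many of the four rule-of-five thresholds are exceeded."""
    return sum([
        c.mol_weight > 500,
        c.logp > 5,
        c.h_donors > 5,
        c.h_acceptors > 10,
    ])


def prioritize(candidates, max_violations=1):
    """Keep candidates within the violation budget, most potent first."""
    kept = [c for c in candidates if lipinski_violations(c) <= max_violations]
    return sorted(kept, key=lambda c: c.predicted_pic50, reverse=True)


candidates = [
    Candidate("CPD-A", 320.4, 2.1, 2, 5, 7.8),
    Candidate("CPD-B", 612.7, 6.3, 6, 12, 8.5),  # exceeds all four thresholds
    Candidate("CPD-C", 455.0, 4.7, 1, 8, 6.9),
    Candidate("CPD-D", 510.2, 3.0, 3, 9, 8.1),   # exceeds the weight threshold only
]

shortlist = prioritize(candidates)
for c in shortlist:
    print(c.name, c.predicted_pic50, lipinski_violations(c))
```

Here CPD-B is dropped despite having the highest predicted potency, which shows why filtering and ranking are kept as separate steps.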



Chapter 4

Software Requirements Specification

4.1 Introduction
The software requirements for this project outline the functionalities, interfaces, and
system components essential for developing a bioinformatics system for drug discovery.
These requirements ensure that the system meets the needs of researchers and
pharmaceutical stakeholders by providing accurate, timely, and user-friendly
predictions. The requirements are divided into functional requirements, which define
the system’s specific behaviors, and non-functional requirements, which establish
performance and quality criteria.

4.1.1 Requirements

1. Data Collection

Data is collected from the ChEMBL database by searching for the target protein of the
virus, and the relevant features are extracted.

2. Data Preprocessing

The collected data is cleaned and preprocessed to address missing values, rectify
inconsistencies, and standardize inputs, ensuring high-quality data for analysis.

3. Machine Learning Model

The system implements and trains a Random Forest machine learning model to predict
drug-target interactions and optimize lead compounds.

4. User Interface

A user-friendly web-based interface enables researchers to input biological
datasets, visualize results, and receive clear, actionable predictions.

5. Prediction Output

The system generates concise and informative reports on predicted drug efficacy,
complemented by visualizations to support decision-making.

4.2 Purpose
• Enhance the ability to predict drug efficacy and safety by analyzing biological data
and leveraging advanced machine learning techniques.

• Support researchers and pharmaceutical teams in identifying high-potential drug
candidates for further development.

• Employ ensemble learning techniques, such as Random Forest and XGBoost, to
ensure high accuracy and reliability in drug discovery predictions.

• Promote faster, cost-effective drug development strategies by utilizing
computational models to prioritize lead compounds.

• Contribute to the advancement of therapeutic solutions by integrating
bioinformatics and machine learning in the drug discovery process.

4.3 User Characteristics


The bioinformatics system for drug discovery is designed to cater to various user groups,
each with distinct roles, technical expertise, and specific needs. The software design
addresses the following user characteristics:

1. Researchers
Role: Scientists engaged in analyzing biological data to identify drug targets and evaluate
candidate compounds.
Technical Proficiency: Moderate to high, with familiarity in bioinformatics, data
analysis, and computational tools.


2. Drug Development Teams


Role: Multidisciplinary teams focused on advancing promising drug candidates to clinical
trials.
Technical Proficiency: High, with expertise in pharmacology, cheminformatics, and
computational modeling.

3. System Administrators
Role: IT professionals responsible for maintaining the system infrastructure, ensuring
data security, and ensuring system reliability.
Technical Proficiency: High, with expertise in database management, server
operations, and deployment of bioinformatics software systems.

4. Data Scientists
Role: Experts in machine learning and data modeling tasked with optimizing predictive
algorithms for drug discovery.
Technical Proficiency: High, with skills in programming, machine learning,
bioinformatics data processing, and statistical analysis.

4.4 Interfaces

4.4.1 Hardware Interfaces

• Processor: Multi-core processor compatible with machine learning workloads.

• Operating System: Cross-platform compatibility, including Windows, Linux, and
macOS.

• Memory: Minimum 16 GB RAM for efficient data processing.

• Development Environment: Jupyter Notebook, Visual Studio Code (VSCode),
Python environment.


4.4.2 Software Interfaces

• User Interface (UI): An intuitive platform developed using Python and Streamlit to
input biological datasets and view predictions, including interactive visualizations
of drug-target interactions.

• Backend Service Interface: Facilitates communication between the user interface
and the machine learning model using Streamlit, ensuring seamless operation and data
flow.

• Machine Learning Model Interface: The incorporated Random Forest algorithm enables
real-time predictions of drug efficacy based on input data.



Chapter 5

System Design

5.1 System Architecture Diagram


In this project, the architecture diagram will be used to visually represent the design
and workflow of the bioinformatics system. It will illustrate the key components, their
interactions, and how they contribute to predicting pEC50 values. This diagram will
serve as a guide to understand the system’s structure, data flow, and the integration of
computational tools and machine learning models, ensuring clarity and alignment with
the project’s objectives.

Figure 5.1: System Architecture Diagram

The process begins by collecting biological data, such as DNA, RNA, or protein
sequences, from databases or laboratory experiments. This data includes genomic
sequences, annotations, and experimental results. Once the data is collected, it
undergoes pre-processing, which involves cleaning, normalization, and transformation
into structured formats suitable for computational analysis. Feature selection is then
carried out to identify the most informative attributes, such as sequence motifs,
secondary structures, or molecular descriptors. This step reduces dimensionality and
enhances the efficiency and accuracy of the predictive models.
Machine learning techniques, including Random Forest, XGBoost, and ensemble
methods, are applied to analyze the curated datasets. These models are trained to
predict biological properties, such as protein function, binding affinities, or disease
associations. The predictive analysis aims to uncover patterns and relationships within
the data, enabling a deeper understanding of biological systems.
The final stage involves knowledge discovery, where the results are interpreted to
provide actionable insights. A web-based application, developed using Streamlit,
enables researchers and stakeholders to input sequence data and receive predictions or
biological insights in real-time. This integrated pipeline, encompassing data collection,
preprocessing, feature selection, machine learning, and interpretation, advances the field
of bioinformatics by offering tools for genomics, proteomics, and systems biology
research.

5.2 Class Diagram


A class diagram in bioinformatics represents the structural components of the system,
showcasing its classes, attributes, methods, and relationships. It facilitates the
understanding of data flow and interactions within the system.

Figure 5.2: Class Diagram

The system’s class-based architecture focuses on the analysis of biological sequences
and experimental data. Key components include:

• InputFile: This class is responsible for handling the input data in the form of
SMILES strings and their associated chemical identifiers (chemblID). It includes
the following:

– Attributes:

∗ smiles: A string containing the SMILES representation of a molecule.

∗ chemblID: A string representing the unique identifier of the compound
from the ChEMBL database.

– Methods:

∗ loadFile(): Loads the input data from a file.

∗ validateFile(): Validates the integrity and format of the input data.

• DescriptorGeneration: This class handles the generation of molecular
descriptors from the provided SMILES strings. It includes:

– Attributes:

∗ molecularDescriptors: A list of descriptors derived from SMILES strings.

– Methods:

∗ generateDescriptors(smiles: String): Processes SMILES strings to
generate molecular descriptors.

• FeatureSelection: This class is responsible for selecting relevant features from the
generated descriptors. It includes:

– Attributes:

∗ selectedFeatures: A list of the most predictive molecular descriptors.

– Methods:

∗ removeLowVarianceFeatures(features: List<String>): Eliminates
descriptors with low variance to improve model performance.

• MachineLearningModel: This class implements the machine learning framework
for training and predicting pIC50 values. It includes:

– Attributes:

∗ trainedModel: Stores the trained machine learning model.

– Methods:

∗ trainModel(data: List<String>): Trains the model on the provided dataset.

∗ predictpIC50(data: List<String>): Predicts the pIC50 value for a
given set of molecular descriptors.

• PredictionProcess: This class coordinates the entire pipeline from input
processing to prediction. It includes:

– Attributes:

∗ inputFile: An instance of the InputFile class.

∗ descriptors: An instance of the DescriptorGeneration class.

∗ featureSelection: An instance of the FeatureSelection class.

∗ model: An instance of the MachineLearningModel class.

– Methods:

∗ processPrediction(): Executes the entire prediction pipeline, from
input to output.

• Result: This class manages the output and presentation of predicted pIC50 values.
It includes:

– Attributes:

∗ predictedpIC50: The predicted pIC50 value as a float.

– Methods:

∗ displayResult(): Displays the predicted pIC50 value to the user.

These components interact to provide a seamless pipeline from data acquisition to
predictive analysis. The class model enables the integration of diverse data types while
maintaining modularity, ensuring adaptability for various bioinformatics applications.
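
The class relationships described above can be sketched in Python. This is a minimal illustration and not the project's actual implementation: the names InputRecord, toy_descriptors, and the linear scoring function are hypothetical stand-ins, and the ChEMBL identifier is made up. The descriptor, feature-selection, and model stages are passed in as plain callables so that the coordinating class stays independent of any particular library.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class InputRecord:
    """Mirrors InputFile: one SMILES string with its ChEMBL identifier."""
    smiles: str
    chembl_id: str

    def is_valid(self) -> bool:
        # Format check only; this does not verify chemical validity.
        return bool(self.smiles.strip()) and self.chembl_id.startswith("CHEMBL")


@dataclass
class Result:
    """Mirrors Result: a predicted potency value for one compound."""
    chembl_id: str
    predicted_value: float


@dataclass
class PredictionProcess:
    """Mirrors PredictionProcess: wires the pipeline stages together."""
    generate_descriptors: Callable[[str], List[float]]      # DescriptorGeneration
    select_features: Callable[[List[float]], List[float]]   # FeatureSelection
    predict: Callable[[List[float]], float]                 # MachineLearningModel

    def process_prediction(self, records: List[InputRecord]) -> List[Result]:
        results = []
        for record in records:
            if not record.is_valid():
                continue  # skip malformed rows rather than failing the batch
            descriptors = self.generate_descriptors(record.smiles)
            features = self.select_features(descriptors)
            results.append(Result(record.chembl_id, self.predict(features)))
        return results


# Toy stand-ins (hypothetical): atom-letter counts as "descriptors",
# the first two features as the "selection", a fixed linear score as the "model".
def toy_descriptors(smiles: str) -> List[float]:
    return [float(smiles.count(atom)) for atom in ("C", "N", "O")]


pipeline = PredictionProcess(
    generate_descriptors=toy_descriptors,
    select_features=lambda d: d[:2],
    predict=lambda f: 5.0 + 0.1 * sum(f),
)
results = pipeline.process_prediction(
    [InputRecord("CCO", "CHEMBL0000001"), InputRecord("", "CHEMBL0000002")]
)
```

Passing the stages as callables is one possible design choice; it lets a real descriptor generator or trained model replace the toy functions without changing PredictionProcess.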

5.3 State Diagram


A state diagram is a UML behavioral representation that models the dynamic behavior
of a bioinformatics system, describing how it transitions between states based on events
or conditions.

Figure 5.3: State Diagram

The state diagram illustrates the workflow for bioinformatics data analysis. The
system starts in the “Data Collection” state, retrieving raw biological data from databases
or experiments. It then transitions to “Preprocessing,” where the data is cleaned and
structured. Following this, the system enters the “Feature Selection” state, identifying
key biological features such as motifs, descriptors, or annotations.
Next, the system moves to the “Model Training” state, where machine learning
algorithms are applied to train models on labeled data. The trained models are
evaluated for accuracy in the “Model Evaluation” state. Once validated, the system
transitions to “Prediction,” where users can input new data to receive predictions and
insights. Finally, the system enters the “Knowledge Discovery” state, generating
actionable insights and hypotheses for further research.
This workflow highlights the progression from raw data to actionable biological
insights, ensuring a robust and dynamic bioinformatics analysis pipeline.
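
The state sequence described above can be expressed as a small state machine. The sketch below is illustrative only: it assumes the transitions are strictly linear, as the text describes, and the names State, TRANSITIONS, next_state, and run_workflow are hypothetical rather than taken from the project code.

```python
from enum import Enum, auto


class State(Enum):
    DATA_COLLECTION = auto()
    PREPROCESSING = auto()
    FEATURE_SELECTION = auto()
    MODEL_TRAINING = auto()
    MODEL_EVALUATION = auto()
    PREDICTION = auto()
    KNOWLEDGE_DISCOVERY = auto()


# Each state has exactly one successor; KNOWLEDGE_DISCOVERY is terminal.
TRANSITIONS = {
    State.DATA_COLLECTION: State.PREPROCESSING,
    State.PREPROCESSING: State.FEATURE_SELECTION,
    State.FEATURE_SELECTION: State.MODEL_TRAINING,
    State.MODEL_TRAINING: State.MODEL_EVALUATION,
    State.MODEL_EVALUATION: State.PREDICTION,
    State.PREDICTION: State.KNOWLEDGE_DISCOVERY,
}


def next_state(current: State) -> State:
    """Return the successor state, or raise if the workflow has finished."""
    if current not in TRANSITIONS:
        raise ValueError(f"{current.name} is a terminal state")
    return TRANSITIONS[current]


def run_workflow(start: State = State.DATA_COLLECTION) -> list:
    """Follow transitions from the start state until a terminal state is reached."""
    path = [start]
    while path[-1] in TRANSITIONS:
        path.append(next_state(path[-1]))
    return path
```

A failed evaluation could be modeled by adding a second edge from MODEL_EVALUATION back to MODEL_TRAINING; the text does not describe such a loop, so it is left out here.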

5.4 Use Case Diagram


A Use Case Diagram is a UML tool that visually represents the functional requirements
of a bioinformatics system. It illustrates the interactions between users and system
functionalities.
The bioinformatics system, as shown in Figure 5.4, supports tasks such as data
retrieval, analysis, and interpretation. Users, including researchers and
bioinformaticians, can interact with the system to upload biological data, initiate
feature extraction, and train predictive models.
The system preprocesses and normalizes the data, identifies critical features, and uses
machine learning models such as Random Forest and XGBoost to perform tasks like
sequence classification, protein function prediction, or disease association studies. By
integrating machine learning and biological insights, the system enables efficient data
analysis and provides tools for hypothesis generation, advancing research in genomics,
proteomics, and molecular biology.

Figure 5.4: Use Case Diagram


Chapter 6

Implementation

6.1 Algorithm
An algorithm is a step-by-step, systematic procedure or set of rules designed to perform a
specific task or solve a problem. Algorithms are fundamental to computer science and
are used to manipulate data, perform calculations, or automate reasoning tasks.

6.1.1 Random Forest

Random Forest is an ensemble learning algorithm that builds multiple decision trees and
combines their results to improve predictive accuracy. In this project it is used to model
the relationship between molecular descriptors (PubChem fingerprints) and compound
potency, predicting pEC50 values for compounds acting on the target protein.


Algorithm 1 Random Forest Training Algorithm for pEC50 Prediction

1. procedure TrainRandomForest(processedDataset)

2. Split dataset into training (X_train, y_train) and testing (X_test, y_test) sets

3. Initialize RandomForestRegressor with n_estimators = 100, max_depth = 10

4. Train model on (X_train, y_train)

5. Predict on test set: y_pred ← model.predict(X_test)

6. Compute MSE and R² score:

MSE ← mean_squared_error(y_test, y_pred)

R² ← r2_score(y_test, y_pred)

7. return trained model

8. end procedure
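
Step 6 of the algorithm relies on two standard regression metrics. The sketch below spells out their definitions in plain Python so the calculation is visible. The functions are hand-written stand-ins that share names with the scikit-learn metrics, and the pEC50 values are invented for illustration. Unlike the library version, this r2_score raises an error when the targets are constant.

```python
from typing import Sequence


def mean_squared_error(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Average of squared differences between observed and predicted values."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_score(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when all observed values are equal")
    return 1 - ss_res / ss_tot


# Invented example values (not project results).
y_test = [5.0, 6.0, 7.0, 8.0]
y_pred = [5.5, 6.0, 6.5, 8.0]
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
```

An R² close to 1 means the predictions explain most of the variance in the observed values, while the MSE reports the average squared error in squared pEC50 units.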

6.2 Flow Chart


A flowchart is a graphical representation of a process, system, or algorithm that uses
various symbols to represent the flow of control and data. Flowcharts are used to visualize
the sequence of steps involved in solving a problem or completing a task. They provide
a clear and simple way to represent complex processes or algorithms.

Figure 6.1: Flow Chart

• Data Collection: Retrieve bioactivity data from the ChEMBL database, which
provides details about compound activities against biological targets.

• Descriptor Calculation: Calculate molecular descriptors, including
physicochemical properties (e.g., the Lipinski Rule of Five descriptors) and structural
fingerprints, to characterize compounds.

• Data Transformation: Convert EC50 values into pEC50 values using a
negative logarithmic scale for normalization and better interpretability.

• Dataset Preparation: Combine molecular descriptors with bioactivity data into
a single dataset. Perform preprocessing tasks, such as cleaning, normalization, and
structuring the dataset for analysis.

• Machine Learning: Train a Random Forest regression model on the prepared
dataset to predict compound activities. The algorithm learns from historical data
to make predictions on unseen compounds.

• Prediction: Use the trained model to predict the activity of new compounds based
on their molecular descriptors.

• Evaluation: Analyze the performance of the model using evaluation metrics
(e.g., R², mean squared error) to assess prediction accuracy. Identify promising
compounds for further investigation.
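
The data transformation step can be illustrated with a short sketch. By definition, pEC50 is the negative base-10 logarithm of the EC50 expressed in mol/L. The function below assumes the input value is given in nanomolar (nM), which should be checked against the units column of the retrieved data; the function name is hypothetical.

```python
import math


def ec50_nm_to_pec50(ec50_nm: float) -> float:
    """Convert an EC50 in nanomolar to pEC50 = -log10(EC50 in mol/L)."""
    if ec50_nm <= 0:
        raise ValueError("EC50 must be positive")
    ec50_molar = ec50_nm * 1e-9  # nM -> mol/L (assumes nM input)
    return -math.log10(ec50_molar)


# A 1000 nM (1 µM) compound has pEC50 of about 6; a 10 nM compound about 8.
weak = ec50_nm_to_pec50(1000.0)
strong = ec50_nm_to_pec50(10.0)
```

Because the scale is a negative logarithm, a higher pEC50 corresponds to a more potent compound, and each unit represents a tenfold change in EC50.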

6.3 Implementation Codes

6.3.1 Bioactivity Data Collection and Cleaning

# Install the ChEMBL web resource client (notebook shell command)
!pip install chembl_webresource_client

# Import necessary libraries
import pandas as pd
from chembl_webresource_client.new_client import new_client

# Search for the target polymerase basic protein2 (PB2)
target = new_client.target
target_query = target.search('polymerase basic protein2 (PB2)')
targets = pd.DataFrame.from_dict(target_query)
targets

# Select the target ChEMBL ID (first search hit)
selected_target = targets.target_chembl_id[0]
selected_target

# Retrieve EC50 activity data for the selected target
activity = new_client.activity
res = activity.filter(target_chembl_id=selected_target).filter(standard_type="EC50")

# Convert the results into a pandas DataFrame
df = pd.DataFrame.from_dict(res)
df

# Save raw bioactivity data to CSV
df.to_csv('polymerase_basic_protein2_(PB2)_01_bioactivity_data_raw.csv', index=False)

# Filter out rows with missing values in 'standard_value' and 'canonical_smiles'
df2 = df[df.standard_value.notna()]
# Build the second mask from df2 itself so the mask and the frame share one index
df2 = df2[df2.canonical_smiles.notna()]
df2

# Remove duplicate canonical SMILES
df2_nr = df2.drop_duplicates(['canonical_smiles'])
df2_nr

# Select relevant columns
selection = ['molecule_chembl_id', 'canonical_smiles', 'standard_value']
df3 = df2_nr[selection]
df3

# Save preprocessed bioactivity data to CSV
df3.to_csv('polymerase_basic_protein2_(PB2)_02_bioactivity_data_preprocessed.csv', index=False)
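
The three cleaning steps in the listing above (drop rows with missing values, drop duplicate SMILES, keep three columns) can be traced on a tiny example without any external library. This is only an illustrative sketch: the records and identifiers are invented, the function name is hypothetical, and only None is treated as missing, whereas pandas also treats NaN as missing.

```python
KEEP = ("molecule_chembl_id", "canonical_smiles", "standard_value")


def clean_records(records):
    """Drop incomplete rows, keep the first row per SMILES, keep selected keys."""
    seen_smiles = set()
    cleaned = []
    for rec in records:
        if rec.get("standard_value") is None or rec.get("canonical_smiles") is None:
            continue  # missing potency value or structure
        if rec["canonical_smiles"] in seen_smiles:
            continue  # duplicate structure; the first occurrence wins
        seen_smiles.add(rec["canonical_smiles"])
        cleaned.append({key: rec.get(key) for key in KEEP})
    return cleaned


# Invented example records (not real ChEMBL data).
raw = [
    {"molecule_chembl_id": "CHEMBL0000001", "canonical_smiles": "CCO",
     "standard_value": "120.0", "standard_type": "EC50"},
    {"molecule_chembl_id": "CHEMBL0000002", "canonical_smiles": "CCN",
     "standard_value": None, "standard_type": "EC50"},
    {"molecule_chembl_id": "CHEMBL0000003", "canonical_smiles": "CCO",
     "standard_value": "95.0", "standard_type": "EC50"},
    {"molecule_chembl_id": "CHEMBL0000004", "canonical_smiles": "c1ccccc1",
     "standard_value": "10.0", "standard_type": "EC50"},
]
cleaned = clean_records(raw)
```

Keeping the first occurrence matches the default behavior of pandas drop_duplicates; when the same structure has several measurements, averaging them would be an alternative worth considering.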

6.3.2 Polymerase basic protein2 (PB2) Exploratory Data Analysis

import numpy as np
import pandas as pd
from rdkit import Chem
from rdkit.Chem import Descriptors, Lipinski

# Load the bioactivity data
df = pd.read_csv('/Users/raysonfernandes/Desktop/project/polymerase_basic_protein2_(PB2)_01_bioactivity_data_raw.csv')
df

# Remove the 'canonical_smiles' column temporarily
df_no_smiles = df.drop(columns='canonical_smiles')

# Clean and extract the longest SMILES for compounds with multiple representations
# (fragments such as salts are separated by '.' in a SMILES string)
smiles = []
for i in df.canonical_smiles.tolist():
    cpd = str(i).split('.')
    cpd_longest = max(cpd, key=len)
    smiles.append(cpd_longest)

# Create a new DataFrame with cleaned SMILES
smiles = pd.Series(smiles, name='canonical_smiles')
df_clean_smiles = pd.concat([df_no_smiles, smiles], axis=1)
df_clean_smiles

# Define Lipinski descriptors calculation function
def lipinski(smiles, verbose=False):
    # Parse every SMILES string into an RDKit molecule
    moldata = []
    for elem in smiles:
        mol = Chem.MolFromSmiles(elem)
        moldata.append(mol)

    # Build one row of four descriptors per molecule
    baseData = np.arange(1, 1)
    i = 0
    for mol in moldata:
        desc_MolWt = Descriptors.MolWt(mol)
        desc_MolLogP = Descriptors.MolLogP(mol)
        desc_NumHDonors = Lipinski.NumHDonors(mol)
        desc_NumHAcceptors = Lipinski.NumHAcceptors(mol)

        row = np.array([desc_MolWt, desc_MolLogP, desc_NumHDonors,
                        desc_NumHAcceptors])
        if i == 0:
            baseData = row
        else:
            baseData = np.vstack([baseData, row])
        i += 1

    columnNames = ["Molecular_Weight", "LogP", "Num_H_Donors", "Num_H_Acceptors"]
    descriptors = pd.DataFrame(data=baseData, columns=columnNames)
    return descriptors

# Apply Lipinski rule calculation
df_lipinski = lipinski(df_clean_smiles.canonical_smiles)
df_combined = pd.concat([df_clean_smiles, df_lipinski], axis=1)
df_combined

# Save the final data to CSV
df_combined.to_csv('/Users/raysonfernandes/Desktop/project/polymerase_basic_protein2_(PB2)_03_lipinski_descriptors.csv', index=False)
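
The listing above computes the four Lipinski descriptors but does not apply the Rule of Five thresholds themselves. A minimal sketch of that check follows, using the column names from the listing. The thresholds are the standard ones (molecular weight at most 500, LogP at most 5, at most 5 hydrogen-bond donors, at most 10 hydrogen-bond acceptors); allowing one violation is a common reading of the rule and is an assumption here, as are the function names and the example descriptor values.

```python
def lipinski_violations(row: dict) -> int:
    """Count how many of the four Rule of Five limits a compound exceeds."""
    checks = (
        row["Molecular_Weight"] > 500,
        row["LogP"] > 5,
        row["Num_H_Donors"] > 5,
        row["Num_H_Acceptors"] > 10,
    )
    return sum(checks)


def passes_rule_of_five(row: dict, max_violations: int = 1) -> bool:
    """Treat a compound as drug-like if it has at most max_violations violations."""
    return lipinski_violations(row) <= max_violations


# Invented descriptor rows for illustration (not computed from real molecules).
small_molecule = {"Molecular_Weight": 350.4, "LogP": 2.1,
                  "Num_H_Donors": 2, "Num_H_Acceptors": 5}
large_molecule = {"Molecular_Weight": 720.9, "LogP": 6.3,
                  "Num_H_Donors": 6, "Num_H_Acceptors": 9}
```

Applied to each row of the descriptor table, such a check could serve as a quick drug-likeness filter before model training.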

6.3.3 Descriptor Dataset Preparation

# Download necessary files
!wget https://github.com/dataprofessor/bioinformatics/raw/master/padel.zip
!wget https://github.com/dataprofessor/bioinformatics/raw/master/padel.sh

# Unzip the downloaded file
!unzip padel.zip

# Import required library
import pandas as pd

# Load the dataset
df3 = pd.read_csv('polymerase_basic_protein2_(PB2)_04_bioactivity_data_3class_pEC50.csv')
df3

# Select relevant columns and write them as a tab-separated SMILES file
selection = ['canonical_smiles', 'molecule_chembl_id']
df3_selection = df3[selection]
df3_selection.to_csv('molecule.smi', sep='\t', index=False, header=False)

# Preview the SMILES file
!cat molecule.smi | head -5

# Count the number of lines in the SMILES file
!cat molecule.smi | wc -l

# Display the PaDEL descriptor script
!cat padel.sh

# Run the PaDEL descriptor generation script
!bash padel.sh

# List generated files
!ls -l

# Load the PaDEL descriptor output
df3_X = pd.read_csv('descriptors_output.csv')
df3_X

# Drop the 'Name' column
df3_X = df3_X.drop(columns=['Name'])
df3_X

# Extract the target variable
df3_Y = df3['pEC50']
df3_Y

# Combine descriptors and target variable into one dataset
dataset3 = pd.concat([df3_X, df3_Y], axis=1)
dataset3

# Save the final dataset to CSV (same file name that Section 6.3.4 loads)
dataset3.to_csv('polymerase_basic_protein2_(PB2)_06_bioactivity_data_3class_pEC50_pubchem_fp.csv', index=False)

6.3.4 Random Forest Regressor implementation

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import VarianceThreshold

# Load the dataset
df = pd.read_csv('polymerase_basic_protein2_(PB2)_06_bioactivity_data_3class_pEC50_pubchem_fp.csv')

# Split the dataset into features (X) and target variable (Y)
X = df.drop('pEC50', axis=1)
Y = df.pEC50

# Display shapes of X and Y
X.shape
Y.shape

# Perform feature selection using VarianceThreshold
selection = VarianceThreshold(threshold=(.8 * (1 - .8)))
X = selection.fit_transform(X)

# Updated shape after feature selection
X.shape

# Split data into training and testing sets
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2)
X_train.shape, Y_train.shape
X_test.shape, Y_test.shape

# Train a Random Forest Regressor
model = RandomForestRegressor(n_estimators=100)
model.fit(X_train, Y_train)

# Calculate R^2 score on the test set
r2 = model.score(X_test, Y_test)
r2

# Predict on the test set
Y_pred = model.predict(X_test)

# Visualization using Seaborn
sns.set(color_codes=True)
sns.set_style("white")

# Create a regression plot of experimental versus predicted values
ax = sns.regplot(x=Y_test, y=Y_pred, scatter_kws={'alpha': 0.4})
ax.set_xlabel('Experimental pEC50', fontsize='large', fontweight='bold')
ax.set_ylabel('Predicted pEC50', fontsize='large', fontweight='bold')
plt.show()
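
The feature selection step in the listing above uses the threshold 0.8 × (1 − 0.8) = 0.16, which is the variance of a binary variable that equals 1 with probability 0.8. For binary fingerprint bits, this means a column is discarded when it takes the same value in more than 80% of the compounds. The sketch below reproduces that rule in plain Python on invented data. It is a hand-written illustration of the idea, not the scikit-learn implementation, and the function names are hypothetical.

```python
def population_variance(column):
    """Variance with denominator n (not n - 1)."""
    mean = sum(column) / len(column)
    return sum((x - mean) ** 2 for x in column) / len(column)


def low_variance_filter(rows, threshold):
    """Keep only the columns whose variance is strictly above the threshold."""
    columns = list(zip(*rows))
    kept = [i for i, col in enumerate(columns)
            if population_variance(col) > threshold]
    filtered = [[row[i] for i in kept] for row in rows]
    return kept, filtered


threshold = 0.8 * (1 - 0.8)

# Invented fingerprint bits for 10 compounds:
#   column 0 is 1 in nine of ten rows (variance 0.09, dropped)
#   column 1 alternates 0 and 1      (variance 0.25, kept)
#   column 2 is always 1             (variance 0.0, dropped)
rows = [[0 if i == 0 else 1, i % 2, 1] for i in range(10)]
kept, filtered = low_variance_filter(rows, threshold)
```

Columns that are nearly constant carry little information for distinguishing compounds, so removing them shrinks the feature matrix before the Random Forest is trained.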

6.3.5 Streamlit Application for Predicting Potency of the Molecule

import streamlit as st
import pandas as pd
from PIL import Image
import subprocess
import os
import base64
import pickle

# Molecular descriptor calculator
def desc_calc():
    # Performs the descriptor calculation by running PaDEL-Descriptor via Java
    bashCommand = (
        "java -Xms2G -Xmx2G -Djava.awt.headless=true "
        "-jar ./PaDEL-Descriptor/PaDEL-Descriptor.jar "
        "-removesalt -standardizenitro -fingerprints "
        "-descriptortypes ./PaDEL-Descriptor/PubchemFingerprinter.xml "
        "-dir ./ -file descriptors_output.csv"
    )
    process = subprocess.Popen(bashCommand.split(), stdout=subprocess.PIPE)
    output, error = process.communicate()
    os.remove('molecule.smi')

# File download
def filedownload(df):
    csv = df.to_csv(index=False)
    b64 = base64.b64encode(csv.encode()).decode()  # strings <-> bytes conversions
    href = f'<a href="data:file/csv;base64,{b64}" download="prediction.csv">Download Predictions</a>'
    return href

# Model building
def build_model(input_data):
    # Reads in saved regression model
    with open('acetylcholinesterase_model.pkl', 'rb') as model_file:
        load_model = pickle.load(model_file)
    predictions = load_model.predict(input_data)
    return predictions
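
The download helper in the listing above embeds the CSV text in a link by Base64-encoding it into a data URI. The sketch below shows the same encoding on a plain string and decodes it again to confirm that nothing is lost. The function names and the example CSV content are hypothetical.

```python
import base64


def make_download_link(csv_text: str, filename: str = "prediction.csv") -> str:
    """Build an HTML link whose href carries the CSV as a Base64 data URI."""
    b64 = base64.b64encode(csv_text.encode()).decode()
    return (f'<a href="data:file/csv;base64,{b64}" '
            f'download="{filename}">Download Predictions</a>')


def extract_payload(link: str) -> str:
    """Recover the original text from a link produced by make_download_link."""
    start = link.index("base64,") + len("base64,")
    end = link.index('"', start)
    return base64.b64decode(link[start:end]).decode()


# Invented prediction table for illustration.
csv_text = "molecule_name,pEC50\nCHEMBL0000001,6.25\n"
link = make_download_link(csv_text)
recovered = extract_payload(link)
```

Because the file content travels inside the link itself, no temporary file has to be written on the server; the trade-off is that very large result tables produce very long links.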

Department of CSE(AI&ML), SCEM, Mangaluru Page 33


Chapter 7

Results and Discussion

7.1 Outcomes
The final results of the drug discovery project for predicting pIC50 values demonstrate
the effectiveness of integrating molecular descriptors with advanced machine learning
techniques. Using Random Forest, together with the LazyPredict library for comparing
candidate models, the framework achieves
high prediction accuracy, showcasing its ability to identify compounds with desired
bioactivity. Key features such as molecular fingerprints and standardized descriptors
were leveraged to ensure robust predictions. These results highlight the potential of
utilizing cheminformatics and bioinformatics approaches to accelerate drug discovery for
target proteins, promoting efficiency and precision in pharmaceutical research.
Furthermore, the framework’s performance was validated against diverse datasets,
ensuring its adaptability and reliability across different chemical structures. This
versatility underscores the model’s potential as a valuable tool for researchers working
with compounds of varying complexity. By leveraging the predictive capabilities of
machine learning, the project significantly reduces reliance on time-consuming
experimental screening processes. This approach not only saves resources but also
enhances the prioritization of high-potential candidates for further investigation.

7.1.1 Home Page

This project focuses on predicting pIC50 values for chemical compounds with potential
drug activity using a streamlined computational approach. By analyzing molecular
features derived from SMILES strings and processed via PaDEL-Descriptor, the tool
provides researchers with a robust resource for evaluating compound potency. The
platform is designed to simplify drug discovery workflows by offering precise insights
that are easy to interpret through a user-friendly interface. This initiative supports
data-driven pharmaceutical research, enabling researchers to make informed decisions
that enhance efficiency and precision in drug development.

Figure 7.1: Home Page

The integration of cheminformatics with machine learning marks a significant
advancement in the field, allowing researchers to process large volumes of chemical data
with minimal manual intervention. By automating feature extraction and prediction
processes, the platform ensures consistency in results and minimizes errors associated
with manual data handling. Additionally, the use of standardized molecular descriptors
facilitates cross-study comparisons and promotes reproducibility, which is a critical
factor in pharmaceutical research.
The tool’s user-centric design enables seamless interaction, ensuring that researchers
with varying levels of expertise in cheminformatics can utilize its features effectively.
The web-based interface simplifies data input and visualization, enabling users to
upload chemical structures in SMILES format and obtain predictions quickly. This
accessibility broadens the tool’s potential audience, from academic researchers to
industry professionals, fostering widespread adoption.
Beyond the immediate goal of pIC50 prediction, this project lays a foundation for
broader applications in computational drug discovery. Its framework can be adapted to
predict other molecular properties, such as ADMET (Absorption, Distribution,
Metabolism, Excretion, and Toxicity) parameters, enabling a more comprehensive
evaluation of drug candidates. The project’s success demonstrates the transformative
impact of combining computational chemistry with machine learning, paving the way
for innovative approaches in drug development.
In conclusion, the project serves as a testament to the power of AI-driven strategies
in modern pharmaceutical research. By bridging computational chemistry and machine
learning, the project not only accelerates the drug discovery process but also promotes
innovation and accuracy. With future enhancements, including the incorporation of
additional molecular descriptors, more diverse datasets, and advanced predictive
algorithms, the platform has the potential to revolutionize the field of cheminformatics
and contribute significantly to the discovery of safer and more effective therapeutic
agents.

7.1.2 Input File Processing

The system accepts a text file as input, containing a list of SMILES strings along with
their corresponding ChEMBL IDs. The following steps outline the predictive workflow:

• Input File: Users provide a text file with the chemical structures represented as
SMILES and linked to their respective ChEMBL IDs.

• Descriptor Generation: PaDEL-Descriptor computes molecular descriptors and
fingerprints from the SMILES strings.

• Feature Selection: Low-variance features are removed to ensure a robust dataset.

• Machine Learning Model: The processed molecular descriptors are fed into a
pre-trained Random Forest model to predict pIC50 values for the input compounds.

Prediction Process: After uploading the input file, the system validates chemical
structures, generates molecular descriptors, and applies machine learning techniques to
predict the pIC50 values.
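
The input step can be sketched as a small parser. The sketch assumes each non-empty line of the text file holds a SMILES string followed by a ChEMBL ID, separated by whitespace, with no header row; the actual file layout should be confirmed against the application. The function name and the example identifiers are hypothetical.

```python
def parse_smiles_file(text: str):
    """Split input text into (smiles, chembl_id) pairs and report bad lines."""
    records = []
    bad_lines = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        stripped = line.strip()
        if not stripped:
            continue  # ignore blank lines
        parts = stripped.split()
        if len(parts) != 2:
            bad_lines.append(lineno)  # wrong number of fields
            continue
        records.append((parts[0], parts[1]))
    return records, bad_lines


# Invented example input: two good lines, one blank line, one malformed line.
example = "CCO\tCHEMBL0000001\n\nc1ccccc1 CHEMBL0000002\nCCN\n"
records, bad_lines = parse_smiles_file(example)
```

Reporting the line numbers of malformed rows, rather than stopping at the first one, lets a user correct the whole file in a single pass.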

7.1.3 Result

The predictive model evaluates the bioactivity of chemical compounds by analyzing
their molecular descriptors. The system preprocesses the SMILES strings from the
input file to generate standardized molecular descriptors using PaDEL-Descriptor.
These descriptors are fed into the machine learning model, which estimates the pIC50
values—a measure of drug potency. The output is presented in a structured format,
allowing researchers to identify and prioritize potential lead compounds for further
investigation.
This streamlined workflow ensures reproducibility and accuracy, empowering
researchers with a powerful tool for drug discovery targeting specific proteins.

Figure 7.2: Calculated molecular descriptors

Figure 7.3: Result: prediction output

7.1.4 Timeline of the Project Work

Table 7.1: Timeline of the Project Work

Months covered (2024): Apr, May, Jun, Jul, Aug, Sep, Oct, Nov, Dec

Tasks:
• Selection of Topic
• Literature Review
• Synopsis Report and PPT Preparation
• Experimenting with Potential Methodologies
• Presentation to Panel
• Model Training and Selection
• Model Testing
• Model Evaluation
• Model Parameters Optimization
• Validation
• Preparation of Project Report

7.1.5 Outcomes Obtained

The following table maps the outcomes of the drug discovery project to their
corresponding objectives and provides measurements of achievement.

Table 7.2: Outcomes Obtained

1. Objective: To develop a robust predictive model for estimating pIC50 values for
   chemical compounds
   Outcome: A machine learning model capable of predicting pIC50 values based on
   molecular descriptors derived from SMILES strings
   Achievement Measurement: Successful implementation of a Random Forest model with
   high accuracy

2. Objective: To leverage molecular descriptors and fingerprints for accurate
   predictions of compound potency
   Outcome: Generation of molecular features using PaDEL-Descriptor and integration
   into the prediction model
   Achievement Measurement: Effective use of molecular descriptors to enhance
   prediction accuracy

3. Objective: To streamline the drug discovery process for identifying bioactive
   compounds targeting specific proteins
   Outcome: A workflow that enables rapid evaluation of compound potency based on
   pIC50 values
   Achievement Measurement: Reduced time for initial compound screening in drug
   discovery pipelines

4. Objective: To standardize input data for consistent and reliable predictions
   Outcome: Validation and preprocessing of SMILES strings to ensure the integrity of
   molecular descriptors
   Achievement Measurement: Enhanced reliability of predictions due to standardized
   input processing

5. Objective: To provide a user-friendly platform for researchers to evaluate
   compound activity
   Outcome: A web-based interface for inputting chemical data and obtaining pIC50
   predictions
   Achievement Measurement: Positive user feedback on ease of use and practicality
   for drug discovery tasks

6. Objective: To analyze the relationship between molecular features and compound
   bioactivity
   Outcome: Identification of key molecular descriptors influencing compound potency
   predictions
   Achievement Measurement: Insights into critical features that drive bioactivity
   predictions

7. Objective: To incorporate a comprehensive set of features into the predictive
   framework
   Outcome: Integration of molecular fingerprints and standardized descriptors to
   ensure robust modeling
   Achievement Measurement: Improved model performance through feature selection and
   engineering

8. Objective: To support data-driven drug discovery by offering actionable insights
   into compound efficacy
   Outcome: Delivery of accurate pIC50 predictions to guide prioritization of lead
   compounds
   Achievement Measurement: Increased efficiency in identifying high-potential drug
   candidates
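
The feature selection mentioned in Table 7.2 is not detailed in this chapter. One common first pass over a descriptor matrix is to drop near-constant columns and then drop one column from each highly correlated pair; the sketch below illustrates that idea in plain Python. The function names `pearson` and `filter_descriptors`, the thresholds, and the toy descriptor values are assumptions for illustration, not the project's actual procedure.

```python
import math
import statistics


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def filter_descriptors(columns, min_variance=1e-4, max_abs_corr=0.95):
    """Return the names of descriptor columns kept after two simple filters.

    `columns` maps descriptor name -> list of values (one per compound).
    1. Drop columns whose population variance is below `min_variance`.
    2. Scan the remaining columns in order and drop any column whose absolute
       correlation with an already-kept column exceeds `max_abs_corr`.
    """
    informative = [
        name for name, values in columns.items()
        if statistics.pvariance(values) >= min_variance
    ]
    kept = []
    for name in informative:
        if all(abs(pearson(columns[name], columns[k])) <= max_abs_corr for k in kept):
            kept.append(name)
    return kept


# Toy descriptor table (hypothetical values for four compounds).
descriptors = {
    "MW": [180.2, 151.2, 206.3, 194.2],
    "MW_scaled": [1.802, 1.512, 2.063, 1.942],  # perfectly correlated with MW
    "nAtom_const": [1.0, 1.0, 1.0, 1.0],        # constant, carries no information
    "LogP": [1.2, 0.5, 3.5, -0.1],
}
print(filter_descriptors(descriptors))
```

The order of the columns matters in the second filter: the first column of a correlated pair is the one retained, so a domain-informed ordering can be used to prefer more interpretable descriptors.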

7.1.6 Objectives Achieved

1 To develop a predictive model for estimating pIC50 values of chemical compounds

Achievement: The project successfully built a machine learning model that predicts
pIC50 values for chemical compounds using molecular descriptors generated from
SMILES strings. The Random Forest model demonstrated high predictive accuracy,
supporting reliable estimation of compound potency.

2 To enable rapid screening of chemical compounds for bioactivity

Achievement: The predictive workflow streamlines the process of evaluating
compound bioactivity, reducing the time required for initial screening. This
facilitates efficient identification of promising candidates for further drug
development.

3 To create an accessible and user-friendly platform for researchers

Achievement: The project includes a web-based platform that allows researchers to
upload chemical data in text file format and obtain pIC50 predictions. The intuitive
interface ensures that users can quickly process input data and interpret results.

4 To integrate molecular descriptors and fingerprints into the predictive model

Achievement: Molecular descriptors and fingerprints derived from SMILES strings
using PaDEL-Descriptor were integrated into the model. This comprehensive
feature set enhanced the predictive power of the framework.

5 To standardize input data and ensure consistency in predictions

Achievement: SMILES strings were validated and processed to generate
standardized molecular descriptors, minimizing errors and inconsistencies. This
preprocessing step ensured reliable predictions across diverse input compounds.

6 To analyze the role of molecular features in determining compound potency

Achievement: The project identified key molecular descriptors that influence
compound bioactivity. These insights provide valuable information for
understanding the structural features that drive potency.

7 To support data-driven drug discovery through accurate predictions

Achievement: The predictive model delivers precise pIC50 values, helping
researchers prioritize compounds with high bioactivity. This data-driven approach
enhances decision-making in the drug discovery process.

8 To enhance the efficiency of drug discovery pipelines

Achievement: By providing accurate and rapid predictions of compound potency,
the project contributes to optimizing the drug discovery pipeline. Researchers can
focus resources on high-potential candidates, improving the overall efficiency of the
process.

9 To demonstrate the scalability of machine learning in handling large datasets

Achievement: The model was tested on extensive datasets, demonstrating its ability
to scale and maintain predictive accuracy. This scalability ensures applicability to
real-world drug discovery scenarios involving large chemical libraries.

10 To validate the model’s performance across diverse chemical spaces

Achievement: The model’s robustness was evaluated using external datasets with
varied chemical structures. Consistent performance across these datasets confirmed
the model’s generalizability.

11 To reduce dependency on traditional wet-lab experiments for initial screening

Achievement: By enabling virtual screening of compounds, the project reduces the
reliance on resource-intensive laboratory experiments. This approach saves both
time and costs in early-stage drug discovery.

12 To encourage reproducibility through open-source tools and frameworks

Achievement: The project utilized open-source software such as
PaDEL-Descriptor and scikit-learn, ensuring that the methodology is
reproducible and accessible to the scientific community.

13 To provide insights into chemical space exploration for lead optimization

Achievement: The predictive model aids researchers in exploring the chemical
space, offering recommendations for structural modifications that improve
compound potency.

14 To facilitate collaboration among interdisciplinary teams

Achievement: The user-friendly platform and standardized outputs make it easier
for chemists, biologists, and data scientists to collaborate on drug discovery projects.

15 To establish a framework for continuous model improvement

Achievement: The project incorporates mechanisms for updating the model with
new data, enabling continuous learning and improvement over time.

16 To reduce off-target effects through predictive modeling

Achievement: By predicting bioactivity with high specificity, the model helps
minimize the selection of compounds with potential off-target interactions,
improving the safety profile of drug candidates.

17 To integrate cheminformatics and machine learning seamlessly

Achievement: The project bridges the gap between cheminformatics and machine
learning, offering a cohesive framework for data processing, feature generation, and
prediction.

18 To demonstrate the potential of AI in accelerating early-stage drug discovery

Achievement: The results showcase how AI-driven tools can complement traditional
methods, accelerating the early phases of the drug discovery pipeline.

19 To contribute to knowledge sharing in computational drug discovery

Achievement: The insights and methodologies developed in this project are
documented comprehensively, providing a foundation for future research and
innovation.

20 To promote green chemistry by reducing experimental waste

Achievement: The virtual screening approach significantly reduces the need for
physical experiments, contributing to more sustainable and eco-friendly drug
discovery practices.

7.1.7 Challenges Encountered

During the development of the drug discovery application for predicting pIC50 values
based on molecular descriptors derived from SMILES strings, several challenges, both
major and minor, were encountered.

Table 7.3: Challenges Encountered

1. Challenge: Selecting Relevant Molecular Descriptors
   Description: Determining which molecular descriptors to include in the dataset
   posed a challenge. Including too few descriptors could limit model performance,
   while too many could lead to overfitting.
   Steps Taken to Address: Conducted feature selection using statistical methods
   (e.g., mutual information, correlation analysis) and domain knowledge to identify
   the most relevant descriptors.

2. Challenge: Choosing the Appropriate Machine Learning Model
   Description: Identifying the best model for predicting pIC50 values was complex,
   as different algorithms performed differently on the dataset.
   Steps Taken to Address: Evaluated multiple algorithms, including Random Forest,
   XGBoost, and Support Vector Machines. Random Forest was selected for its
   robustness and ability to handle diverse datasets effectively.

3. Challenge: Handling Data Imbalance
   Description: The dataset contained an imbalance in the distribution of active and
   inactive compounds, which could bias the model.
   Steps Taken to Address: Applied techniques like Synthetic Minority Oversampling
   Technique (SMOTE) and class weighting to address the imbalance and improve
   prediction for underrepresented classes.

4. Challenge: Preprocessing Input SMILES Strings
   Description: SMILES strings needed standardization to ensure uniformity and
   compatibility with descriptor generation tools.
   Steps Taken to Address: Implemented preprocessing steps such as salt removal,
   normalization of nitro groups, and checking for valid chemical structures using
   cheminformatics libraries.

5. Challenge: Handling High-Dimensional Data
   Description: The molecular descriptor dataset was high-dimensional, increasing the
   risk of overfitting and computational challenges.
   Steps Taken to Address: Dimensionality reduction techniques, such as Principal
   Component Analysis (PCA) and feature selection algorithms, were employed to retain
   only the most significant features.

6. Challenge: Model Overfitting
   Description: Overfitting was a concern due to the large number of descriptors and
   relatively small dataset size.
   Steps Taken to Address: Utilized regularization techniques, such as L1 (Lasso) and
   L2 (Ridge), along with cross-validation to improve generalizability.

7. Challenge: Interpreting the Contribution of Descriptors
   Description: Understanding the influence of specific descriptors on model
   predictions was crucial but challenging.
   Steps Taken to Address: Used model interpretation tools, such as SHAP (SHapley
   Additive exPlanations), to evaluate the contribution of each feature to
   predictions.

8. Challenge: Data Quality and Missing Values
   Description: Missing or inconsistent values in molecular descriptors impacted
   model accuracy.
   Steps Taken to Address: Imputation techniques were applied to handle missing
   values, ensuring a complete and clean dataset before training.

9. Challenge: Integration of External Tools for Descriptor Generation
   Description: Efficient integration of tools like PaDEL-Descriptor with the
   predictive pipeline was complex.
   Steps Taken to Address: Developed automated scripts to preprocess input SMILES and
   generate descriptors seamlessly, minimizing manual intervention.

10. Challenge: Validation of Model Predictions
    Description: Ensuring the reliability and accuracy of the model’s predictions
    required rigorous validation.
    Steps Taken to Address: Conducted evaluation using metrics such as RMSE, MAE, and
    R-squared, and validated predictions against experimental data when available.
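
The regression metrics named in the last row of Table 7.3 have standard definitions. For observed values y_i and predictions p_i, MAE is the mean of |y_i - p_i|, RMSE is the square root of the mean of (y_i - p_i)^2, and R-squared is 1 - SS_res / SS_tot, where SS_res is the sum of squared errors and SS_tot is the sum of squared deviations of the observed values from their mean. A minimal sketch of their computation follows; the function name and the sample values are hypothetical and are not results from this project.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return (rmse, mae, r2) for paired observed and predicted values."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    ss_res = sum(e * e for e in errors)
    rmse = math.sqrt(ss_res / n)
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    # R-squared is undefined when all observed values are identical.
    r2 = 1.0 - ss_res / ss_tot if ss_tot > 0 else float("nan")
    return rmse, mae, r2


# Hypothetical experimental vs. predicted pIC50 values, for illustration only.
observed = [5.0, 6.0, 7.0, 8.0]
predicted = [5.5, 6.0, 6.5, 8.0]
rmse, mae, r2 = regression_metrics(observed, predicted)
print(f"RMSE={rmse:.3f}  MAE={mae:.3f}  R2={r2:.3f}")
```

RMSE penalizes large errors more heavily than MAE, so reporting both gives a sense of whether the error is spread evenly or dominated by a few poorly predicted compounds.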

Chapter 8

Conclusion and Future Work

This project highlights the critical role of advanced computational tools and machine
learning in drug discovery, specifically in predicting pIC50 values based on molecular
descriptors derived from SMILES strings. By employing data-driven methodologies,
researchers can efficiently identify potential drug candidates, significantly reducing the
time and cost associated with traditional experimental methods. The integration of
cheminformatics with predictive modeling provides actionable insights into molecular
properties, enabling targeted optimization of compounds. This approach not only
accelerates the drug discovery pipeline but also enhances the precision of candidate
selection, contributing to more efficient and sustainable pharmaceutical research
practices. Overall, the project demonstrates the potential of AI-driven strategies in
addressing complex challenges in drug discovery, offering both scientific and economic
benefits.
The predictive framework developed in this study has proven to be a valuable tool
for prioritizing compounds with high bioactivity. Its success demonstrates the
importance of combining computational innovation with domain expertise in tackling
the multifaceted challenges of drug development. The project’s outcomes also
underscore the value of integrating cheminformatics techniques with modern machine
learning approaches to create robust and interpretable models. This approach has the
potential to complement experimental efforts, paving the way for breakthroughs in
pharmaceutical research and development.
To further enhance the accuracy and applicability of the predictive model, future
work will focus on addressing current limitations and expanding the project’s scope,
with a more robust dataset, refined feature selection, and more advanced algorithms.
The proposed future work includes:

• Dataset Expansion: Collecting a larger and more diverse dataset, including a
wider range of molecular descriptors and experimental pIC50 values, to improve
model robustness and generalizability. Expanding the dataset to include compounds
from diverse chemical and biological spaces will help in creating a globally relevant
predictive framework.

• Feature Engineering: Exploring novel molecular descriptors and applying
advanced feature selection techniques to identify the most predictive features for
pIC50 estimation. Combining traditional descriptors with deep molecular
representations, such as those derived from graph neural networks, may enhance
model capabilities.

• Algorithm Optimization: Leveraging state-of-the-art machine learning
techniques, such as deep learning and ensemble methods, to enhance model
accuracy and scalability. Developing hybrid models that combine multiple
learning paradigms could further improve the robustness of predictions.

• Model Validation and Benchmarking: Validating the model against external
datasets and benchmarking it with other state-of-the-art prediction tools to
ensure its reliability and relevance. This will involve incorporating cross-validation
strategies and rigorous statistical evaluations to strengthen model credibility.

• Integration of Domain Expertise: Collaborating with chemists and
pharmacologists to refine the interpretation of results and ensure alignment with
practical drug development needs. These collaborations will also help
contextualize predictions within real-world pharmaceutical applications.

• Application Development: Building a user-friendly software tool that
integrates predictive modeling and cheminformatics capabilities, enabling
researchers to efficiently predict pIC50 values and explore molecular optimization
strategies. This tool could also support high-throughput screening by processing
large datasets simultaneously.

• Incorporation of ADMET Properties: Expanding the model to include
predictions for Absorption, Distribution, Metabolism, Excretion, and Toxicity
(ADMET) properties, offering a more comprehensive evaluation of drug
candidates. Including ADMET predictions will make the tool more relevant to
preclinical stages of drug development.

• Real-Time Data Integration: Developing the ability to incorporate real-time
updates from public databases, such as PubChem and ChEMBL, ensuring that the
model evolves with the latest available data.

• Integration with Virtual Screening Pipelines: Incorporating the model into
virtual screening workflows, enabling researchers to quickly evaluate large
chemical libraries and identify high-potential candidates for synthesis and
experimental validation.

• Exploration of Multi-Task Learning Approaches: Investigating multi-task
learning methods to simultaneously predict pIC50 values and other related
molecular properties, enhancing the utility of the predictive framework in broader
contexts.

• Inclusion of Structural Biology Insights: Leveraging structural biology data,
such as protein-ligand interactions, to further refine the model and provide deeper
insights into the mechanisms underlying compound potency.

By addressing these areas, the project aims to advance the field of computational
drug discovery, making predictive modeling tools more accurate, reliable, and accessible
to researchers worldwide. This will contribute to the discovery of safer and more effective
therapeutic agents, ultimately benefiting global healthcare. Furthermore, the insights
and methodologies developed through this project could inspire similar initiatives in
other domains, such as environmental chemistry, agrochemicals, and materials science,
showcasing the broad applicability of computational approaches in scientific research.

