Bioinformatics Complete All 5 Units Notes
BASICS OF
BIOINFORMATICS-II
Edition-2020
Edited by
RITESH JAISWAL
ASTHA DUBEY
This study material will be helpful for all the students who are enrolled in the B.Tech
(Biotechnology) programme, who are in the even semester and are going to give their exams soon.
This study material is based entirely on the AKTU CURRICULUM; my student and I have
tried to cover all the topics of the syllabus, and the notes are arranged for the students in
simple language to promote their understanding. I am very thankful to my student,
who is pursuing her B.Tech, for having helped me a lot.
Ritesh Jaiswal
CONTENT
Unit I
Unit II
Basics of RNA
Features of RNA Secondary Structure
RNA structure prediction methods:
a. Based on self-complementary regions in RNA sequence
b. Minimum free energy methods
Suboptimal structure prediction by MFOLD
Prediction based on finding the most probable structure and Sequence covariance method.
Application of RNA structure modeling.
Unit III
Machine learning
Decision tree induction
Artificial Neural Networks
Hidden Markov Models
Genetic Algorithms
Simulated Annealing
Support vector machines
The relation between statistics and machine learning
Evaluation of prediction methods: Parametric and Non-parametric tests
Cross-validation and empirical significance testing (empirical cycle)
Clustering (Hierarchical and K-mean).
Unit IV
Unit V
b). Isolation of RNA-
Nucleic acid molecules are separated by applying an electric field to move the negatively
charged molecules through a matrix of agarose or other substances. Shorter molecules move
faster and migrate farther than longer ones because shorter molecules migrate more easily
through the pores of the gel. This phenomenon is called sieving.
3. Blotting techniques-:
A blot, in molecular biology and genetics, is a method of
transferring proteins, DNA or RNA onto a carrier (for example, a nitrocellulose, polyvinylidene
fluoride or nylon membrane). In many instances, this is done after a gel electrophoresis,
transferring the molecules from the gel onto the blotting membrane, and other times adding the
samples directly onto the membrane. After the blotting, the transferred proteins, DNA or RNA
are then visualized by colorant staining (for example, silver staining of proteins),
autoradiographic visualization of radiolabelled molecules (the labelling is performed before the blot),
or specific labelling of some proteins or nucleic acids. The latter is done
with antibodies or hybridization probes that bind only to some molecules of the blot and have
an enzyme joined to them. After proper washing, this enzymatic activity (and so the molecules
we are searching for in the blot) is visualized by incubation with a proper reagent, rendering either a colored
deposit on the blot or a chemiluminescent reaction which is recorded on photographic film.
Homology modeling-:
Steps of genome sequence annotation-
ORF identification-:
- DNA is the genetic material that contains all the genetic information in a living organism.
The information is stored as a genetic code using the bases A, T, G and C. During the transcription process
DNA is transcribed to mRNA.
- Each of these bases bonds with a sugar and a phosphate molecule to form a
nucleotide. A triplet of nucleotides that codes for a particular amino acid during translation is a
CODON.
- The region of a nucleotide sequence that starts with an initiation codon and ends with a stop codon is
called an ORF (open reading frame).
- Proteins are formed from ORFs; by analyzing an ORF we can predict the possible amino
acids that might be produced during translation.
- ORF Finder is a program available on the NCBI website; it identifies all ORFs, or
protein-coding regions, in 6 different reading frames.
HOW TO FIND ORF?
- Consider a hypothetical sequence.
Ex- CAT, GGA, GTA, TCG, CAG, GGT, CAA. (First reading frame).
- The second reading frame is formed by leaving out the first nucleotide and then grouping the
sequence into words of 3 nucleotides.
C ATGGAGTATCGCAGGGTC AA (Second reading frame).
- The third reading frame is formed after leaving the first 2 nucleotides and then grouping
the sequences into words of 3 nucleotides.
CA TGGAGTATCGCAGGGTCA A (Third reading frame).
- Others reading frame will be obtained by just finding the reverse complementary
sequence of the above reading frames.
TTGACCCTGCGATACTCCATG (fourth)
T TGACCCTGCGATACTCCA TG (fifth)
TT GACCCTGCGATACTCCAT G ( sixth)
- In all the reading frames formed, we have to replace thymine with uracil (T with U), and then
mark the start and stop codons in each reading frame.
CAU GGA GUA UCG CAG GGU CAA (first)
C | AUG GAG UAU CGC AGG GUC | AA (second; begins with the start codon AUG)
CA | UGG AGU AUC GCA GGG UCA | A (third)
UUG ACC CUG CGA UAC UCC AUG (fourth)
U | UGA CCC UGC GAU ACU CCA | UG (fifth; begins with the stop codon UGA)
UU | GAC CCU GCG AUA CUC CAU | G (sixth)
Start codons – AUG
Stop codons – UAA, UGA, UAG.
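A minimal sketch of this six-frame ORF search in Python; the helper functions and the simple AUG-to-stop rule below follow the description in these notes and are illustrative, not a standard library API:

```python
# Minimal sketch: six-frame ORF search on the example sequence above.
# Assumes the simple rule used in these notes: an ORF runs from AUG to the
# first in-frame stop codon (UAA, UGA, UAG).

COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}
STOPS = {"UAA", "UGA", "UAG"}

def reverse_complement(dna):
    return "".join(COMPLEMENT[b] for b in reversed(dna))

def codons(rna, frame):
    """Split an RNA string into codons for a given frame (0, 1 or 2)."""
    return [rna[i:i + 3] for i in range(frame, len(rna) - 2, 3)]

def find_orfs(dna):
    orfs = []
    for strand in (dna, reverse_complement(dna)):
        rna = strand.replace("T", "U")          # transcribe T -> U
        for frame in range(3):
            cds = codons(rna, frame)
            i = 0
            while i < len(cds):
                if cds[i] == "AUG":             # start codon found
                    orf = []
                    for c in cds[i:]:
                        orf.append(c)
                        if c in STOPS:          # stop codon ends the ORF
                            break
                    orfs.append("".join(orf))
                    i += len(orf)
                else:
                    i += 1
    return orfs

# The hypothetical sequence from the notes; its second frame starts with AUG.
print(find_orfs("CATGGAGTATCGCAGGGTCAA"))
```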
Microarray technique
Microarray analysis techniques are used in interpreting the data generated from experiments
on DNA (Gene chip analysis), RNA, and protein microarrays, which allow researchers to
investigate the expression state of a large number of genes - in many cases, an organism's
entire genome - in a single experiment. Such experiments can generate very large amounts of
data, allowing researchers to assess the overall state of a cell or organism. Data in such large
quantities is difficult - if not impossible - to analyze without the help of computer programs.
Protein function prediction
- Protein function prediction refers to the methods and techniques that bioinformatics researchers use
to assign biochemical roles to proteins. These proteins are usually ones that are poorly
studied or are predicted only from genomic sequence data. The predictions are often driven
by data-intensive computational procedures.
- Information may come from nucleic acid sequence homology, gene expression profiles,
protein domain structure, mining of publications, phylogenetic profiles and protein-protein
interactions.
Methods-
1- Homology based method-
- Proteins of similar sequence are usually homologous and thus have similar functions.
- Hence proteins in a newly sequenced genome are routinely annotated using the sequences of
similar proteins in related genomes.
- However, closely related proteins do not always share the same function.
For example, the yeast GAL1 and GAL3 paralogs (73% identical and 92% similar) have evolved
very different functions, with GAL1 being a galactokinase and GAL3 being a
transcriptional inducer.
Next generation sequencing (NGS)
- NGS, massively parallel sequencing and deep sequencing are related terms that describe a DNA
sequencing technology which has revolutionised genomic research.
- Using NGS, an entire human genome can be sequenced within a single day. In contrast,
the previous Sanger sequencing technology took over a decade to sequence the human
genome.
- There are a number of different NGS platforms. All platforms perform sequencing of millions
of small fragments of input DNA in parallel.
Potential uses of NGS in clinical practice:-
Currently, pilot projects are underway using NGS of cancer genomes in clinical practice,
mainly aiming to identify mutations in tumours that can be targeted by mutation-specific
drugs.
Limitation:-
The main disadvantage of NGS in the clinical setting is putting in place the required
infrastructure, such as computing capacity and storage, and also the personnel expertise
required to comprehensively analyse and interpret the resulting data.
NGS has huge potential but is presently used primarily for research.
INTRODUCTION
Secondary structure prediction is a step towards obtaining the three-dimensional structure of a protein,
and from that three-dimensional structure we can infer the function of the specific protein.
Many tools have been devised for secondary structure prediction.
PURPOSE OF SECONDARY STRUCTURE PREDICTION /OBJECTIVES
SECONDARY STRUCTURE ELEMENTS
ALPHA HELIX
Most abundant secondary structure.
It has 3.6 amino acids per turn.
Average length is about 10 amino acids; it varies from 5 to 40.
Inner-facing side chains are hydrophobic.
Every third or fourth amino acid is hydrophobic.
BETA SHEET
Propensity Value
Many online and offline server tools are available to predict the secondary structure by the
Chou-Fasman method.
ALPHA HELIX MAKERS:
Alanine
Glutamine
Leucine
Methionine
ALPHA HELIX BREAKERS:
Proline
Glycine
BETA SHEET MAKERS:
Isoleucine
Valine
Tyrosine
BETA SHEET BREAKERS:
Proline
Asparagine
Glutamine
PROPENSITY VALUE
The tendency of an amino acid to occur preferentially in an alpha helix or a beta sheet.
Propensity value for Alpha helix = (Frequency of the amino acid in Alpha helices) / (Frequency of all residues that are in Alpha helices)
If the Alpha helix is made up of 20 amino acids and the amino acid is present 5 times,
Frequency of the amino acid = 5/20 = 0.25
If the protein has 100 residues in total but only 20 make up the Alpha helix, Frequency of
residues to be in the helix = 20/100 = 0.2
So the propensity value for this amino acid = 0.25 / 0.2 = 1.25 (a value above 1 marks a maker).
This is applicable for beta sheets also.
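A tiny worked sketch of the propensity calculation, using the hypothetical counts from the example above:

```python
# Propensity of an amino acid for the alpha helix, using the example numbers above.
count_in_helix = 5        # times the amino acid occurs inside alpha helices
helix_length = 20         # residues that make up the alpha helix
helix_residues = 20       # residues of the protein that are in helices
total_residues = 100      # residues in the whole protein

freq_aa_in_helix = count_in_helix / helix_length          # 5/20   = 0.25
freq_residues_in_helix = helix_residues / total_residues  # 20/100 = 0.20

propensity = freq_aa_in_helix / freq_residues_in_helix    # 0.25/0.20 = 1.25
print(propensity)   # a value above 1 means the amino acid favours the helix
```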
Rule for Alpha helix formation:
If four contiguous amino acids have an average helix propensity value below 1.00, the
elongation of the helix is terminated.
A stretch of 9 amino acids with more helix makers and fewer breakers forms an alpha helix.
Negatively charged amino acids at the N-terminus and positively charged amino acids at the C-terminus
favour formation of the ALPHA HELIX.
If the average propensity value for the Alpha helix is greater than or equal to 1.03, the region forms the ALPHA
HELIX.
Alpha helix makers should be greater than Alpha helix breakers.
The propensity value of Alpha helix should be greater than the propensity value of Beta
sheet.
GOR method assumes that amino acids up to 8 residues on each side influence the
secondary structure of the central residue.
The program is now in its fourth version.
The accuracy of GOR when checked against a set of 267 proteins of known structure is
64%.
This implies that 64% of the amino acids were correctly predicted as being helix, sheet or
coil.
The algorithm uses a sliding window of 17 amino acids.
It is based on the hypothesis that short homologous sequences of amino acids have the
same secondary structure tendencies.
A list of short sequences is made by sliding a window of length n along a set of
approximately 100-400 training sequences of known structure but minimal sequence
similarity.
(4)HIDDEN MARKOV METHOD
(5)NEURAL NETWORKS
Most effective structure prediction tool for pattern recognition and classification.
The protein sequence is translated into patterns by shifting a window of n adjacent
residues (n= 13-21) through the protein.
Methods to predict:
Hierarchical Neural Network
nnPredict
PSA
PSIPRED
GenTHREADER
MEMSAT
PSI BLAST Version 2.0
SOPMA correctly predicts 69.5% of amino acids for a three state description of the
secondary structure in a whole database containing 126 chains of nonhomologous
proteins.
Joint prediction with SOPMA and PHD correctly predicts 82.2% of residues for 74% of
co-predicted amino acids.
(2) TERTIARY STRUCTURE PREDICTION
Introduction
Homology modelling
As the name suggests, homology modeling predicts protein structures based on sequence
homology with known structures.
It is also known as comparative modeling.
The principle behind it is that if two proteins share a high enough sequence similarity,
they are likely to have very similar three-dimensional structures.
If one of the protein sequences has a known structure, then the structure can be copied to
the unknown protein with a high degree of confidence.
The overall homology modeling procedure consists of six major steps and one additional
step.
1. Template Selection :-
The template selection involves searching the Protein Data Bank (PDB) for homologous
proteins with determined structures.
The search can be performed using a heuristic pairwise alignment search program such as
BLAST or FASTA.
However, dynamic programming based search programs such as SSEARCH or ScanPS can
produce more sensitive search results.
Homology models are classified into three zones in terms of their accuracy and reliability.
Midnight Zone: Less than 20% sequence identity. The structure cannot reliably be used
as a template.
Twilight Zone: 20% - 40% sequence identity.
Sequence identity may imply structural identity.
Safe Zone: 40% or more sequence identity. It is very likely that sequence identity implies
structural identity
Often, multiple homologous sequences may be found in the database. Then the sequence
with the highest homology must be used as the template.
2. Sequence Alignment:
Once the structure with the highest sequence similarity is identified as a template, the
full-length sequences of the template and target proteins need to be realigned using
refined alignment algorithms to obtain optimal alignment.
Incorrect alignment at this stage leads to incorrect designation of homologous residues
and therefore to incorrect structural models.
Therefore, the best possible multiple alignment algorithms, such as Praline and T-Coffee
should be used for this purpose.
Once optimal alignment is achieved, the coordinates of the corresponding residues of the
template proteins can be simply copied onto the target protein.
If the two aligned residues are identical, coordinates of the side chain atoms are copied
along with the main chain atoms.
If the two residues differ, only the backbone atoms can be copied.
4. Loop Modeling :
In the sequence alignment for modeling, there are often regions caused by insertions and
deletions producing gaps in sequence alignment.
The gaps cannot be directly modeled, creating “holes” in the model.
Closing the gaps requires loop modeling which is a very difficult problem in homology
modeling and is also a major source of error.
Currently, there are two main techniques used to approach the problem: the database
searching method and the ab initio method.
The database method involves finding “spare parts” from known protein structures in a
database that fit onto the two stem regions of the target protein.
The stems are defined as the main chain atoms that precede and follow the loop to be
modeled.
The best loop can be selected based on sequence similarity as well as minimal steric
clashes with the neighboring parts of the structure.
The conformation of the best matching fragments is then copied onto the anchoring
points of the stems.
The ab initio method generates many random loops and searches for the one that does not
clash with nearby side chains and also has reasonably low energy and φ and ψ angles in
the allowable regions in the Ramachandran plot.
Schematic of loop modeling by fitting a loop structure onto the endpoints of existing stem
structures represented by cylinders.
FREAD is a web server that models loops using the database approach.
PETRA is a web server that uses the ab initio method to model loops.
CODA is a web server that uses a consensus method based on the prediction results from
FREAD and PETRA.
5. Side Chain Modeling :
Once the main chain atoms are built, the positions of the side chains that are not modeled must
be determined.
A side chain can be built by searching every possible conformation at every torsion angle
of the side chain to select the one that has the lowest interaction energy with neighboring
atoms.
Most current side chain prediction programs use the concept of rotamers, which are
favored side chain torsion angles extracted from known protein crystal structures.
A collection of preferred side chain conformations is a rotamer library in which the
rotamers are ranked by their frequency of occurrence.
In prediction of side chain conformation, only the possible rotamers with the lowest
interaction energy with nearby atoms are selected.
A specialized side chain modeling program that has reasonably good performance is
SCWRL, which is a UNIX program.
6. Model Refinement :
In these loop modeling and side chain modeling steps, potential energy calculations are
applied to improve the model.
Modeling often produces unfavorable bond lengths, bond angles, torsion angles and
contacts.
Therefore, it is important to minimize energy to regularize local bond and angle geometry
and to relax close contacts and geometric strains.
The goal of energy minimization is to relieve steric collisions and strains without
significantly altering the overall structure.
However, energy minimization has to be used with caution because excessive energy
minimization often moves residues away from their correct positions.
GROMOS is a UNIX program for molecular dynamic simulation. It is capable of
performing energy minimization and thermodynamic simulation of proteins, nucleic
acids, and other biological macromolecules.
The simulation can be done in vacuum or in solvents.
A lightweight version of GROMOS has been incorporated in SwissPDB Viewer.
7. Model Evaluation:
The final homology model has to be evaluated to make sure that the structural features of
the model are consistent with the physicochemical rules.
This involves checking anomalies in φ–ψ angles, bond lengths, close contacts, and so on.
If structural irregularities are found, the region is considered to have errors and has to be
further refined.
Procheck is a UNIX program that is able to check general physicochemical parameters
such as φ–ψ angles, chirality, bond lengths, bond angles, and so on.
WHAT IF is a comprehensive protein analysis server that has many functions, including
checking of planarity, collisions with symmetry axes, proline puckering, anomalous bond
angles, and bond lengths.
Few other programs for this step are ANOLEA, Verify3D, ERRAT, WHATCHECK,
SOV etc.
Threading/Fold recognition
The comparison emphasizes matching of secondary structures, which are most
evolutionarily conserved.
The algorithms can be classified into two categories, pairwise energy based and profile
based.
In the pairwise energy based method, a protein sequence is searched for in a structural
fold database to find the best matching structural fold using energy-based criteria.
The detailed procedure involves aligning the query sequence with each structural fold in
a fold library.
The alignment is performed essentially at the sequence profile level using dynamic
programming or heuristic approaches.
Local alignment is often adjusted to get lower energy and thus better fitting.
The next step is to build a crude model for the target sequence by replacing aligned
residues in the template structure with the corresponding residues in the query.
The third step is to calculate the energy terms of the raw model, which include pairwise
residue interaction energy, solvation energy, and hydrophobic energy.
Finally, the models are ranked based on the energy terms to find the lowest energy fold
that corresponds to the structurally most compatible fold.
Profile Method
Ab initio method
When no suitable structure templates can be found, Ab Initio methods can be used to
predict the protein structure from the sequence information only.
As the name suggests, the ab initio prediction method attempts to produce all-atom protein
models based on sequence information alone, without the aid of known protein structures.
Protein folding is modeled based on global free-energy minimization.
Since the protein folding problem has not yet been solved, the ab initio prediction
methods are still experimental and can be quite unreliable.
One of the top ab initio prediction methods is called Rosetta, which was found to be able
to successfully predict 61% of structures (80 of 131) within 6.0 Å RMSD (Bonneau et al.,
2002).
UNIT-2
BASIC CONCEPTS OF RNA SECONDARY STRUCTURE PREDICTION
THE STRUCTURE OF RNA-
1- In RNA nucleotides the sugar used is ribose, and therefore they are also called ribonucleotides.
2- The purine bases are adenine and guanine, while the pyrimidines are cytosine and uracil.
3- RNA molecules are much smaller than DNA molecules and they are also linear polymers.
4- Moreover, they do not seem to have a regular 3-D structure and are mostly single stranded.
5- This makes them more flexible than DNA, and they can also act as enzymes.
6- The molecules can nevertheless adopt very stable 3-D structures with unpaired regions which are very flexible.
7- The wobble base pair is an important factor in this flexibility.
8- Besides the Watson-Crick base pairs A:U and G:C, the wobble base pair G:U is one of the most
common base pairs in RNA molecules.
9- In fact, any of the bases can build hydrogen bonds with any other base.
10- Another difference between DNA and RNA is that double-stranded RNA builds A-form helices while
double-stranded DNA builds B-form helices.
11- The major groove of the A-form helix is rather narrow and deep. This is because ribose needs more space
than deoxyribose.
1- The primary structure of a molecule describes only the 1-D sequence of its components.
2- The primary structure of RNA is almost identical to the primary structure of DNA, except that the
components are A, C, G and U instead of A, C, G and T.
3- The secondary structure of a molecule is more complex than the primary structure and can be drawn
in 2-D space.
4- RNA secondary structure is mainly composed of double-stranded RNA regions formed by folding the
single-stranded RNA molecule back on itself.
5- The tertiary structure is the overall 3-D structure of the molecule.
6- It is built on the interactions of the lower-order secondary structure elements.
7- Helices are examples of RNA and DNA tertiary structure.
8-Pseudoknot is a tertiary structure of RNA.
4- Junction or multi-loop: a junction is a loop in which 2 or more double-stranded regions converge to form a
closed structure.
5- Pseudoknots: the pseudoknot is a tertiary structural element of RNA. It is formed by base pairing
between an already existing secondary-structure loop and a free end: nucleotides within a hairpin
loop form base pairs with nucleotides outside the stem. Hence base pairs occur that overlap each other
in their sequence positions.
ASSUMPTIONS IN RNA STRUCTURE PREDICTION
• The most likely structure is similar to the energetically most stable structure.
• The energy associated with any position in the structure is only influenced by local sequence and
structure.
• The structure formed does not produce pseudoknots.
• One method of representing the base pairs of a secondary structure is to draw the structure in a
circle.
• An arc is drawn to represent each base pairing found in the structure.
• Aligning bases according to their ability to pair with each other gives an approach to determining
the optimal structure.
Methods adopted
Nussinov Algorithm
Four ways to get the optimal structure between positions i and j from the optimal substructures:
1. Add the i,j pair onto the best structure found for subsequence i+1, j-1
2. Add unpaired position i onto the best structure for subsequence i+1, j
3. Add unpaired position j onto the best structure for subsequence i, j-1
4. Combine two optimal structures i,k and k+1,j
Nussinov Algorithm (a code sketch follows this list)
• Compares a sequence against itself in an n x n matrix
• Finds the maximum of the scores for the four possible structures at a particular position
• Base pair maximization will not necessarily lead to the most stable structure
• May create structures with many interior loops or hairpins which are energetically unfavorable
• Comparable to aligning sequences with scattered matches – not biologically reasonable
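A minimal sketch of the Nussinov recursion described above, assuming Watson-Crick plus G:U wobble pairs and no minimum loop length (which a real implementation would enforce):

```python
# Nussinov algorithm: fill an n x n matrix N where N[i][j] is the maximum
# number of base pairs for subsequence i..j, using the four cases listed above.

PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}

def nussinov(seq):
    n = len(seq)
    N = [[0] * n for _ in range(n)]
    for length in range(1, n):                 # grow subsequences from short to long
        for i in range(n - length):
            j = i + length
            pair = N[i + 1][j - 1] + 1 if (seq[i], seq[j]) in PAIRS else 0
            bifurcation = max(N[i][k] + N[k + 1][j] for k in range(i, j))
            N[i][j] = max(pair,                # case 1: i pairs with j
                          N[i + 1][j],         # case 2: i unpaired
                          N[i][j - 1],         # case 3: j unpaired
                          bifurcation)         # case 4: combine two substructures
    return N[0][n - 1]

print(nussinov("GGGAAAUCC"))   # maximum number of base pairs for this toy sequence
```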
Energy Minimization
• Thermodynamic Stability
• Estimated using experimental techniques
• Theory : Most Stable is the Most likely
• No pseudoknots due to algorithm limitations
• Uses Dynamic Programming alignment technique
Energy Minimization Drawbacks
M-fold predicts optimal and suboptimal secondary structures for an RNA or DNA molecule using the most
recent energy minimization method.
M-fold is an adaptation of the mfold package (version 2.3) by Zuker and Jaeger that has been
modified to work with the Wisconsin Package. The method uses energy rules developed by Turner
and colleagues to determine optimal and suboptimal secondary structures for an RNA molecule.
Using energy minimization criteria, any predicted optimal secondary structure for an RNA or DNA
molecule depends on the model of folding and the specific folding energies used to calculate the structure.
Covariance Method:-
It describes a general approach to several RNA sequence analysis problems using probabilistic
models that flexibly describe the secondary structure and primary sequence consensus of an RNA
sequence family.
These models are known as covariance models.
A covariance model of tRNA sequences is an extremely sensitive and discriminative tool for
searching for additional tRNAs and tRNA-related sequences in sequence databases.
A model can be built automatically from an existing sequence alignment.
An algorithm is also described for learning a model, and hence a consensus secondary structure, from
initially unaligned example sequences and no prior structural information.
Models trained on unaligned tRNA examples correctly predict tRNA secondary structure and
produce high-quality multiple alignments.
The approach may be applied to any family of small RNA sequences.
A probabilistic model, also known as a covariance model, cleanly describes both the secondary
structure and the primary sequence consensus of an RNA.
Using covariance models, new and general approaches are introduced to several RNA analysis
problems: consensus secondary structure prediction, multiple sequence alignment and database
similarity searching.
A dynamic programming algorithm is also described for efficiently finding the globally optimal
alignment of an RNA sequence to a model, and it is shown how to use the algorithm for database
searching.
These models are constructed automatically from existing RNA sequence alignments, or even from
initially unaligned example sequences, using an iterative training procedure that is essentially an
automated implementation of comparative sequence analysis, together with what is believed to be the
first optimal algorithm for RNA secondary structure prediction based on pairwise covariations in
multiple alignments.
These algorithms were tested using data taken from a trusted alignment of tRNA sequences (12) and
on genomic sequence data from the C. elegans genome sequencing project.
An automatically constructed tRNA covariance model is significantly more sensitive
for database searching than even the best custom-built tRNA search programs.
The method produces tRNA alignments of higher accuracy than other automatic methods, and it
invariably predicts the correct consensus cloverleaf tRNA secondary structure when given unaligned
example tRNA sequences.
• Knowing the shape of a biomolecule is invaluable in drug design and understanding disease mechanisms
• Current physical methods (X-ray, NMR) are too expensive and time-consuming
UNIT 3
Machine learning-:
- Machine learning is a subfield of computer science and artificial intelligence that deals with the construction and
study of systems that can learn from data, rather than follow only explicitly programmed instructions.
- Besides computer science and artificial intelligence, it has strong ties to statistics and optimization, which deliver both
methods and theory to the field.
Examples: applications include spam filtering, optical character recognition, search engines and computer
vision.
- Machine learning, data mining and pattern recognition are sometimes conflated. Machine learning tasks can be of
several forms:
1- Supervised learning
2- Unsupervised learning
3- Reinforcement learning
1- SUPERVISED LEARNING: the computer is presented with example inputs and their desired outputs,
given by a 'teacher', and the goal is to learn a general rule that maps inputs to outputs.
2- UNSUPERVISED LEARNING: no labels are given to the learning algorithm, leaving it on its own to find
structure in its input.
3- REINFORCEMENT LEARNING: a computer program interacts with a dynamic environment in which it must
perform a certain goal, such as driving a vehicle, without a teacher explicitly telling it whether it has come close
to its goal or not.
One approach is to pose a series of questions about the characteristics of the species. The first question we may
ask is whether the species is cold- or warm-blooded. If it is cold-blooded, it is not a mammal. Otherwise it is either a bird or a
mammal. In the latter case we need to ask a follow-up question: do the females of the species give birth to their
young?
Those that give birth are definitely mammals, while those that do not are likely to be non-mammals (with the
exception of egg-laying mammals such as the spiny anteater).
The previous example demonstrates how we can solve a classification problem by asking a series of questions
about the attributes of the test records. Each time we receive an answer, a follow-up question is asked until we
reach a conclusion about the class label of the record. The series of questions and their possible answers can be
organized in the form of a decision tree, which is a hierarchical structure consisting of nodes and directed edges.
Decision tree for the mammal classification problem:
ROOT NODE: Body temperature
  Cold -> Non-mammals (leaf node)
  Warm -> Gives birth? (internal node)
    Yes -> Mammals (leaf node)
    No -> Non-mammals (leaf node)
The tree has three types of nodes
1-ROOT NODE
2-INTERNAL NODE
3-LEAF NODE OR TERMINAL NODE
- ROOT NODE: has no incoming edges and zero or more outgoing edges.
- INTERNAL NODE: each of which has exactly one incoming edge and two or more outgoing edges.
- LEAF NODE: each of which has exactly one incoming edge and no outgoing edges.
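A minimal sketch of the mammal example as a decision tree classifier, assuming scikit-learn is installed; the tiny training set is made up for illustration:

```python
# Decision tree on the vertebrate example: features are body temperature
# (1 = warm-blooded, 0 = cold-blooded) and gives-birth (1 = yes, 0 = no).
from sklearn.tree import DecisionTreeClassifier, export_text

X = [[1, 1],   # warm-blooded, gives birth  -> mammal
     [1, 0],   # warm-blooded, lays eggs    -> non-mammal (e.g. a bird)
     [0, 0],   # cold-blooded, lays eggs    -> non-mammal
     [0, 0]]   # cold-blooded, lays eggs    -> non-mammal
y = ["mammal", "non-mammal", "non-mammal", "non-mammal"]

tree = DecisionTreeClassifier().fit(X, y)
print(export_text(tree, feature_names=["warm_blooded", "gives_birth"]))
print(tree.predict([[1, 1]]))   # a warm-blooded animal that gives birth
```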
“A computing system made up of a number of simple, highly interconnected processing elements, which process
information by their dynamic state response to external inputs.”
- The human brain is composed of about 86 billion nerve cells called neurons. They are connected to thousands of other
cells by axons. Stimuli from the external environment, or inputs from sensory organs, are accepted by dendrites.
These inputs create electric impulses which quickly travel through the neural network.
- A neuron can then send the message to other neurons to handle the issue.
- ANNs are composed of multiple nodes which imitate the biological neurons of the human brain. The neurons are
connected by links and they interact with each other. The nodes can take input data and perform simple
operations on the data. The result of these operations is passed on to other neurons. The output at each node
is called its ACTIVATION or NODE VALUE.
1- FEED FORWARD ANN:
- In this ANN the information flow is unidirectional. A unit sends information to another unit from which it does
not receive any information. There are no feedback loops.
- They are used in pattern generation/recognition/classification.
- They have fixed inputs and outputs.
Working of ANN:
- In the topology diagrams shown, each arrow represents a connection between two neurons and
indicates the pathway for the flow of information. Each connection has a weight, a number that
controls the signal between the two neurons.
- If the network generates a good or desired output, there is no need to adjust the weights. However, if the
network generates a poor or undesired output, or an error, then the system alters the weights in order to improve
subsequent results.
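A minimal sketch of a feed-forward pass and weight adjustment with NumPy; the network size, starting weights, learning rate and training target are arbitrary choices for illustration:

```python
# A tiny feed-forward network: 2 inputs -> 2 hidden units -> 1 output.
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(2, 2))     # weights: input -> hidden
W2 = rng.normal(size=(2, 1))     # weights: hidden -> output

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward(x):
    h = sigmoid(x @ W1)          # hidden activations (node values)
    return h, sigmoid(h @ W2)    # output activation

x = np.array([[0.5, 1.0]])
target = np.array([[1.0]])

for _ in range(1000):            # adjust the weights while the error is large
    h, out = forward(x)
    err = target - out
    d_out = err * out * (1 - out)            # gradient at the output unit
    d_h = (d_out @ W2.T) * h * (1 - h)       # gradient at the hidden units
    W2 += 0.5 * h.T @ d_out                  # weight updates (learning rate 0.5)
    W1 += 0.5 * x.T @ d_h

print(forward(x)[1])             # output is now close to the target
```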
- Humans develop artificial intelligence systems by introducing into them every possible intelligence they
could, for which humans themselves now seem to be threatened.
1- THREAT TO PRIVACY.
2- THREAT TO HUMAN DIGNITY.
3- THREAT TO SAFETY.
1- THREAT TO PRIVACY: An artificial intelligence program that recognizes speech and understands natural
language is theoretically capable of understanding every conversation on email and telephone.
- A Hidden Markov Model (HMM) is a statistical Markov model in which the system being modeled is assumed to be a
Markov process with unobserved (hidden) states.
- The hidden Markov model can be represented as the simplest dynamic Bayesian network.
The mathematics behind the HMM was developed by L. E. Baum and coworkers.
- In a simpler Markov model, the state is directly visible to the observer, and therefore the state transition
probabilities are the only parameters, while in an HMM the state is not directly visible but the output, dependent
on the state, is visible.
- HMMs are especially known for their applications in reinforcement learning and temporal pattern recognition such as
speech, handwriting and gesture recognition, and in bioinformatics.
A- Markov model
Example 1- Talking about the weather:
1- Assume there are 3 weather states (sunny, rainy and foggy).
3- Weather prediction is about what the weather will be tomorrow (based on the observations of past days).
4- The weather at day n is
qn ∈ {sunny, rainy, foggy},
where qn depends on the known weather of the past days (qn-1, qn-2, ...).
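A minimal sketch of the weather example as a first-order Markov chain; the transition probabilities are made-up numbers:

```python
# First-order Markov model of the weather: tomorrow depends only on today.
import random

STATES = ["sunny", "rainy", "foggy"]

# P(tomorrow | today); each row sums to 1. The values are illustrative only.
TRANSITIONS = {
    "sunny": {"sunny": 0.8, "rainy": 0.05, "foggy": 0.15},
    "rainy": {"sunny": 0.2, "rainy": 0.6,  "foggy": 0.2},
    "foggy": {"sunny": 0.2, "rainy": 0.3,  "foggy": 0.5},
}

def next_day(today):
    probs = TRANSITIONS[today]
    return random.choices(STATES, weights=[probs[s] for s in STATES])[0]

weather = "sunny"
for day in range(7):                       # simulate a week of weather
    weather = next_day(weather)
    print(f"day {day + 1}: {weather}")
```

In a hidden Markov model the daily weather would itself be hidden, and only an output that depends on it (for example, whether someone carries an umbrella) would be observed.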
4- Gene prediction.
5- Handwriting recognition.
6- Protein folding.
7- Sequence classification.
8- DNA motif discovery.
9- Transportation forecasting.
10- Machine translation.
11- Single molecule kinetic analysis.
GENETIC ALGORITHM-:
Genetic Algorithms:- Genetic algorithms are examples of evolutionary computing methods and are
optimization-type algorithms. Given a population of potential problem solutions (individuals), evolutionary
computing expands this population with new and potentially better solutions. The basis for
evolutionary computing algorithms is biological evolution, where over time evolution produces the best or
"fittest" individuals.
- In data mining, genetic algorithms may be used for clustering, prediction, and even association rules.
- When using genetic algorithms to solve a problem, the first thing, and perhaps the most difficult task, that must
be determined is how to model the problem as a set of individuals. In the real world, individuals may be
identified by a complete encoding of the DNA structure.
- An individual typically is viewed as an array or tuple of values. Based on the recombination (crossover)
algorithms, the values are usually numeric and may be binary strings.
These individuals are like a DNA encoding, in that the structure of each individual represents an encoding of the
major features needed to model the problem. Each individual in the population is represented as a string of
characters from the given alphabet.
Definition: Given an alphabet A, an individual or chromosome is a string I = I1, I2, ..., In where Ij ∈ A. Each
character in the string, Ij, is called a gene. The values that each character can have are called the alleles. A
population, P, is a set of individuals.
In genetic algorithms, reproduction is defined by precise algorithms that indicate how to combine the given set
of individuals to produce new ones. These are called "crossover algorithms". For example, given two
individuals (parents) from a population, the crossover technique generates new individuals (offspring or
children) by switching subsequences of the strings.
-As in nature, mutations sometimes appear, and these also may be present in genetic algorithms. The
mutation operation randomly changes characters in the offspring and a very small probability of
mutation is set to determine whether a character should change.
- Since genetic algorithms attempt to model nature, only the strong survive. When new individuals are
created, a choice must be made about which individuals will survive. This may be the new individuals,
the old ones, or more likely a combination of the two. It is this part of the genetic algorithm that
determines the best (or fittest) individuals to survive.
- To sum up, Margaret Dunham defines a genetic algorithm (GA) as a computational model consisting of
five parts (a small GA sketch follows this list):
• Starting set of individuals, P.
• Crossover technique.
• Mutation algorithm.
• Fitness function.
• Algorithm that applies the crossover and mutation to P iteratively, using the fitness function to
determine the best individuals in P to keep.
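A minimal sketch of those five parts for a toy problem (maximize the number of 1s in a binary string); the population size, mutation rate and number of generations are arbitrary:

```python
# Toy genetic algorithm: individuals are binary strings, fitness = number of 1s.
import random

LENGTH, POP_SIZE, GENERATIONS, MUTATION_RATE = 20, 30, 50, 0.01

def fitness(ind):                       # fitness function
    return sum(ind)

def crossover(a, b):                    # single-point crossover
    point = random.randint(1, LENGTH - 1)
    return a[:point] + b[point:]

def mutate(ind):                        # flip each gene with small probability
    return [1 - g if random.random() < MUTATION_RATE else g for g in ind]

# Starting set of individuals P
population = [[random.randint(0, 1) for _ in range(LENGTH)] for _ in range(POP_SIZE)]

for _ in range(GENERATIONS):
    # keep the fittest half, then refill the population with mutated offspring
    population.sort(key=fitness, reverse=True)
    survivors = population[: POP_SIZE // 2]
    offspring = [mutate(crossover(*random.sample(survivors, 2)))
                 for _ in range(POP_SIZE - len(survivors))]
    population = survivors + offspring

print(max(fitness(ind) for ind in population))   # close to LENGTH after evolution
```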
Simulated annealing
Example: a travelling salesman must visit a set of cities, e.g. Los Angeles, Miami, Chicago and Houston, and return home.
- The salesman needs to minimize the number of miles he travels, and an itinerary is better if it is shorter.
- There are many feasible itineraries to choose from; we are looking for the best one.
- NOTE: simulated annealing solves this type of problem.
Why annealing?
- Simulated annealing is inspired by a metalworking process called annealing.
- It uses the equation that describes changes in a metal's embodied energy during the annealing process.
NOTE- Simulated annealing is generally considered a good choice for solving optimization problems.
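A minimal sketch of simulated annealing on a small travelling-salesman instance; the city coordinates and the cooling schedule are made up for illustration:

```python
# Simulated annealing for a toy travelling salesman problem.
import math, random

CITIES = {"Los Angeles": (0, 0), "Chicago": (17, 12),
          "Houston": (14, -2), "Miami": (25, -5)}   # made-up coordinates

def tour_length(order):
    return sum(math.dist(CITIES[order[i]], CITIES[order[(i + 1) % len(order)]])
               for i in range(len(order)))

order = list(CITIES)
temperature = 10.0
while temperature > 0.01:
    candidate = order[:]
    i, j = random.sample(range(len(order)), 2)       # swap two cities
    candidate[i], candidate[j] = candidate[j], candidate[i]
    delta = tour_length(candidate) - tour_length(order)
    # accept better tours always, worse tours with a temperature-dependent probability
    if delta < 0 or random.random() < math.exp(-delta / temperature):
        order = candidate
    temperature *= 0.999                              # cooling schedule

print(order, round(tour_length(order), 1))
```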
SUPPORT VECTOR MACHINE ( SVM)-:
- SVMs are based on the concept of decision planes that define decision boundaries. A decision plane is one
that separates a set of objects having different class memberships.
Fig: a schematic example of a decision plane separating red and green objects.
- In this example, objects belong either to class red or class green. The separating line defines the boundary: on
the right side of it all objects are green and to the left of it all objects are red. Any new object
falling to the right is labelled (classified) as green, and vice versa.
- The above is a classical example of a linear classifier, i.e. a classifier that separates a set of objects into
their respective groups (green and red in this case) with a line. Most classification tasks are not so simple,
and often more complex structures are needed in order to make an optimal separation, i.e. correctly
classify new objects on the basis of the examples that are available.
- Compared to the previous schematic figure, it is clear that a full separation of the green and red objects
could require a curve (which is more complex than a line).
- Classification tasks based on drawing separating lines to distinguish between objects of different class
membership are known as hyperplane classifiers.
- SVMs are particularly suited to handle such tasks.
NOTE: An SVM is primarily a classifier method that performs the classification task by constructing
hyperplanes in a multidimensional space that separate cases of different class labels.
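A minimal sketch of a linear SVM on two-dimensional points, assuming scikit-learn is installed; the data are made up to mimic the red/green example:

```python
# Linear SVM separating two classes of 2-D points (cf. the red/green example).
from sklearn.svm import SVC

X = [[1, 2], [2, 3], [2, 1],      # "red" objects on the left
     [6, 5], [7, 7], [8, 6]]      # "green" objects on the right
y = ["red", "red", "red", "green", "green", "green"]

clf = SVC(kernel="linear").fit(X, y)   # a non-linear kernel (e.g. "rbf") would
                                       # handle cases that need a curved boundary
print(clf.predict([[7, 6], [1, 1]]))   # new objects are classified by the hyperplane
```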
Quantifying the evidence: by making sense of it in quantitative form, the researcher can answer empirical
questions, which should be clearly defined and answerable with the evidence collected (usually data).
Many researchers combine qualitative and quantitative forms of analysis to better answer questions
which cannot be studied in laboratory settings, for example in education.
In some fields quantitative research may begin with a research question which is tested through
experimentation. Usually the researcher has a certain theory regarding the topic under investigation. Based on
this theory, statements or hypotheses will be proposed. From these hypotheses, predictions about specific
events are derived. These predictions can then be tested with a suitable experiment. Depending
on the outcome of the experiment, the theory on which the hypotheses and predictions were based will be
supported or not, or may need to be modified and then subjected to further testing.
2. THEORY, ASSUMPTIONS, BACKGROUND LITERATURE, what does the relevant literature in the
field indicate about this problem? To which theory or conceptual framework can I link it? What are the
criticisms of this approach, or how does it constrain the research process? What do I know for certain about
this area? What is the history of this problem that others need to know?
3. VARIABLES AND HYPOTHESES, what will I take as given in the environment? Which are the
independent and which are the dependent variables? Are there control variables? Is the hypothesis specific
enough to be researchable yet still meaningful? How certain am I of the relationship(s) between variables?
4. OPERATIONAL DEFINITIONS AND MEASUREMENT, what is the level of aggregation? What is the
unit of measurement? How will the research variables be measured? What degree of error in the findings is
tolerable? Will other people agree with my choice of measurement operations?
5. RESEARCH DESIGN AND METHODOLOGY, what is my overall strategy for doing this research?
Will this design permit me to answer the research question? What other possible causes of the relationship
between the variables will be controlled for by this design? What are the threats to internal and external
validity?
10. CONCLUSIONS, INTERPRETATIONS, RECOMMENDATIONS, was my initial hypothesis
supported? What if my findings are negative? What are the implications of my findings for the theory base,
for the background assumptions, or relevant literature? What recommendations can I make for public
policies or programs in this area? What suggestions can I make for further research on this topic?
Clustering:-
Clustering can be considered the most important unsupervised learning problem; as with every other problem
of this kind, it deals with finding a structure in a collection of unlabeled data.
A definition of clustering could be "the process of organizing objects into groups whose members are similar
in some way".
A cluster is therefore a collection of objects which are similar to one another and are dissimilar to the
objects belonging to other clusters.
Clustering is the task of dividing the population or data points into a number of groups such that data points
in the same group are more similar to other data points in the same group than to those in other groups. In
simple words, the aim is to segregate groups with similar traits and assign them into clusters.
Let's understand this with an example. Suppose you are the head of a rental store and wish to understand the
preferences of your customers to scale up your business. Is it possible for you to look at the details of each
customer and devise a unique business strategy for each one of them? Definitely not. But what you can do
is to cluster all of your customers into, say, 10 groups based on their purchasing habits and use a separate
strategy for customers in each of these 10 groups. And this is what we call clustering.
Now that we understand what clustering is, let's take a look at the types of clustering.
Types of Clustering
Hard Clustering: In hard clustering, each data point either belongs to a cluster completely or not. For
example, in the above example each customer is put into one group out of the 10 groups.
Soft Clustering: In soft clustering, instead of putting each data point into a separate cluster, a probability or
likelihood of that data point being in those clusters is assigned. For example, in the above scenario each
customer is assigned a probability of being in each of the 10 clusters of the retail store.
Now I will be taking you through two of the most popular clustering algorithms in detail – K Means
clustering and Hierarchical clustering. Let’s begin.
K Means Clustering
K-means is an iterative clustering algorithm that aims to find local optima in each iteration. The algorithm
works in the following steps (a code sketch follows the list):
1. Specify the desired number of clusters K : Let us choose k=2 for these 5 data points in 2-D space.
2. Randomly assign each data point to a cluster : Let’s assign three points in cluster 1 shown using red color
and two points in cluster 2 shown using grey color.
3. Compute cluster centroids : The centroid of data points in the red cluster is shown using red cross and those
in grey cluster using grey cross.
4. Re-assign each point to the closest cluster centroid : Note that only the data point at the bottom needs
re-assignment; although it is currently in the red cluster, it is closer to the centroid of the grey cluster. Thus, we
re-assign that data point to the grey cluster.
5. Re-compute cluster centroids : Now, re-computing the centroids for both the clusters.
6. Repeat steps 4 and 5 until no improvements are possible : Similarly, we repeat the 4th and 5th steps until
we reach convergence, i.e. when there is no further switching of data points between the two clusters for
two successive repeats. This marks the termination of the algorithm if no other stopping criterion is explicitly mentioned.
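A minimal sketch of these steps with NumPy; the five 2-D points and k = 2 mirror the walk-through above, and the coordinates are made up:

```python
# K-means on five 2-D points with k = 2, following the steps listed above.
import numpy as np

points = np.array([[1.0, 1.0], [1.5, 2.0], [3.0, 4.0], [5.0, 7.0], [3.5, 5.0]])
k = 2
labels = np.array([0, 0, 0, 1, 1])   # step 2: three points in one cluster, two in the other

for _ in range(100):                                  # repeat steps 3-5
    centroids = np.array([points[labels == c].mean(axis=0) for c in range(k)])
    distances = np.linalg.norm(points[:, None] - centroids[None, :], axis=2)
    new_labels = distances.argmin(axis=1)             # re-assign to the closest centroid
    if np.array_equal(new_labels, labels):            # step 6: stop when nothing switches
        break
    labels = new_labels

print(labels, centroids)
```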
Hierarchical Clustering
Hierarchical clustering, as the name suggests, is an algorithm that builds a hierarchy of clusters. This
algorithm starts with all the data points assigned to a cluster of their own. Then the two nearest clusters are
merged into the same cluster. In the end, the algorithm terminates when there is only a single cluster left.
The results of hierarchical clustering can be shown using a dendrogram; the dendrogram can be interpreted
by cutting it at a chosen level to obtain the desired number of clusters.
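A minimal sketch of agglomerative clustering with a dendrogram, assuming SciPy and matplotlib are installed; the points are made up:

```python
# Agglomerative (bottom-up) hierarchical clustering with a dendrogram.
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
import matplotlib.pyplot as plt

points = np.array([[1, 1], [1.5, 2], [3, 4], [5, 7], [3.5, 5]])

Z = linkage(points, method="ward")                # merge the two nearest clusters repeatedly
labels = fcluster(Z, t=2, criterion="maxclust")   # cut the tree into 2 clusters
print(labels)

dendrogram(Z)                                     # the merge history drawn as a tree
plt.show()
```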
Applications of Clustering
Clustering has a large no. of applications spread across various domains. Some of the most popular
applications of clustering are:
Recommendation engines
Market segmentation
Social network analysis
Search result grouping
Medical imaging
Image segmentation
Anomaly detection
Evaluation of prediction methods: Parametric and Non-parametric tools:-
Assumptions can greatly simplify the learning process, but can also limit what can be learned.
Algorithms that simplify the function to a known form are called parametric machine learning
algorithms.
NOTE: A learning model that summarizes data with a set of parameters of fixed size (independent of the
number of training examples) is called a parametric model. No matter how much data you throw at a
parametric model, it won't change its mind about how many parameters it needs.
Benefits-
Limitations-
Algorithms that do not make strong assumptions about the form of the mapping function are called
non-parametric learning algorithms.
Examples: support vector machines.
Benefits-
Limitations- Require more data, are slower to train, and carry a higher risk of overfitting.
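A minimal sketch contrasting a parametric model (linear regression, a fixed number of coefficients) with a non-parametric one (k-nearest neighbours, which keeps the training data), assuming scikit-learn is installed; the toy data are made up:

```python
# Parametric vs non-parametric models on the same toy regression data.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor

X = np.arange(10).reshape(-1, 1)
y = 2 * X.ravel() + np.random.default_rng(0).normal(scale=0.5, size=10)

parametric = LinearRegression().fit(X, y)       # summarized by two numbers
print(parametric.coef_, parametric.intercept_)  # slope and intercept only

nonparametric = KNeighborsRegressor(n_neighbors=3).fit(X, y)  # keeps all the data
print(nonparametric.predict([[4.5]]))           # prediction from nearby examples
```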
Statisticians work on much the same type of modeling problems under the names of applied statistics and
statistical learning. Coming from a mathematical background, they have more of a focus on the behavior
of models and explainability of predictions.
The very close relationship between the two approaches to the same problem means that both fields have
a lot to learn from each other. The need for statisticians to consider algorithmic methods was called out in the
classic "two cultures" paper. Machine learning practitioners must also take heed, keep an open mind, and
learn both the terminology and the relevant methods from applied statistics.
Visualization
Statistics
Hypothesis testing
How the model works
In predictive modeling, data is collected for the relevant predictors, a statistical model is formulated,
predictions are made and the model is validated (or revised) as additional data becomes available. The
model may employ a simple linear equation or a complex neural network, mapped out by sophisticated
software.
Here you will learn what a predictive model is, and how, by actively guiding marketing campaigns, it
constitutes a key form of business intelligence. we'll take a look inside to see how a model works-
5. Conclusions
Why Predictive Modelling ?
Nearly every business in competitive markets will eventually need to do predictive modeling to remain
ahead of the curve. Predictive Modeling (also known as Predictive Analytics) is the process of
automatically detecting patterns in data, then using those patterns to foretell some event. Predictive
models are commonly built to predict:
• Linear regression
• Logistic regression
• Neural networks
• K-nearest-neighbors classification
• Decision trees
• Ensembles of trees
• Gradient boosting
Applications of Predictive Modelling
Health Care
Collection Analytics
Cross-sell
Fraud detection
Risk management
Industry Applications
Predictive modelling is used in insurance, banking, marketing, financial services, telecommunications,
retail, travel, healthcare, oil & gas and other industries.
Predictive Models in Retail industry
Campaign Response Model – this model predicts the likelihood that a customer responds to a specific
campaign by purchasing a product solicited in the campaign. The model also predicts the amount of the
purchase given a response.
Regression models
Customer Segmentation
Customer Retention/Loyalty/Churn
Inventory Management
Predictive Models in Telecom industry
Campaign analytics
Churn modeling
Customer segmentation
Fraud analytics
Network optimization
Price optimization
SAS Analytics
R
STATISTICA
MATLAB
Minitab
UNIT-4
A force field (a special case of energy functions or interatomic potentials; not to be confused with force
field in classical physics) refers to the functional form and parameter sets used to calculate the potential
energy of a system of atoms or coarse-grained particles in molecular mechanics and molecular
dynamics simulations. The parameters of the energy functions may be derived from experiments
in physics or chemistry, calculations in quantum mechanics, or both.
A simple example is the bond-stretching term, commonly written as the harmonic potential
E(l) = ½ k (l − l0)², where k is the force constant, l is the bond length and l0 is the value for the bond length
when all other terms in the force field are set to 0. The term l0 is often referred to as the equilibrium bond
length, which may cause confusion. The equilibrium bond length would be the value adopted in a minimum
energy structure with all other terms contributing.
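A small sketch of that bond-stretching term as a function; the force constant and reference length below are illustrative values, not parameters of any specific force field:

```python
# Harmonic bond-stretching energy E(l) = 0.5 * k * (l - l0)**2
def bond_energy(l, k=450.0, l0=1.53):
    """Potential energy of a bond of length l (k and l0 are illustrative)."""
    return 0.5 * k * (l - l0) ** 2

print(bond_energy(1.53))   # 0.0 at the reference length l0
print(bond_energy(1.60))   # energy rises as the bond is stretched
```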
(C) Parametrization
In addition to the functional form of the potentials, force fields define a set of parameters for different types of
atoms, chemical bonds, dihedral angles and so on. The parameter sets are usually empirical. A force field would
include distinct parameters for an oxygen atom in a carbonyl functional group and in a hydroxyl group. The
typical parameter set includes values for atomic mass, van der Waals radius, and partial charge for individual
atoms, and equilibrium values of bond lengths, bond angles, and dihedral angles for pairs, triplets, and
quadruplets of bonded atoms, and values corresponding to the effective spring constant for each potential.
INTRODUCTION TO SIMULATION
Definition :-
A simulation is the imitation of the operation of real-world process or system over time.
The behavior of a system that evolves over time is studied by developing a simulation model.
Goal of modelling
A model can be used to investigate a wide variety of "what if" questions about a real-world system.
Potential changes to the system can be simulated to predict their impact on the system.
It is better to do simulation before Implementation.
Mathematical Methods
It is simple
Simulation enables the study of the internal interactions of a subsystem within a complex system
Informational, organizational and environmental changes can be simulated and their effects examined
Simulation can be used to experiment with new designs and policies before implementation
Simulating different capabilities for a machine can help determine the requirements
Simulation models designed for training make learning possible without the cost and disruption
The modern system (factory, wafer fabrication plant, service organization) is so complex that its
internal interactions can be treated only by simulation.
In contrast to optimization models, simulation models are “run” rather than solved.
Given as a set of inputs and model characteristics the model is run and the simulated behavior is
observed
Advantages :-
New policies, operating procedures, information flows and so on can be explored without disrupting
ongoing operation of the real system.
New hardware designs, physical layouts, transportation systems and … can be tested without
committing resources for their acquisition.
Time can be compressed or expanded to allow for a speed-up or slow-down of the phenomenon (the
simulation clock is self-controlled).
Insight can be obtained about the interaction of variables and about which variables are important to the performance.
Bottleneck analysis can be performed to discover where work in process or information is being delayed in the system.
Disadvantages:-
Vendors of simulation software have been actively developing packages that contain models that only
need input (templates).
Areas :-
Manufacturing Applications
Semiconductor Manufacturing
Military application
Health Care
Automated Material Handling System (AMHS)
Risk analysis
Insurance, portfolio,...
Computer Simulation
CPU, Memory,…
Network simulation
Types of Simulation
Continuous simulation is a simulation based on continuous time, rather than discrete time steps, using numerical
integration of differential equations.
It is notable as one of the first uses ever put to computers, dating back to the Eniac in 1946. Continuous
simulation allows prediction of
rocket trajectories
hydrogen bomb dynamics
electric circuit simulation
robotics
A discrete-event simulation (DES) models the operation of a system as a (discrete) sequence of events in time.
Each event occurs at a particular instant in time and marks a change of state in the system.[1] Between
consecutive events, no change in the system is assumed to occur; thus the simulation time can directly jump to
the occurrence time of the next event, which is called next-event time progression.
Example
A common exercise in learning how to build discrete-event simulations is to model a queue, such as customers
arriving at a bank to be served by a teller. In this example, the system entities are Customer-queue and Tellers.
The system events are Customer-Arrival and Customer-Departure. (The event of Teller-Begins-Service can be
part of the logic of the arrival and departure events.) The system states, which are changed by these events,
are Number-of-Customers-in-the-Queue (an integer from 0 to n) and Teller-Status (busy or idle). The random
variables that need to be characterized to model this system stochastically are Customer-Interarrival-
Time and Teller-Service-Time. An agent-based framework for performance modeling of an optimistic parallel
discrete event simulator is another example for a discrete event simulation
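A minimal sketch of the bank-teller queue as a discrete-event simulation; the exponential arrival and service time parameters are arbitrary choices:

```python
# Discrete-event simulation of a single-teller queue.
# State: number of customers waiting, teller busy/idle.
# Events: customer arrival, customer departure. Time jumps from event to event.
import heapq, random

random.seed(1)
events = [(random.expovariate(1 / 5.0), "arrival")]   # (time, type) priority queue
queue_length, teller_busy, served = 0, False, 0

while events and served < 20:
    time, kind = heapq.heappop(events)                 # next-event time progression
    if kind == "arrival":
        heapq.heappush(events, (time + random.expovariate(1 / 5.0), "arrival"))
        if teller_busy:
            queue_length += 1                          # join the queue
        else:
            teller_busy = True                         # teller begins service
            heapq.heappush(events, (time + random.expovariate(1 / 4.0), "departure"))
    else:                                              # departure event
        served += 1
        if queue_length > 0:
            queue_length -= 1                          # next customer is served
            heapq.heappush(events, (time + random.expovariate(1 / 4.0), "departure"))
        else:
            teller_busy = False

print(f"served {served} customers; {queue_length} still waiting")
```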
Hybrid Simulation (sometime Combined Simulation) corresponds to a mix between Continuous and Discrete
Event Simulation and results in integrating numerically the differential equations between two sequential events
to reduce the number of discontinuities.
A computer model is the algorithms and equations used to capture the behavior of the system being
modeled.
By contrast, computer simulation is the actual running of the program that contains these equations or
algorithms.
In-silico -"performed on computer or via computer simulation"
The phrase was coined in 1989 as an allusion to the Latin phrases in-vivo, in-vitro, and in-situ
ADVANTAGES
Reduction
Refinement
Replacement
Translation – “results of animal experimentation to human”
LIMITATIONS
Biological complexities
Missed experiences
Biological variability
Publication of results
Student attitudes
IN BIOMEDICAL RESEARCH:-
A broad area of science that involves the investigation of the biological process and the causes of disease,
through careful experimentation, observation, laboratory work, analysis, and testing
SOME EXAMPLES OF COMPUTER SIMULATION IN BIOMEDICAL RESEARCH
Kidney function –transport of electrolytes and water in and out of kidney
Cardiac function –enzyme metabolism of cardiac muscle, cardiac pressure flow relationships, etc.
Lung function –respiratory mechanics
Sensory physiology –peripheral auditory system and single auditory nerve fibre transmission of
vibrations
Neurophysiology –impulse propagation along myelinated axons
Developmental biology –shape changes in embryonic cells
IN BEHAVIOURAL RESEARCH
LIMITATIONS
Lack of knowledge of all possible parameters
No satisfactory model exists
Development of computer simulation depends on the use of animals in biomedical research
PROTEIN DESIGN
example is RosettaDesign, a software package under development and free for academic use
RosettaDesign can be used to identify sequences compatible with a given protein backbone
Some of Rosetta design's successes include the design of a novel protein fold, redesign of an existing
protein for greater stability, increased binding affinity between two proteins, and the design of novel enzymes
SENSITIVITY ANALYSIS-
Sensitivity analysis is the assessment of the impact on the output of a system of changes in its inputs.
Example: in the budgeting process there are always variables that are uncertain, such as:
Break-even analysis
If you are unable to estimate a policy’s most likely effects or cannot find comparable studies to help determine
its best-case and worst-case scenarios, you can use break even analysis.
Monte Carlo analysis
You can use Monte Carlo analysis to examine multiple variables simultaneously and simulate thousands of
scenarios, resulting in a range of possible outcomes and the probabilities that they will occur.
ADVANTAGES
Simplicity
Directing Management Efforts
Ease of being Automated
As a quality Check.
DISADVANTAGES
It does not provide clear cut results
Not a solution in standalone form
UNIT5
Document clustering
Document clustering (or text clustering) is the application of cluster analysis to textual documents. It has
applications in automatic document organization, topic extraction and fast information retrieval or filtering.
Document clustering involves the use of descriptors and descriptor extraction. Descriptors are sets of words that
describe the contents within the cluster. Document clustering is generally considered to be a centralized process.
Examples of document clustering include web document clustering for search users.
The application of document clustering can be categorized to two types, online and offline. Online applications
are usually constrained by efficiency problems when compared to offline applications. Text clustering may be
used for different tasks, such as grouping similar documents (news, tweets, etc.) and the analysis of
customer/employee feedback, discovering meaningful implicit subjects across all documents.
In general, there are two common algorithms. The first one is the hierarchical algorithm, which includes
single link, complete linkage, group average and Ward's method. By aggregating or dividing, documents can be
clustered into a hierarchical structure, which is suitable for browsing. However, such an algorithm usually suffers
from efficiency problems. The other approach is based on the K-means algorithm and its variants.
Generally hierarchical algorithms produce more in-depth information for detailed analyses, while algorithms
based around variants of the K-means algorithm are more efficient and provide sufficient information for most
purposes.
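As a rough illustration of the K-means approach (assuming scikit-learn is available; the four toy documents below are invented), documents can be turned into TF-IDF vectors and then clustered; a hierarchical alternative could apply, for example, Ward's method to the same matrix.

# K-means document clustering sketch using TF-IDF features (scikit-learn assumed).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = [
    "gene expression measured by RNA sequencing",
    "RNA secondary structure prediction with minimum free energy",
    "stock market prices fell sharply today",
    "investors worry about market volatility",
]

vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)               # documents -> TF-IDF term matrix

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print("cluster labels:", km.labels_)             # hard assignment: one cluster per document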
These algorithms can further be classified as hard or soft clustering algorithms. Hard clustering computes a hard
assignment – each document is a member of exactly one cluster. The assignment of soft clustering algorithms is
soft – a document’s assignment is a distribution over all clusters. In a soft assignment, a document has fractional
membership in several clusters. Dimensionality reduction methods can be considered a subtype of soft
clustering; for documents, these include latent semantic indexing (truncated singular value decomposition on
term histograms) and topic models.
Other algorithms involve graph based clustering, ontology supported clustering and order sensitive clustering.
Given a clustering, it can be beneficial to automatically derive human-readable labels for the clusters. Various
methods exist for this purpose.
Clustering in search engines
A web search engine often returns thousands of pages in response to a broad query, making it difficult for users
to browse or to identify relevant information. Clustering methods can be used to automatically group the
retrieved documents into a list of meaningful categories.
Procedures
In practice, document clustering often takes the following steps:
1. Tokenization
Tokenization is the process of parsing text data into smaller units (tokens) such as words and phrases.
Commonly used tokenization methods include the bag-of-words model and the N-gram model.
We can then cluster different documents based on the features we have generated. See the algorithm section in
cluster analysis for different types of clustering methods.
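A small plain-Python sketch of the two tokenization models mentioned above (the sentence is invented):

# Tokenization sketch: bag-of-words counts and word bigrams (N-gram with N = 2).
from collections import Counter

text = "document clustering groups similar documents into clusters"
tokens = text.lower().split()              # simple whitespace tokenization

bag_of_words = Counter(tokens)             # bag-of-words model: token -> count
bigrams = list(zip(tokens, tokens[1:]))    # word bigrams

print(bag_of_words)
print(bigrams)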
6. Evaluation and visualization
Finally, the clustering models can be assessed by various metrics, and it is sometimes helpful to visualize the
results by plotting the clusters in a low-dimensional (e.g., two-dimensional) space. See multidimensional scaling
as a possible approach.
Clustering v. Classifying
Clustering algorithms in computational text analysis group a set of documents into what are
called subsets or clusters, where the algorithm's goal is to create internally coherent clusters that are distinct
from one another. Classification, on the other hand, is a form of supervised learning where the features of the
documents are used to predict the "type" of the documents.
Information retrieval
An information retrieval process begins when a user enters a query into the system. Queries are formal
statements of information needs, for example search strings in web search engines. In information retrieval a
query does not uniquely identify a single object in the collection. Instead, several objects may match the query,
perhaps with different degrees of relevancy. An object is an entity that is represented by information in a
content collection or database. User queries are matched against the database information. However, as opposed
to classical SQL queries of a database, in information retrieval the results returned may or may not match the
query, so results are typically ranked. This ranking of results is a key difference of information retrieval
searching compared to database searching.
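A hedged sketch of ranked retrieval (assuming scikit-learn; the documents and query are invented): each document receives a relevance score rather than an exact match, and results are returned in ranked order.

# Ranked retrieval sketch: score documents against a query with TF-IDF and
# cosine similarity, then sort by relevance.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "Entrez is the NCBI text-based search and retrieval system",
    "SRS integrates databases at the EBI",
    "molecular docking predicts ligand binding poses",
]
query = "NCBI sequence retrieval system"

vectorizer = TfidfVectorizer()
doc_matrix = vectorizer.fit_transform(docs)
query_vec = vectorizer.transform([query])

scores = cosine_similarity(query_vec, doc_matrix)[0]   # relevance score per document
for score, doc in sorted(zip(scores, docs), reverse=True):
    print(f"{score:.2f}  {doc}")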
Depending on the application the data objects may be, for example, text documents, images, audio, mind maps
or videos. Often the documents themselves are not kept or stored directly in the IR system, but are instead
represented in the system by document surrogates or metadata.
The two most important information retrieval systems in the case of bioinformatics are as follows:
ENTREZ in NCBI
SRS in EMBL.
Definition of NLP-:
- To explain linguistic theories, to use the theories to build systems that can be of social use
- Started off as a branch of Artificial Intelligence
- Borrows from Linguistics, Psycholinguistics, Cognitive Science & Statistics.
- A hallmark of human intelligence.
- Text is the largest repository of human knowledge and is growing quickly.
- The goal is computer programs that understand text or speech.
History of NLP-:
- In 1950, Alan Turing published the article "Computing Machinery and Intelligence", which proposed what is
now called the Turing test as a criterion of intelligence.
- Some successful natural language systems developed in the 1960s were SHRDLU, a natural language system
working in restricted "blocks worlds" with restricted vocabularies, and ELIZA, an early conversational program
written between 1964 and 1966.
Components of NLP-
Steps of NLP
Morphological and lexical analysis
- The lexicon of a language is its vocabulary, which includes its words and expressions.
- Morphology deals with analyzing, identifying and describing the structure of words.
- Lexical analysis involves dividing a text into paragraphs, sentences and words.
Syntactic analysis
- Syntax concerns the proper ordering of words and its effect on meaning.
- This involves analysis of the words in a sentence to depict the grammatical structure of the sentence.
- The words are transformed into a structure that shows how they are related to each other. Ex – an
ungrammatical string such as "girl the goes to school" would be rejected by the English syntactic analyser,
whereas "the girl goes to the school" would be accepted.
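A minimal sketch of the lexical and syntactic steps using NLTK (assuming NLTK is installed; the tokenizer and tagger models may need to be downloaded first, as noted in the comments):

# Lexical and syntactic analysis sketch with NLTK.
import nltk
# nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")  # one-time downloads

sentence = "The girl goes to the school"
tokens = nltk.word_tokenize(sentence)   # lexical analysis: split into word tokens
tags = nltk.pos_tag(tokens)             # part-of-speech tags feed the syntactic analysis
print(tags)                             # e.g. [('The', 'DT'), ('girl', 'NN'), ('goes', 'VBZ'), ...]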
Semantic analysis
- Semantics concerns the (literal) meaning of words, phrases and sentences.
- This step abstracts the dictionary meaning or the exact meaning from the context.
- The structures created by the syntactic analyzer are assigned meaning. Ex – "colorless blue
idea" would be rejected by the semantic analyzer, as "colorless" and "blue" do not make sense together.
Discourse integration and pragmatic analysis
- Pragmatic analysis concerns the overall communicative and social context and its effect on
interpretation.
- It means abstracting or deriving the purposeful use of the language in situations,
- importantly those aspects of language which require world knowledge.
- The main focus is on reinterpreting what was said in terms of what it actually means. Ex – "close the
window" should be interpreted as a request rather than an order.
Natural language generation
- NLG is the process of constructing natural language outputs from non-linguistic inputs.
- NLG can be viewed as the reverse process of NL understanding.
- An NLG system may have the following main parts:
Discourse planner
What will be generated. Which sentences.
Surface realizer
Realizes a sentence from its internal representation.
Lexical selection
Selecting the correct words describing the concepts.
Techniques and methods
- Machine learning
. The learning procedures used during machine learning automatically focus on the most common cases,
. whereas when rules are written by hand it is often not obvious where the effort should be directed,
. and hand-written rules are prone to human error.
- Statistical inference
. Automatic learning procedures can make use of statistical inference algorithms.
. These are used to produce models that are robust to unfamiliar input, e.g. input containing words
or structures that have not been seen before, by making intelligent guesses (see the sketch below).
- Input database and training data
. Systems based on automatically learning the rules can be made more accurate simply by supplying
more input data to them.
. However, systems based on hand-written rules can only be made more accurate by increasing the
complexity of the rules, which is a much more difficult task.
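A minimal sketch of such statistical robustness (plain Python; the tiny training text and the add-one smoothing scheme are illustrative assumptions, not a description of any particular NLP system):

# Laplace-smoothed unigram model: unseen words still get a non-zero probability,
# which is what makes the model robust to unfamiliar input.
from collections import Counter

training_tokens = "the gene encodes a protein the protein binds dna".split()
counts = Counter(training_tokens)
vocab_size = len(counts) + 1          # +1 slot reserved for unknown words (sketch assumption)
total = sum(counts.values())

def probability(word: str) -> float:
    # Add-one (Laplace) smoothing: (count + 1) / (total + vocabulary size)
    return (counts[word] + 1) / (total + vocab_size)

print(probability("protein"))         # seen word: relatively high probability
print(probability("ribosome"))        # unseen word: small but non-zero probability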
Drug users-:
- If someone is using drugs, you might notice changes in how the person looks or acts.
Here are some of the signs …
- Become moody, negative, cranky or worried all the time; ask to be left alone a lot.
- Have trouble concentrating.
- Lose or gain weight.
- Lose interest in hobbies.
- Change friends.
- Get in fights.
- Depression.
Insilico drug designing-
- Drug designing is the inventive method of finding new medications based on the
knowledge of a biological target.
- Selected or designed molecule should be :
Organic small molecule
Complementary in shape.
Oppositely charged to the biomolecule target.
- This molecule will –
Interact with and bind to the target.
Activate or inhibit the function of a biomolecule such as a protein.
- In Silico is an expression used to mean “performed on computer or via computer
simulation”.
- In Silico drug designing is defined as the identification of the drug target molecule
by employing bioinformatics tools.
- TYPES OF IN-SILICO DRUG DESIGNING:
LIGAND-BASED DRUG DESIGNING:
- It relies on knowledge of other molecules that bind to the biological target of interest.
- These known binders are used to derive a pharmacophore.
STRUCTURE BASED DRUG DESIGNING:
- It relies on the knowledge of the 3-D structure of a biological target obtained
through methods such as-
X-ray crystallography.
NMR spectroscopy.
Homology modelling.
Target selection:
- Biochemical pathways could become abnormal and result in disease.
- Select a target at which to disrupt the biochemical process.
- Targets can be enzymes, receptors and nucleic acids.
Target validation:
- Perform a protein BLAST of the candidate target genes/proteins against Homo sapiens.
- Select the molecule with the least similarity to human proteins and confirm the choice with a further BLAST.
- Once the target molecule is selected, its structure can be obtained from the RCSB PDB.
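A hedged Biopython sketch of the BLAST step above (assuming Biopython is installed and network access to NCBI; the target sequence is a placeholder, not a real target):

# Target-validation sketch: remote BLASTP of a candidate target against human proteins.
from Bio.Blast import NCBIWWW, NCBIXML

target_sequence = "MKT..."   # placeholder protein sequence of the candidate target

handle = NCBIWWW.qblast("blastp", "nr", target_sequence,
                        entrez_query="Homo sapiens[Organism]")
record = NCBIXML.read(handle)

# Fewer or weaker human hits (high E-values) suggest a more selective drug target.
for alignment in record.alignments[:5]:
    best_hsp = alignment.hsps[0]
    print(alignment.title[:60], "E-value:", best_hsp.expect)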
Structure determination:
Crystal structure of target protein can be taken from PDB database.
Lead optimization:
- Refining the 3-D structure of the lead.
- Technique is QSAR.
Preclinical and clinical development:
- Preclinical trial:
. In vitro and in vivo (animal) studies.
. Efficacy and pharmacokinetic information.
- Clinical trial :
. 3 phases
. Safety and efficacy on human beings.
- File NDA:
. Document submitted to FDA for review.
. FDA approval.
Molecular docking
In the field of molecular modeling, docking is a method which predicts the preferred
orientation of one molecule to a second when bound to each other to form a
stable complex. Knowledge of the preferred orientation in turn may be used to predict the
strength of association or binding affinity between two molecules using, for example, scoring
functions.
Figure: Schematic illustration of docking a small-molecule ligand (green) to a protein target (black),
producing a stable complex.
Figure: Docking of a small molecule (green) into the crystal structure of the beta-2 adrenergic G-protein-coupled
receptor.
The associations between biologically relevant molecules such as proteins, peptides, nucleic
acids, carbohydrates, and lipids play a central role in signal transduction. Furthermore, the relative orientation
of the two interacting partners may affect the type of signal produced (e.g., agonism vs antagonism). Therefore,
docking is useful for predicting both the strength and type of signal produced.
Molecular docking is one of the most frequently used methods in structure-based drug design, due to its ability
to predict the binding-conformation of small molecule ligands to the appropriate target binding site.
Characterisation of the binding behaviour plays an important role in rational design of drugs as well as to
elucidate fundamental biochemical processes.
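To make the idea of a scoring function concrete, here is a toy sketch (NumPy assumed; the coordinates and the Lennard-Jones-style term are illustrative only and do not correspond to any real docking program's scoring function):

# Toy "scoring function": score a ligand pose by summing a simple
# Lennard-Jones-like term over all protein-ligand atom pairs.
import numpy as np

protein_atoms = np.array([[0.0, 0.0, 0.0], [1.5, 0.0, 0.0], [0.0, 1.5, 0.0]])
ligand_pose = np.array([[0.8, 0.8, 1.0], [1.2, 1.4, 1.1]])

def score_pose(ligand: np.ndarray, protein: np.ndarray, sigma: float = 1.0) -> float:
    # Pairwise distances between every ligand atom and every protein atom.
    d = np.linalg.norm(ligand[:, None, :] - protein[None, :, :], axis=-1)
    lj = 4.0 * ((sigma / d) ** 12 - (sigma / d) ** 6)   # repulsive + attractive terms
    return float(lj.sum())                              # lower score = better pose

print("pose score:", score_pose(ligand_pose, protein_atoms))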
QSAR approach attempts to identify and quantify the physicochemical properties of a drug and to see whether
any of these properties has an effect on the drug’s biological activity by using a mathematical equation.
PHYSICOCHEMICAL PROPERTIES
• Hydrophobicity of the molecule
• Hydrophobicity of substituents
• Electronic properties of substituents
• Steric properties of substituents
WORKING OF QSAR
A range of compounds is synthesized in order to vary one physicochemical property at a time and to test how it
affects the bioactivity.
A graph is then drawn plotting the biological activity on the y-axis versus the physicochemical property on
the x-axis.
It is necessary to draw the best possible line through the data points on the graph. This is done by a
procedure known as linear regression analysis using the least-squares method.
If we draw a line through a set of data points, the points will be scattered on either side of the line. The best line
is the one closest to the data points.
To measure how close the data points are to the line, vertical distances are drawn from each point to the line.
Figure: plots of log(1/C) against log P.
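A minimal least-squares sketch of this procedure (NumPy assumed; the log P and log(1/C) values are invented for illustration):

# Fit the best straight line through QSAR data: log(1/C) versus log P.
import numpy as np

log_p = np.array([0.5, 1.0, 1.5, 2.0, 2.5, 3.0])        # hydrophobicity (log P)
log_inv_c = np.array([2.1, 2.6, 3.2, 3.5, 4.1, 4.4])    # activity, log(1/C)

slope, intercept = np.polyfit(log_p, log_inv_c, deg=1)  # least-squares best line
predicted = slope * log_p + intercept
residuals = log_inv_c - predicted                        # vertical distances to the line

print(f"log(1/C) = {slope:.2f} * logP + {intercept:.2f}")
print("sum of squared residuals:", float((residuals ** 2).sum()))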
HYDROPHOBICITY
The hydrophobic character of a drug is crucial to how easily it crosses the cell membrane and may also be
important in receptor interactions. The hydrophobicity of a drug is measured experimentally by testing the drug's
relative distribution in an octanol/water mixture.
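In symbols, using the standard definition of the partition coefficient (not specific to these notes): P = [drug in octanol] / [drug in water], and log P = log10(P); the larger the log P value, the more hydrophobic the drug.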
ELECTRONIC EFFECT
The electronic effect of various substituents will clearly have an effect on a drug's ionisation and polarity, and
therefore on how easily the drug can pass through the cell membrane and how strongly it can interact with a
binding site.
STERIC FACTORS
The bulk, size and shape of a drug will influence how easily it can approach and interact with a binding site. A
bulky substituent may act like a shield and hinder the ideal interaction between a drug and its binding site.
Alternatively, a bulky substituent may help to orient a drug properly for maximum binding and increase activity.
The four processes involved when a drug is taken are absorption, distribution, metabolism and elimination or
excretion (ADME).
Pharmacokinetics is the way the body acts on the drug once it is administered. It is the measure of the rate
(kinetics) of absorption, distribution, metabolism and excretion (ADME). All the four processes involve drug
movement across membranes. To be able to cross the membranes it is necessary that the drugs should be
able to dissolve directly into the lipid bilayer of the membrane; hence lipid-soluble drugs cross directly whereas
drugs that are polar do not.
Figure showing the interplay between absorption, distribution, metabolism and excretion (ADME).
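A hedged one-compartment sketch of these kinetics (Python; the dose, bioavailability, volume of distribution and rate constants are invented, illustrative values):

# One-compartment pharmacokinetic sketch: plasma concentration after an oral dose
# with first-order absorption (ka) and first-order elimination (ke) - the classic
# Bateman equation.
import math

dose = 500.0      # mg administered
F = 0.8           # fraction absorbed (bioavailability)
V = 40.0          # volume of distribution, litres
ka = 1.2          # absorption rate constant, 1/h
ke = 0.2          # elimination rate constant, 1/h

def concentration(t: float) -> float:
    return (F * dose * ka) / (V * (ka - ke)) * (math.exp(-ke * t) - math.exp(-ka * t))

for t in (0.5, 1, 2, 4, 8, 12):
    print(f"t = {t:>4} h  C = {concentration(t):.2f} mg/L")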
Absorption
Absorption is the movement of a drug from its site of administration into the blood. Most drugs are absorbed by
passive absorption but some drugs need carrier mediated transport. Small molecules diffuse more rapidly than
large molecules. Lipid-soluble, non-ionized drugs are absorbed faster. Absorption is also affected by blood flow,
pain, stress, etc.
Acidic drugs such as aspirin will be better absorbed in the stomach, whereas basic drugs like morphine will be
absorbed better in the intestine. Most drug absorption nevertheless takes place in the small intestine, since the
surface area of the stomach is much smaller than that of the intestine and the time drugs spend in the stomach is
short. If a basic drug is taken after a meal its activity can be reduced, whereas if an acidic drug is taken after a
meal its action can be noticed much more quickly, owing to gastric absorption.
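This pH effect can be sketched with the standard Henderson-Hasselbalch relationship (Python; the pKa values are approximate and the pH values are rough illustrations of stomach and intestinal conditions):

# Un-ionized fraction of a weak acid or base at a given pH; the un-ionized form
# is what crosses membranes most easily.
def unionized_fraction(pka: float, ph: float, acidic: bool) -> float:
    # Weak acid: ionized/un-ionized = 10**(pH - pKa); weak base: 10**(pKa - pH)
    ratio = 10 ** ((ph - pka) if acidic else (pka - ph))
    return 1.0 / (1.0 + ratio)

print("aspirin (acid, pKa ~3.5) in stomach pH 2   :", unionized_fraction(3.5, 2.0, True))
print("aspirin (acid, pKa ~3.5) in intestine pH 7 :", unionized_fraction(3.5, 7.0, True))
print("morphine (base, pKa ~8) in stomach pH 2    :", unionized_fraction(8.0, 2.0, False))
print("morphine (base, pKa ~8) in intestine pH 7  :", unionized_fraction(8.0, 7.0, False))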
Distribution
Distribution is the movement of drugs throughout the body. It is determined by the blood flow to the tissues, the
ability of the drug to enter the vascular system and the ability of the drug to enter the cell if required.
Drugs in the plasma are partly in solution and partly bound to plasma proteins. The bound drug is inactive and
the unbound drug is active. The ratio of bound to unbound drug varies, and binding is reversible. Generally,
acidic drugs bind to albumin and basic drugs to α1-acid glycoprotein.
Tissue Distribution
After absorption most drugs are distributed in the blood to the body tissue where they have their effect.
The degree to which the drug is likely to accumulate in the tissue is dependent on the lipophilicity and
local blood flow to the tissue. Highly perfused organs receive most of the drugs.
Metabolism or Biotransformation
It is the process of transformation of a drug within the body to make it more hydrophilic so that it can be
excreted from the body by the kidneys. This needs to be done because drugs and chemicals are foreign
substances in our body. If the drug remains in a lipophilic state after being filtered by the
glomerulus, it will be reabsorbed and remain in the body for prolonged periods. Hence metabolism deals
with making the drug more hydrophilic so that it can be excreted from the body. In some cases the
metabolites can be more active than the drug itself, e.g. the anxiolytic benzodiazepines.
Excretion
Excretion is the removal of the substance from the body. Some drugs are either excreted out unchanged or some
are excreted out as metabolites in urine or bile. Drugs may also leave the body by natural routes such as tears,
sweat, breath and saliva. Patients with kidney or liver problems can have elevated levels of drug in the system,
and it may be necessary to adjust the dose of the drug appropriately, since a high concentration in the blood can
lead to drug toxicity.
Pharmacodynamics
Pharmacodynamics describes the action of a drug on the body and the influence of drug concentration on the
magnitude of the response.
An agent that binds to a receptor and produces a biological response is called an agonist.
The interaction between a drug and its receptor can be described by a curve called the dose-response curve.
The magnitude of the drug effect depends on the drug concentration at the receptor site; the availability and
concentration of drug at the receptor are determined both by the dose of drug administered and by the drug's
pharmacokinetics (ADME).
Two important properties of drug are –
Efficacy
Potency
Potency – the concentration (or dose) of a drug required to produce a given fraction (commonly 50%) of its
maximal effect; a more potent drug acts at a lower dose.
Efficacy – the maximum response that a drug can produce, regardless of dose.
Maximal efficacy of a drug assumes that all receptors are occupied by the drug; if more drug is
added, no additional response will be observed.
A drug with greater efficacy is more therapeutically beneficial than one that is merely more potent.
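A minimal sketch of the dose-response (Emax) model that separates these two properties (Python; the Emax and EC50 values for "drug A" and "drug B" are invented):

# Emax model: efficacy is the ceiling (Emax) of the curve, potency is the
# concentration giving half of that ceiling (EC50).
def effect(concentration: float, e_max: float, ec50: float) -> float:
    return e_max * concentration / (ec50 + concentration)

# Drug A: higher efficacy, lower potency.  Drug B: lower efficacy, higher potency.
for c in (0.1, 1, 10, 100, 1000):
    print(f"C = {c:>6}  drug A = {effect(c, e_max=100, ec50=10):6.1f}"
          f"   drug B = {effect(c, e_max=60, ec50=1):6.1f}")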
Lipinski's rule of five-
also known as Pfizer's rule of five or simply the rule of five (RO5), is a rule of thumb to
evaluate druglikeness or determine if a chemical compound with a certain pharmacological or biological
activity has chemical properties and physical properties that would make it a likely orally active drug in
humans. The rule was formulated by Christopher A. Lipinski in 1997, based on the observation that most orally
administered drugs are relatively small and moderately lipophilic molecules.[1][2]
The rule describes molecular properties important for a drug's pharmacokinetics in the human body, including
their absorption, distribution, metabolism, and excretion ("ADME"). However, the rule does not predict if a
compound is pharmacologically active.
The rule is important to keep in mind during drug discovery when a pharmacologically active lead structure is
optimized step-wise to increase the activity and selectivity of the compound as well as to ensure drug-like
physicochemical properties are maintained as described by Lipinski's rule. Candidate drugs that conform to the
RO5 tend to have lower attrition rates during clinical trials and hence have an increased chance of reaching the
market.
No more than 5 hydrogen bond donors (the total number of nitrogen–hydrogen and oxygen–
hydrogen bonds)
No more than 10 hydrogen bond acceptors (all nitrogen or oxygen atoms)
A molecular mass less than 500 daltons
An octanol-water partition coefficient (log P) that does not exceed 5
Note that all numbers are multiples of five, which is the origin of the rule's name. As with many other rules of
thumb, such as Baldwin's rules for ring closure, there are many exceptions.
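A small sketch of applying the four criteria above (Python; the descriptor values are assumed to be supplied, and the aspirin numbers are approximate, for illustration only):

# Rule-of-five check: the four descriptors are assumed to be given (in practice
# they would be computed from the structure, e.g. with a cheminformatics toolkit).
def passes_rule_of_five(mol_weight: float, log_p: float,
                        h_bond_donors: int, h_bond_acceptors: int) -> bool:
    return (mol_weight < 500             # molecular mass below 500 daltons
            and log_p <= 5               # octanol-water partition coefficient <= 5
            and h_bond_donors <= 5       # N-H and O-H bond count <= 5
            and h_bond_acceptors <= 10)  # N and O atom count <= 10

# Approximate descriptors for aspirin (illustrative).
print(passes_rule_of_five(mol_weight=180.2, log_p=1.2,
                          h_bond_donors=1, h_bond_acceptors=4))   # True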
Variants
In an attempt to improve the predictions of drug-likeness, the rules have spawned many extensions, for example
the Ghose filter, which requires a partition coefficient log P between −0.4 and +5.6, molar refractivity between
40 and 130, a molecular weight between 160 and 480, and between 20 and 70 atoms.
Lead-like
During drug discovery, lipophilicity and molecular weight are often increased in order to improve the affinity
and selectivity of the drug candidate, so it is often difficult to maintain drug-likeness (i.e., RO5 compliance)
during hit and lead optimization. It has therefore been proposed that members of screening libraries from which
hits are discovered should be biased toward lower molecular weight and lipophilicity, so that medicinal chemists
will have an easier time delivering optimized drug development candidates that are also drug-like. For this
reason the rule of five has been extended to the rule of three (RO3) for defining lead-like compounds.
A rule of three compliant compound is defined as one that has:
An octanol-water partition coefficient (log P) not greater than 3
A molecular mass less than 300 daltons
Not more than 3 hydrogen bond donors
Not more than 3 hydrogen bond acceptors
Not more than 3 rotatable bonds