
PDF of the M.Sc. dissertation report by Kalyan Kumar Pasumarthy


NON-CODING RNA PREDICTION OF CLINICALLY IMPORTANT MYCOPLASMA
BY COMPARATIVE GENOMIC ANALYSIS

Dissertation submitted to the Madurai Kamaraj University

In partial fulfillment for the requirement of
Masters of Science in Biotechnology

Submitted by
Reg No: A242009

SCHOOL OF BIOTECHNOLOGY
MADURAI KAMARAJ UNIVERSITY
MADURAI 625 021

May 2004
To
THE
SMALL AND POWERFUL
Non-coding RNA
DECLARATION

I declare that this dissertation entitled "Non-coding RNA prediction of clinically
important Mycoplasma using comparative genome analysis", submitted by me in partial
fulfillment for the requirement of Masters of Science in Biotechnology to the Madurai
Kamaraj University, is based on the work carried out by me in the School of
Biotechnology, Madurai Kamaraj University, Madurai, under the guidance and
supervision of Dr. Z. A. Rafi, Reader, School of Biotechnology, Madurai Kamaraj
University, Madurai. I also declare that neither this dissertation nor any part of it has
been submitted elsewhere for any other degree or diploma.

Madurai-21 Regn. No.:A242009

May 7, 2004
ACKNOWLEDGEMENTS

I owe my gratitude to DR. Z.A. RAFI for his guidance and supervision in
this project. His care and concern have been the driving force for me all through this work.
I am thankful for his constant advice and encouragement. I am thankful to Prof.
S. Krishnaswamy for introducing me to the field of Bioinformatics.

I would also like to thank my classmates Anurag, Basanth, Dinesh, Geeta,
Hridesh, Kaiser, Netrapal, Subhanjan, Sucharitha and Vijay for their support and company
during the past two years, which made my stay in Madurai a memorable one. I would like
to thank Deepak for his help in creating a C programme.

My special thanks are due to my roommate and friend Santosh for his
constructive criticism of my mistakes. I acknowledge my special friend Ayushi, who has
been my rich source of encouragement and entertainment during the last phase at MKU.

I am indebted to the entire School of Biotechnology for making my M.Sc
an intellectually stimulating experience.

I also acknowledge the Dept. of Science and Technology, Government of
India, for its financial support over the last five years through the Kishore Vaigyanik
Protsahan Yojana, and the Dept. of Biotechnology, Government of India, for supporting
this project.
CONTENTS

1. Briefing
2. Introduction
3. Review of Literature
4. Materials
5. Methods
6. Results
7. Discussion
8. References
BRIEFING

Small untranslated RNA molecules are found in all kingdoms of life.
Many of those discovered to date are conserved between closely related
organisms and have a characteristic secondary structure. They have been found to perform
diverse functions, mainly the regulation of gene expression. Non-coding RNAs (ncRNAs)
are difficult to detect biochemically or to predict by traditional sequence analysis.

To search for ncRNAs that may play an important role in the life cycle of
pathogenic Mycoplasma, we used a well-established computational strategy that
distinguishes conserved RNA secondary structures from a background of other conserved
sequences using probabilistic models of expected mutational patterns in pairwise
sequence alignments.

We report here a complete genome screen for ncRNAs carried out with this
method on the six completely sequenced Mycoplasma genomes available, using
comparative sequence analysis. The screen yielded several putative ncRNAs.
The majority of the predicted ncRNA sequences are 130-160 nucleotides long, and
the number of ncRNAs predicted was in proportion to genome size for all but one
genome. Our candidate ncRNAs showed similarity to a few of the
biochemically characterized ncRNAs in bacteria as well as eukaryotes. This suggests that
these ncRNAs are broadly conserved across the kingdoms of life. This finding
makes our putative ncRNAs suitable candidates for drug discovery and
developmental studies of Mycoplasma.

INTRODUCTION

The central dogma of molecular biology defines a general pathway for the
expression of genetic information: information stored in DNA is transcribed into transient
mRNA and decoded on ribosomes with the help of adapter RNAs to produce proteins, which
in turn perform all enzymatic and structural functions in the cell. According to this view,
RNAs play a rather accessory role, and the complexity of a given organism is defined by the
constellation of proteins encoded by its genome. However, the discovery of RNAs
performing enzymatic and other functional roles in the cell complicated this
picture.

The discovery of the catalytic nature of RNase P and of the self-splicing activity
of group I introns suggested that the functions of RNA go far beyond a passive role in the
expression of protein-coding genes. More recent discoveries have attributed a variety of
regulatory roles to RNA, including control of plasmid replication, transposition in
prokaryotes and eukaryotes, phage development, viral replication, bacterial virulence,
global circuits in bacteria in response to environmental changes, and developmental
control in lower eukaryotes.

The functions reviewed above suggest that RNAs once considered
non-functional are not merely molecular fossils left from time immemorial. Analyses
of several sequenced genomes suggest that protein-coding genes alone are not enough to
account for the complexity of higher organisms. Genomic analysis has shown that as an
organism's complexity increases, the protein-coding fraction of its genome
decreases. It is estimated that about 98% of the transcriptional output of eukaryotic
genomes, and up to 10% of that of prokaryotic genomes, is RNA that does not encode any
protein.

In this context, ncRNAs are defined as heterogeneous transcripts that have
a wide functional spectrum. Broadly, ncRNAs can be divided into two classes:

1. Housekeeping RNAs that are constitutively expressed and required for normal
functions and viability of the cell.
2. Regulatory ncRNAs, by contrast, include those that are expressed at certain stages
of an organism’s development or cell differentiation, or as a response to external
stimuli.

Many of these ncRNAs were discovered by chance while researchers were
studying individual genetic systems. ncRNA species have been difficult to detect by
targeted experimental procedures or by traditional computational approaches.

In the present study, an attempt has been made to screen the completely
sequenced genomes of clinically important Mycoplasma* for ncRNAs by
comparative sequence analysis.

*M. penetrans, M. mycoides, M. gallisepticum, M. pulmonis, M. pneumoniae, M. genitalium

LITERATURE

Gene-finding algorithms generally assume that the target is a protein-coding
gene that produces mRNA, and they fail to detect ncRNAs.
However, a few computational strategies have recently emerged to detect ncRNAs.
They can be classified into the following four categories:

Sequence similarity analysis: A newly sequenced genome is simply searched for
similarity to known ncRNAs [Lowe et al., 1991; Lowe et al., 1999; Zwieb et al.,
1999].

Transcriptional signal analysis: This approach relies on the fact that ncRNAs are
transcribed but not translated. It systematically searches for ncRNA genes that have
transcriptional signals but no translational signals [Argman et al., 2001; Olivas et al.,
1997].

Statistical analysis: This involves analysing the base composition statistics of
non-coding regions in comparison with coding regions [Shattner, 2002].

Comparative genomic analysis: Sequences conferring important characteristics are
conserved across related genomes, and the same is assumed to hold for ncRNAs.
Related genomes are therefore compared to screen for ncRNAs
[Elena Rivas 2001; Elena Rivas et al., 2001; Wassarman et al., 1999].

The aim of the current study is to find ncRNAs that may play an important
role in the pathogenesis of clinically important Mycoplasma. The study
was carried out using the comparative genomic analysis approach. These genomes were
selected because the Mycoplasma species concerned share the ability to cause disease,
so a comparative genomic analysis is expected to highlight the group of ncRNAs
that contribute to pathogenesis.

To predict ncRNAs by comparative genomics we used the computational
tool QRNA [Elena Rivas 2001], which is the heart of our project. The following
describes its development and how it works.

Earlier RNA gene-finding approaches had been explored, but
with limited success [Elena Rivas 2000]. An early hypothesis was that
biologically functional RNA structures may have more stable predicted secondary
structures than would be expected for a random sequence of the same base composition
[Chen JH et al., 1990; Le SY et al., 1988; 1990]. Although this hypothesis is true to a
certain extent, it has been reported that predicted secondary structure stability alone
does not give a usable signal: the predicted stability of structural RNAs is
not sufficiently distinguishable from that of random sequence to serve as
the basis for a reliable ncRNA gene-finding algorithm [Elena Rivas 2000]. Nonetheless,
conserved RNA secondary structure remained the best hope for an exploitable statistical
signal in ncRNA genes. Hence, these approaches were coupled with comparative
sequence analysis to obtain additional statistical signals [Elena Rivas 2001].

Comparative sequence analysis for ncRNA genes builds on earlier work
that used the BLASTN programme to locate genomic regions with significant
sequence similarity between two related bacterial species. A computational tool,
CRITICA, analyzed the pattern of mutation in these ungapped, aligned conserved regions
for evidence of coding structure [Badger 1999]. For example, mutations to synonymous
codons get positive scores, while aligned triplets that translate to dissimilar amino acids
get negative scores. The programme then extends any ungapped seed alignments
classified as coding into complete ORFs.

QRNA is an extension of CRITICA to identify structural RNA regions. The
extensions include:
1. using fully probabilistic models;
2. adding a third model of pairwise alignments constrained by structural RNA
evolution;
3. allowing gapped alignments; and
4. allowing for the possibility that only part of the pairwise alignment may represent
a coding region or structural RNA, because a primary sequence alignment may
extend into flanking non-coding or nonstructural conserved sequence.

These extensions add complexity to the approach. QRNA uses probabilistic modeling
methods and formal languages to guide the construction of its models: pair hidden Markov
models (pair-HMMs) and a pair stochastic context-free grammar (pair-SCFG) are used to
produce three evolutionary models, for coding regions, structural RNA, or something else.
Given the three probabilistic models and a pairwise sequence alignment to be tested, QRNA
calculates the Bayesian posterior probability that the alignment should be classified as
coding, structural RNA, or something else.
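
The classification step can be illustrated with a minimal sketch. It assumes that each model's score is a log2-likelihood (in bits) and that the three models have equal prior probability; both are assumptions made for illustration and are not taken from the QRNA source code. The function name posterior_log_odds is hypothetical. The example scores are copied from the qrna output reproduced in Fig7.

```python
import math

def posterior_log_odds(scores_bits, priors=None):
    """Return {model: log2(P(model | alignment) / (1 - P(model | alignment)))}.

    scores_bits: log2-likelihood of the alignment under each model (assumed unit: bits).
    priors: prior probability of each model (assumed uniform when omitted).
    """
    models = list(scores_bits)
    if priors is None:
        priors = {m: 1.0 / len(models) for m in models}
    # Subtract the best score first so that 2**x cannot overflow.
    best = max(scores_bits.values())
    weight = {m: priors[m] * 2.0 ** (scores_bits[m] - best) for m in models}
    log_odds = {}
    for m in models:
        others = sum(w for name, w in weight.items() if name != m)
        log_odds[m] = math.log2(weight[m] / others)
    return log_odds

# Scores from the qrna output shown in Fig7 (OTH, COD, RNA lines).
scores = {"OTH": 184.281, "COD": 166.408, "RNA": 179.710}
log_odds = posterior_log_odds(scores)
winner = max(log_odds, key=log_odds.get)
```

Under these assumptions the alignment is assigned to OTH, and the computed log-odds are close to the sigmoidal scores printed in Fig7, which suggests, but does not prove, that QRNA reports a quantity of this kind.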

QRNA screens for conserved RNA secondary structures. It therefore also detects
various non-genic sequences with conserved RNA structures, including rho-independent
terminators, rRNA spacers, transcriptional attenuators in ribosomal protein and amino
acid biosynthetic operons, other cis-regulatory RNA structures, and even certain
repetitive elements that form pseudoknots, stem loops, palindromic sequences, etc.

The predicted targets are referred to as ncRNA genes, but it must be
understood that this really means a conserved RNA secondary structure that may or may
not turn out to be an independent functional ncRNA gene upon further analysis.

MATERIALS

System configuration

Hardware specification:

• Machine name : Pentium IV
• CPU speed : 2.8 GHz
• RAM : 512 MB
• Hard disk : 80 GB

Operating system specifications:

• Red Hat Linux 9.0
• Microsoft Windows XP

Packages installed and applications used:

• Red Hat Linux 9.0
• EMBOSS-2.8.0
• WU BLAST 2.0
• QRNA
• Microsoft Office
• Perl 5.0

Selected genomes for the study:

• Mycoplasma penetrans
• Mycoplasma mycoides
• Mycoplasma gallisepticum
• Mycoplasma pulmonis
• Mycoplasma pneumoniae
• Mycoplasma genitalium

METHODS
Downloading the genomes of Mycoplasma:
For each of the selected organisms, a folder containing the genome in various
formats was downloaded from the NCBI ftp site
ftp://ftp.ncbi.nlm.nih/Bacteria/Mycoplasma_Species. The formats used were the fasta
format of the whole-genome nucleotide sequence (accession_number.fna file) and the
protein table format, which contains the start and end coordinates of the protein coding
regions (accession_number.ptt file).

Preparing range file of intergenic regions:

Range file preparation involves three steps, starting from the coordinates
of the protein coding regions.
• Getting coordinates of the protein coding regions –
The protein table file was opened in Microsoft Word, and the Convert: Table to Text and
Text to Table options were used to make a table containing only the coordinates of the
protein coding regions.
• Getting coordinates for intergenic regions –
The protein coding coordinates were pasted into a Microsoft Excel file, and simple
arithmetic was used to obtain the coordinates of the intergenic regions.
• Making a range file –
The intergenic region coordinates were copied into a Notepad file and given as input to a
C programme that selects only the intergenic regions longer than 49 nucleotides, for
further use as the input file for EMBOSS applications. This cutoff was chosen because
biochemically characterized ncRNA genes have a minimum length of 50 nucleotides (see
Results).
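
The three manual steps above can also be expressed as a short script. The following is a minimal sketch, not the Word/Excel/C procedure actually used: the function name intergenic_regions is hypothetical, coordinates are taken to be 1-based and inclusive as in the protein table, and overlapping genes are handled by tracking the furthest coding position seen so far (an assumption; the spreadsheet approach instead yields zero-length rows for overlaps). The genome is treated as linear, so a region wrapping around the origin is not joined.

```python
def intergenic_regions(coding, genome_length, min_length=50):
    """Return (start, end) pairs of gaps between coding regions.

    coding: list of (start, end) coordinates, 1-based and inclusive.
    Only gaps of at least min_length nucleotides are returned.
    """
    regions = []
    covered_to = 0  # last position covered by any coding region so far
    for start, end in sorted(coding):
        if start > covered_to + 1:
            regions.append((covered_to + 1, start - 1))
        covered_to = max(covered_to, end)
    if covered_to < genome_length:
        regions.append((covered_to + 1, genome_length))
    return [(s, e) for s, e in regions if e - s + 1 >= min_length]

# First six coding regions of the M. genitalium protein table (Fig1).
coding = [(735, 1829), (1829, 2761), (2846, 4798),
          (4813, 7323), (7295, 8548), (8552, 9184)]
all_gaps = intergenic_regions(coding, genome_length=9184, min_length=1)
kept = intergenic_regions(coding, genome_length=9184)
```

For these six genes the sketch keeps the regions 1..734 and 2762..2845, which are the first two entries of the range file shown in Fig3.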

Extracting the intergenic regions from the genomes:

The extractseq application in the EMBOSS suite was used to obtain each
intergenic region of a genome as a separate sequence in fasta format. This procedure was
repeated for each genome.
Syntax: extractseq -regions @rangefile -separate
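
What this step produces can be sketched in a few lines. The sketch below is only an illustration of slicing a genome by a list of ranges; it is not a re-implementation of extractseq. The function name extract_regions and the toy genome are hypothetical, and the header pattern accession_start_end is modelled on the headers shown in Fig4.

```python
def extract_regions(genome, ranges, accession, line_width=60):
    """Return fasta text with one record per (start, end) range.

    genome: the whole sequence as a string.
    ranges: 1-based, inclusive coordinates, as in the range file.
    """
    records = []
    for start, end in ranges:
        header = f">{accession}_{start}_{end}"
        seq = genome[start - 1:end]  # convert to Python's 0-based slicing
        lines = [seq[i:i + line_width] for i in range(0, len(seq), line_width)]
        records.append("\n".join([header] + lines))
    return "\n".join(records) + "\n"

# Toy 200-nucleotide genome (hypothetical data, for illustration only).
toy_genome = "ACGT" * 50
fasta = extract_regions(toy_genome, [(1, 4), (5, 70)], accession="TOY")
fasta_lines = fasta.splitlines()
```

Each record starts with a header naming the accession and the coordinates, followed by the sequence wrapped at 60 characters per line.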

Making Genome databases:
A database formatting programme from the WU BLAST 2.0 suite was used to make
the databases. Each database consisted of five genomes, excluding the genome that was to
be searched against it.
Syntax: xdformat -n -o database_name

Similarity search:
The intergenic regions of each genome were searched for similarity using the
blastn programme from the WU BLAST 2.0 suite, with default parameters, against the
database that does not contain that organism's genome.
Syntax: blastn database_name nucleotide_query >output_file_name
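
The leave-one-out design of the databases can be made explicit with a small sketch. The function name leave_one_out is hypothetical; the sketch only lists which genomes belong in the database for each query organism and does not run xdformat or blastn.

```python
GENOMES = ["M.gallisepticum", "M.genitalium", "M.mycoides",
           "M.penetrans", "M.pneumoniae", "M.pulmonis"]

def leave_one_out(genomes):
    """Map each query genome to the list of the other genomes."""
    return {query: [g for g in genomes if g != query] for query in genomes}

plan = leave_one_out(GENOMES)
for query, database in plan.items():
    print(query, "is searched against:", ", ".join(database))
```

Six databases of five genomes each result, one per query organism.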

Parsing WU BLAST 2.0 outputs:

The output file of WU BLAST 2.0 needs to be parsed by a perl script.
The parsing was done with default parameters using blastn2qrnadepth.pl, which is
available with the QRNA-2.0.1 suite. Parsing produces three output files; the one
with the .q extension (file_name.q) is used as input for the QRNA
application.
Syntax: blastn2qrnadepth.pl -g query_organism file_name

Non-coding RNA prediction:

The file_name.q file obtained from the parsed blast output was used as input
for QRNA, with a window size of 150 nucleotides and a slide of 50 nucleotides between
windows. The -B option was used to avoid false positive scores.
Syntax: qrna -w 150 -x 50 -B input_file_name > output_file_name
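
The window scanning implied by these settings can be sketched as follows. This is an illustration of a sliding window with width 150 and slide 50, not QRNA's own code: the function name windows is hypothetical, positions are 0-based and inclusive as in the posX/posY lines of the qrna output, and keeping a shorter final window at the end of the alignment is an assumption.

```python
def windows(aln_length, window=150, slide=50):
    """Return 0-based inclusive (start, end) pairs covering an alignment."""
    if aln_length <= window:
        return [(0, aln_length - 1)]
    spans = []
    start = 0
    while start < aln_length:
        end = min(start + window, aln_length) - 1
        spans.append((start, end))
        if end == aln_length - 1:
            break  # the alignment is fully covered
        start += slide
    return spans

# The first alignment in the qrna output (Fig7) is 664 columns long.
spans = windows(664)
```

With a slide of 50, consecutive windows overlap by 100 columns, so every position away from the alignment ends is scored three times.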

Extraction of loci identified as ncRNA:
The perl script phase_count_fast.pl was used, with default parameters, to prune
the QRNA output down to the independent genomic regions identified as RNAs.
The nucleotide sequences of the predicted ncRNAs were extracted by the same
procedure used for extracting the intergenic regions.
Syntax: phase_count_fast.pl file_name query_org database_org

RESULTS

Intergenic regions are a rich source of ncRNAs. As a first
step, the contribution of the intergenic regions to the genome of each organism was
calculated. Graph 1 shows the length of the selected genomes and Graph 2 displays the
percentage of intergenic regions in each genome. Graph 2 makes clear that
intergenic sequences make up a very small part of the genome compared with the protein
coding regions. This agrees with the general observation that prokaryotes possess only a
small percentage of intergenic regions [Mattick 2001].

The number of intergenic sequences was high, and many of them
were short stretches. Since biochemically
characterized ncRNA genes have a minimum length of 50 nucleotides, only stretches
of 50 nucleotides or more were considered. This
culling was done by an in-house C programme. Graph 3 displays the number of intergenic
regions before and after culling. Nearly half of the intergenic
regions were eliminated by this criterion.
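
The "nearly half" figure can be checked against the counts given in Graph 3. The short calculation below uses only those counts; the variable names are illustrative.

```python
# Number of intergenic regions before and after culling (Graph 3).
before = {"M.pen": 1037, "M.myc": 1016, "M.gal": 726,
          "M.pul": 782, "M.pne": 689, "M.gen": 480}
after = {"M.pen": 643, "M.myc": 572, "M.gal": 290,
         "M.pul": 376, "M.pne": 282, "M.gen": 122}

# Percentage of regions removed, per genome and over all six genomes.
removed_pct = {g: round(100 * (before[g] - after[g]) / before[g], 1)
               for g in before}
total_before = sum(before.values())
total_after = sum(after.values())
overall_pct = round(100 * (total_before - total_after) / total_before, 1)
```

Over all six genomes about 52% of the intergenic regions are removed, but the fraction varies widely between genomes, from about 38% in M. penetrans to about 75% in M. genitalium.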

The current analysis is based on the prediction of conserved secondary
structures and on comparative genomics, which relies on similarity between the existing
genomes. Hence, databases of groups of the organisms under study were created. Each
database was a collection of the genomes of the five other organisms, excluding the one
under study, and the organism under study was searched for similarity against it.
Table 1 lists each organism and the contents of the database against which it was
searched. The table also gives the number of hits that were fed as input to QRNA after
processing with the perl script blastn2qrnadepth.pl, which filters out hits below the
threshold as described in the Methods. These numbers show the relative
degree of similarity between the organisms with respect to genome size. The
results of the perl script are displayed in Graph 4. The graph indicates that the number of
similarity hits increased with genome size for almost all the selected genomes, the
exception being M. gallisepticum. This suggests that this
organism may have sequence characteristics different from those of the other
selected organisms.
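
The trend described here can be checked directly from the genome sizes in Graph 1 and the hit counts in Graph 4. The sketch below orders the genomes by size and reports every genome that has fewer hits than the next smaller genome; the variable names are illustrative.

```python
# Genome sizes in nucleotides (Graph 1) and blastn hits fed to QRNA (Graph 4).
genome_size = {"M.gen": 580074, "M.pne": 816394, "M.pul": 963879,
               "M.gal": 996422, "M.myc": 1211703, "M.pen": 1358633}
hits = {"M.gen": 386, "M.pne": 565, "M.pul": 1012,
        "M.gal": 850, "M.myc": 1787, "M.pen": 1852}

by_size = sorted(genome_size, key=genome_size.get)
# Genomes with fewer hits than the next smaller genome break the trend.
exceptions = [larger for smaller, larger in zip(by_size, by_size[1:])
              if hits[larger] < hits[smaller]]
hits_per_mb = {g: round(hits[g] / (genome_size[g] / 1e6)) for g in by_size}
```

Only M. gallisepticum breaks the ordering: it has fewer hits (850) than the slightly smaller M. pulmonis genome (1012).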

The similarity hits that passed the threshold were
evaluated by QRNA using a window scanning approach. A window size of 150
nucleotides and a slide of 50 nucleotides were chosen to minimize the CPU time taken
by QRNA.

The number of putative ncRNAs reported by QRNA for each organism is shown in
Graph 5. Here again, the number of predicted ncRNAs increases in proportion to
genome size, except for M. gallisepticum.

The spread of the lengths of the putative ncRNAs is plotted in Graph 6. The
graph shows the range, i.e., the shortest and the longest ncRNA predicted for each
organism, together with the average length, indicated by the horizontal line.

1. Mycoplasma genetalium G37 complete genome - 0..580074
480 proteins
Location Strand Length PID Gene Synonym Code COG Product

735..1829 + 364 3844620 MG001 - - - DNA polymerase III, subunit beta (dnaN)

1829..2761 + 310 1045670 MG002 - - - dnaJ-like protein

2846..4798 + 650 1045671 MG003 - - - DNA gyrase subunit B (gyrB)
4813..7323 + 836 1045672 MG004 - - - DNA gyrase subunit A (gyrA)
7295..8548 + 417 1045673 MG005 - - - seryl-tRNA synthetase (serS)
8552..9184 + 210 1045674 MG006 - - - thymidylate kinase (tmk)
………….. …. ….. ……….. ………. .. .. .. ….....................................

2.
735 1829
1829 2761
2846 4798
4813 7323
7295 8548
8552 9184
…… ……

3.
2762 2845
4799 4812
7324 7294
8549 8551
…… ……

Fig1: (1) Protein table format of the Mycoplasma genitalium genome, showing the annotation of the protein coding regions and the
names of the characterized and putative proteins.
(2) Coordinates of the protein coding regions alone, obtained after a series of conversions using the Table to Text and Text to
Table options in Microsoft Word.
(3) Coordinates of the intergenic sequences alone, obtained by simple arithmetic in Microsoft Excel.

Genome size comparison

Organism            Genome size (nucleotides)
M. genitalium       580,074
M. pneumoniae       816,394
M. pulmonis         963,879
M. gallisepticum    996,422
M. mycoides         1,211,703
M. penetrans        1,358,633

Abbreviations used in the graphs: M.gen - Mycoplasma genitalium; M.pne - Mycoplasma pneumoniae;
M.pul - Mycoplasma pulmonis; M.gal - Mycoplasma gallisepticum; M.myc - Mycoplasma mycoides;
M.pen - Mycoplasma penetrans

Graph1: GENOME LENGTH COMPARISON OF THE MYCOPLASMA (bar chart; axis: genome length, 0 to 1,500,000)

Graph2: BAR GRAPH SHOWING THE PERCENTAGE OF INTERGENIC REGION IN THE GENOME OF MYCOPLASMA
(stacked bars from 0% to 100% for M.pen, M.myc, M.gal, M.pul, M.pne and M.gen; series: intergenic region, protein coding region)

1. Starting Ending Length
1 734 734
2762 2845 84
4799 4812 14
7324 7294 0
8549 8551 3
9185 9156 0
9922 9923 2
11253 11251 0
12041 12068 28
12726 12701 0
13566 13569 4
14434 14395 0
15317 15555 239
……. ……. ….

2. Starting Ending Length

1 734 734
2762 2845 84
15317 15555 239
……. ……. …..

Fig2: (1) Intergenic sequence coordinates and their lengths in Mycoplasma genitalium, as
obtained by simple arithmetic in Microsoft Excel. Intergenic regions
range from a gap of 1 nucleotide to as many as thousands of nucleotides (not shown here).
(2) Intergenic regions after curation by the C programme, which removes the regions shorter
than 50 nucleotides. The number of intergenic regions decreases
considerably after curation.

#this is Mycoplasma genetalium G37 range file

1 734
2762 2845
15317 15555
19760 19824
20356 20543
28449 28650
36714 36977
38979 39127
47423 47580
…… ……

Fig3: The first few coordinates of the range file created for
Mycoplasma genitalium for use in the EMBOSS application.

Curing of intergenic regions (No. of intergenic regions)

          M.pen   M.myc   M.gal   M.pul   M.pne   M.gen
Before    1037    1016    726     782     689     480
After     643     572     290     376     282     122

Graph3: GRAPH SHOWING THE CULLING OF THE INTERGENIC SEQUENCES BY THE C PROGRAMME THAT SELECTS ONLY
THE REGIONS WHOSE LENGTH IS GREATER THAN OR EQUAL TO 50 NUCLEOTIDES

>L43967_1_734 Mycoplasma genetalium G37 intergenic sequence
TAAGTTATTATTTAGTTAATACTTTTAACAATATTATTAAGGTATTTAAAAAATACTATT
ATAGTATTTAACATAGTTAAATACCTTCCTTAATACTGTTAAATTATATTCAATCAATAC
ATATATAATATTATTAAAATACTTGATAAGTATTATTTAGATATTAGACAAATACTAATT
TTATATTGCTTTAATACTTAATAAATACTACTTATGTATTAAGTAAATATTACTGTAATA
CTAATAACAATATTATTACAATATGCTAGAATAATATTGCTAGTATCAATAATTACTAAT
ATAGTATTAGGAAAATACCATAATAATATTTCTACATAATACTAAGTTAATACTATGTGT
AGAATAATAAATAATCAGATTAAAAAAATTTTATTTATCTGAAACATATTTAATCAATTG
AACTGATTATTTTCAGCAGTAATAATTACATATGTACATAGTACATATGTAAAATATCAT
TAATTTCTGTTATATATAATAGTATCTATTTTAGAGAGTATTAATTATTACTATAATTAA
GCATTTATGCTTAATTATAAGCTTTTTATGAACAAAATTATAGACATTTTAGTTCTTATA
ATAAATAATAGATATTAAAGAAAATAAAAAAATAGAAATAAATATCATAACCCTTGATAA
CCCAGAAATTAATACTTAATCAAAAATGAAAATATTAATTAATAAAAGTGAATTGAATAA
AATTTTGGGAAAAA
>L43967_2762_2845 Mycoplasma genitalium G37 intergenic sequence
AAAACCTTTCATTTTTAATGTGTTATAATTATTTGTTATGCCATAAATTTAGTTTGTGGC
AAAAGCTTCTGTACTGTTTATTTA
>L43967_15317_15555 Mycoplasma genitalium G37 intergenic sequence
ACCCTCAACCTCCTGAGTGCAAATCAGGTGCTCTATCAGTTGAGCTACATCCCCATTATT
GGTGGAAGTAAATGGACTTGAACCATCGACCTCACCCTTATCAGGGGTGTGCTCTAACCA
ACTGAGCTATACTTCCAAGCATAATCCTAAGGGTATTTAACTAATTATTATAACAATTTT
AATTTAACCAAAATACCCCTCGAATTTTAACAGTTTTTATAATCAAAACAGCTAATTTT
>L43967_19760_19824 Mycoplasma genitalium G37 intergenic sequence
ATAAATTTAATAGTGTTGAAAGACAAACATTATTAATTTTTGATCAGCTAAATAAAACAA
AGCAA
>L43967_20356_20543 Mycoplasma genetalium G37 intergenic sequence
CTCAAAAAACTAATACATCAAACTTCAACCGTTTACTTTTTTATGAACAAGCACTACAAA
GGTTTTATGAAGAATTATTTCAAATAGATTATTTAAGAAGATTTGAAAACATTCCCATTA
AAGATAAGAATCAAATTGCGCTTTTTAAAACTGTTTTTGATGATTACAAAACCATTGATT
TAGCAGAA

Fig4: Result of the extractseq application in the EMBOSS suite. The figure shows, in fasta format,
the first few intergenic sequences of Mycoplasma genitalium obtained from extractseq, given the range file and
the whole genome sequence as input.

Organism           Database created   Organisms in database                                                        No. of blastn hits
M. penetrans       ggmpnpudb          M. gallisepticum, M. genitalium, M. mycoides, M. pneumoniae, M. pulmonis     1852
M. mycoides        ggpppdb            M. gallisepticum, M. genitalium, M. penetrans, M. pneumoniae, M. pulmonis    1787
M. gallisepticum   gempppdb           M. genitalium, M. mycoides, M. penetrans, M. pneumoniae, M. pulmonis         850
M. pulmonis        ggmpepndb          M. gallisepticum, M. genitalium, M. mycoides, M. penetrans, M. pneumoniae    1012
M. pneumoniae      ggmpepudb          M. gallisepticum, M. genitalium, M. mycoides, M. penetrans, M. pulmonis      565
M. genitalium      gampppdb           M. gallisepticum, M. mycoides, M. penetrans, M. pneumoniae, M. pulmonis      386

Table1: The databases created with the WU BLAST 2.0 application and the
organism that was searched for similarity against each database.

BLASTN 2.0MP-WashU [03-Mar-2004] [linux24-i686-ILP32F64 2004-03-03T16:23:09]

Copyright (C) 1996-2004 Washington University, Saint Louis, Missouri USA.

Reference: Gish, W. (1996-2004) http://blast.wustl.edu

Notice: this program and its default parameter settings are optimized to find
nearly identical sequences rapidly. To identify weak protein similarities
encoded in nucleic acid, use BLASTX, TBLASTN or TBLASTX.

Query= L43967_1_734 Mycoplasma genetalium G37 intergenic sequence

(734 letters; record 1)

Database: gal.fasta
5 sequences; 5,347,031 total letters.
Searching....10....20....30....40....50....60....70....80....90....100% done

WARNING: hspmax=1000 was exceeded by 1 of the database sequences, causing the

associated cutoff score, S2, to be transiently set as high as 81.

Smallest
Sum
High Probability
Sequences producing High-scoring Segment Pairs: Score P(N) N

gb|U00089| Mycoplasma pneumoniae M129, intergenic sequence 692 1.8e-26 1

emb|BX293980.1| Mycoplasma mycoides subsp. mycoides SC ge... 675 1.1e-25 1
dbj|BA000026| Mycoplasma penetrans, intergenic sequence 602 2.1e-22 1
emb|AL445566| Mycoplasma pulmonis (strain UAB CTIP) inter... 539 1.3e-19 2
gb|AE015450.1| Mycoplasma gallisepticum strain R intergen... 528 4.6e-19 1

>gb|U00089| Mycoplasma pneumoniae M129, intergenic sequence
Length = 816,394

Plus Strand HSPs:

Score = 692 (109.9 bits), Expect = 1.8e-26, P = 1.8e-26

Identities = 410/664 (61%), Positives = 410/664 (61%), Strand = Plus / Plus

Query: 90 TTAATACTGTTAAATTATATTCAATCAATACATATATAATATTATTAAAATACT-TGATA 148

| |||||| | ||||| || | | | | | | ||| | |||| | | ||
Sbjct: 130 TAAATACTAATCTTCTATATAGTATAGAGAAACTTTTTCT-TTAACATAATATTATCTTA 188

Query: 149 AGTATTATTTAGATATTAGACAAAT-ACTAATTTTA-TATTGCTTTAATACT-TAATAAA 205

| ||||||||| || || | | | | || | ||| |||| |||| || | | |||
Sbjct: 189 A-TATTATTTACCTACTA-ATAGCTTAATATTATTAGTATTTATTTAGTATTATGCTAA- 245

Query: 206 TACTACTTATGTATTAAGTAAATATTACTGTAATACTAATAA-C-AATATTATTAC-AAT 262

||||| | ||||| | ||||||| | || || || || | ||||||||| |||
Sbjct: 246 TACTATGCAGATATTATCTTAATATTA-TCTA-TAGTATTAGGCTAATATTATTCTTAAT 303

Query: 263 ATGCTAGAATAATATTGCTAGTATCAATAATTACTAATATAGTATTAGGAAAATACCATA 322

|| || ||| | ||| | || || || || | ||||| || ||||| | |
Sbjct: 304 ATT-TAT--TAAGGTA-CTAA-AGCATTACCTA-TAGGTGA-TATTATGACAATACTAAA 356

Query: 323 ATAAT-ATTTCTAC-ATAATACTAAGTTAATACTATGTGTAGAATAATAAATAATCAGAT 380

| | | | || | || || | || | ||| | | || | || | || |
Sbjct: 357 GTGGTTAGTATTATTAGGGTATTAT-TCAA-AGTAT-TCTCCAACACTATTCCCTTAGCT 413

Fig5: Output of the blastn programme from WU BLAST 2.0, run with the intergenic sequences of Mycoplasma genitalium
against the database containing the intergenic sequences of the other five Mycoplasma genomes: M. gallisepticum, M. mycoides,
M. penetrans, M. pneumoniae and M. pulmonis (the alignment is only partially shown). The blastn programme was run with default
parameters.

No. of Blast hits

M.gen 386
M.pne 565
M.pul 1012
M.gal 850
M.myc 1787
M.pen 1852

Graph4: GRAPH SHOWING NUMBER OF BLAST HITS FOR EACH GENOME

>L43967_15317_15555-1>179-Mycoplasma
ACCCTCAACCTCCTGAGTGCAAATCAGGTGCTCTATCAGTTGAGCTACATCCCCATTATT
GGTGGAAGTAAATGGACTTGAACCATCGACCTCACCCTTATCAGGGGTGTGCTCTAACCA
ACTGAGCTATACTTCCAAGCATAATCCTAAGGGTAT-TTAACTA-ATTATTATAACAATT
T
>gb-U00089--19096>19275-Mycoplasma
ACCCTCAACCTCCTGAGTGCAAATCAGGTGCTCTATCAGTTGAGCTACATCCCCATTATT
GGTGGAAGTAAATGGACTTGAACCATCGACCTCACCCTTATCAGGGGTGTGCTCTAACCA
ACTGAGCTATACTTCCAGGCAAAATCTTC-GTACAGGTTCGCTTCATAATTATATTAATT
T
>L43967_15317_15555-1<239-Mycoplasma
AAAATTA-GCTGTTTTGATTATAAAAACTG-TTAAAATTCGAGGGGTATTTTGGTTAAAT
TAAAATTGTTATAATAATTA-GTTA-AATACCCTTAGGATT-ATGCTTGGAAGTATAGCT
CAGTTGGTTAGAGCACACCCCTGATAAGGGTGAGGTCGATGGTTCAAGTCCATTTACTTC
CACCAATAAT---GGGGATGTAGCTCAACTGATAGAGCACCTGATTTGCACTCAGGAGGT
TGAGGGT
>gb-AE015450.1--417273>417511-Mycoplasma
AATTTTACGC-GTTGTTATTACCAATCGAAATTAAAAATTAAGCAG-ATATTCTTTAA--
TGAGCT-GA-AT--TAATTATGTTATAATTCATATGGCAATCACGACTGGAAGTATAGCT
CAGCTGGTTAGAGCACACCCCTGATAAGGGTGAGGTCGATGGTTCAAGTCCATTTACTTC
CACCAGTTTTTTTGGGGACGTAGCTCAATTGATAGAGCACCTGATTTGCACTCAGGAGGT
CGAGGGT
>L43967_19760_19824-5<65-Mycoplasma
TTGCTTTGTTTTATTTAGCTGATCAA-AAATTAATAATGTTTGTCTTTCAACACTATTAA
AT
>emb-BX293980.1--57200>57261-Mycoplasma
TTGTTTTGTTTTATTTAATTGATCAATAAATTGATTTAGTTTATCTTTATTTATTAATAA
AT

Fig6.1: One of the output files of the perl script blastn2qrnadepth.pl, run with the blastn result of M. genitalium
intergenic sequences vs. the Mycoplasma database as input. This first file has the extension .q (here genblast.q) and is the file used as
input for the qrna programme in the QRNA-2.0.1 suite. It consists of a collection of sequences in fasta format, in which each consecutive
pair of sequences represents the two components of an alignment, with the gaps left in place.

1. FILE: genblast
DIR: /home/kalyankpy/coput2/blast//

FIRST TRIMMING
Minimum length = 1
Maximum Evalue = 0.01
Minimum %id = 0
Maximum %id = 100

SECOND TRIMMING
Alignments culled by = SC
Depth of alignments = 1
shift =1

113-QUERY: L43967_546708_546877 Mycoplasma genitalium G37 intergenic sequence

Total # alignments: 1121 After First trimming: 88 After Second trimming: 2
57-QUERY: L43967_325878_326027 Mycoplasma genitalium G37 intergenic sequence
Total # alignments: 152 After First trimming: 3 After Second trimming: 3
72-QUERY: L43967_386409_386461 Mycoplasma genitalium G37 intergenic sequence
Total # alignments: 292 After First trimming: 0 After Second trimming: 0
68-QUERY: L43967_364415_364533 Mycoplasma genitalium G37 intergenic sequence
Total # alignments: 155 After First trimming: 1 After Second trimming: 1
……………………………………………………………………………………………….

Total #Queries 122

Total #Alignments 53927 ave_len = 309.5
After first trimming 18851 ave_len = 552.6
After second trimming 386 ave_len = 404.2

Fig6.2: The second output file of the perl script blastn2qrnadepth.pl, run with the blastn result of
M. genitalium intergenic sequences vs. the Mycoplasma database as input. This file has the extension .q.rep (here genblast.q.rep)
and reports the BLASTN alignments that were pruned in the process of creating the .q file,
according to the options used in the perl script.
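
The first trimming step reported in this file can be illustrated with a minimal sketch. It applies the four thresholds listed in the report (minimum length 1, maximum E-value 0.01, minimum %id 0, maximum %id 100) to a list of hits. The Hit class and the function first_trimming are hypothetical names, treating the thresholds as inclusive is an assumption, and the sketch does not reproduce the second trimming, whose logic is not described here. The first example hit uses the values shown in Fig5; the second is an invented weak hit.

```python
from dataclasses import dataclass

@dataclass
class Hit:
    query: str
    subject: str
    length: int      # alignment length in columns
    evalue: float
    pct_id: float    # percentage of identical columns

def first_trimming(hits, min_len=1, max_evalue=0.01, min_id=0.0, max_id=100.0):
    """Keep hits that satisfy all four thresholds (assumed inclusive)."""
    return [h for h in hits
            if h.length >= min_len
            and h.evalue <= max_evalue
            and min_id <= h.pct_id <= max_id]

hits = [
    # Values from the alignment shown in Fig5 (410 identities in 664 columns).
    Hit("L43967_1_734", "gb|U00089|", 664, 1.8e-26, 100 * 410 / 664),
    # Hypothetical weak hit, included only to show a rejection.
    Hit("L43967_1_734", "hypothetical_subject", 40, 0.5, 55.0),
]
kept_hits = first_trimming(hits)
```

With the default thresholds only the E-value has any effect, since every alignment has a length of at least 1 and an identity between 0% and 100%.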

#---------------------------------------------------------------------------------
# qrna 2.0.1 (Tue Aug 19 11:30:55 CDT 2003) using squid 1.5m (Sept 1997)
#---------------------------------------------------------------------------------
# PAM model = BLOSUM62
#---------------------------------------------------------------------------------
# RNA model = /mix_tied_linux.cfg
# RIBOPROB matrix = /RIBOPROB85-60.mat
#---------------------------------------------------------------------------------
# seq file = /home/kalyankpy/perlscriptresult/genblast.q
# #seqs: 772 (max_len = 3420)
#---------------------------------------------------------------------------------
# window version: window = 150 slide = 50 -- length range = [0,9999999]
#---------------------------------------------------------------------------------
# 1 [both strands] (sre_shuffled)
>L43967_1_734-90>722-Mycoplasma (664)
>gb-U00089--130>767-Mycoplasma (664)

length of whole alignment after removing common gaps: 664

Divergence time (variable): 0.401
[alignment ID = 61.75 MUT = 29.67 GAP = 8.58]

length alignment: 150 (id=61.33) (mut=32.67) (gap=6.00)(sre_shuffled)

posX: 0-149 [0-145](146) -- (0.42 0.08 0.06 0.43)
posY: 0-149 [0-144](145) -- (0.37 0.11 0.06 0.46)

L43967_1_734-90 TTAATTTTATTAAAACTATAACTTATTTTTTATAAACATTCTATGTTTTT
gb-U00089--130> TTTATTTTATTAAAATTATAATGTATTTTTGTTAAATTTT.TAATTCTTT

L43967_1_734-90 TAAA.CAAATGAGAAATATAGTAATAAAGCAAATT.TTTTCACCAT.TTT
gb-U00089--130> CAGTGCACATA.CCTATTCGCTAGTTAA.ACGATAAAGTTAAAGAAATTT

L43967_1_734-90 TTTATTATATCA.AAATTTAAAGAAAAATCTGAAAATTATCTATAATGTG
gb-U00089--130> TTCTTTATATTCTAAATTT.AAAAATCTTCTCAATATAATACATAAT.TC

LOCAL_DIAG_VITERBI -- [Inside SCFG]

OTH ends *(+) = (0..[150]..149)
OTH ends (-) = (0..[150]..149)
COD ends *(+) = (120..[27]..146)
COD ends (-) = (41..[12]..52)
RNA ends *(+) = (0..[21]..20)
RNA ends (-) = (0..[150]..149)
winner = OTH
OTH = 184.281 COD = 166.408 RNA = 179.710
logoddspostOTH = 0.000 logoddspostCOD = -17.873 logoddspostRNA = -4.571
sigmoidalOTH = 4.571 sigmoidalCOD = -17.932 sigmoidalRNA = -4.571

Fig7: This is the qrna output file obtained with the command: qrna -w 150 -x 50 -a -B genblast.q.
qrna is a C programme written to evaluate a given alignment for its ability to form a structural RNA. The figure above is the
partial output of a qrna run with the scanning-window option (here window size = 150 and slide = 50 nucleotides).
• Every new BLAST alignment starts with two lines: “>Query_name” followed by “>Subject_name”.
• “Divergence time” indicates the particular time parameterization of QRNA used. By default QRNA decides on the divergence time
(in this case 0.401) given the percentage identity of the alignment (61.75%).
• Each newly analyzed window starts with the line: length alignment:
For each window and for each sequence in the alignment there is a line of the form:
posX: 0-149 [0-145](146) -- (0.42 0.08 0.06 0.43)
The first pair of numbers gives the first and last coordinates of the window with respect to the beginning of the alignment. The pair
of numbers in brackets gives the mapping of the window into the coordinate system of sequence X (after removing the gaps).
The adjacent number in parentheses is the length of that segment in sequence X. Finally, the four decimal numbers in parentheses are
the fractions of A, C, G and T, respectively, in the segment of sequence X involved in that particular window.
• For each model and for each strand, the output gives the actual local regions (there can be more than one per model and strand) that
score according to the model. The notation is (from..[Length]..to). Coordinates for both strands are given relative to the positive
strand. The * indicates the strand with the strongest signal for a given model.
• For the given scoring algorithm (here local Viterbi, the default) there are three rows of numbers:
Row 1: The scores of the alignment under each of the three models. The null model is a fourth model, which assumes that the two
sequences in the alignment are independent of each other.
Row 2: The log-odds posterior probabilities of the COD and RNA models with respect to the OTH model.
Row 3: The three sigmoidal scores, each calculated using the other two models as null models. The model with the highest sigmoidal
score is the winner.
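
The relation between the three rows can be checked numerically. The Python sketch below assumes that the Row 1 scores are log2 values in bits and that the three models have equal prior weight. Under those assumptions it reproduces the Row 2 and Row 3 values printed in the output above to within rounding. The function names are illustrative and are not part of QRNA.

```python
import math


def log2_sum_exp2(values):
    """Return log2(sum(2**v for v in values)) without overflowing."""
    top = max(values)
    return top + math.log2(sum(2.0 ** (v - top) for v in values))


def posterior_scores(scores, reference="OTH"):
    """Derive Row 2 (log-odds vs. the reference model) and Row 3 (sigmoidal)."""
    logodds = {model: s - scores[reference] for model, s in scores.items()}
    sigmoidal = {}
    for model, s in scores.items():
        # The other two models together play the role of the null hypothesis.
        others = [v for name, v in scores.items() if name != model]
        sigmoidal[model] = s - log2_sum_exp2(others)
    winner = max(sigmoidal, key=sigmoidal.get)
    return logodds, sigmoidal, winner


# Row 1 of the window shown in Fig7.
fig7_scores = {"OTH": 184.281, "COD": 166.408, "RNA": 179.710}
logodds, sigmoidal, winner = posterior_scores(fig7_scores)

for model in ("OTH", "COD", "RNA"):
    print(f"{model}: logoddspost = {logodds[model]:8.3f}  sigmoidal = {sigmoidal[model]:8.3f}")
print("winner =", winner)
```

For this window the log-odds of COD and RNA against OTH are both negative, so OTH is the only model with a positive sigmoidal score and is reported as the winner.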

---------------Some General Statistics-------------------
FILE: ./genblast.2qrna
method: LOCAL_DIAG_VITERBI
Cutoff: 5

max id: 100

# blastn hits: 386

# windows: 2574
---------------------------------------------------------

---------------Statistics by Windows---------------------
# windows: 2574

RNA>0: 41/2574
RNA>cutoff: 2/2574

COD>0: 2/2574
COD>cutoff: 0/2574

in phases: 2045/2574
RNA: 2/2045
COD: 0/2045
OTH: 2043/2045

in transitions: 0/2574
RNA/COD: 0/0
RNA/OTH: 0/0
COD/OTH: 0/0
RNA/COD/OTH: 0/0
---------------------------------------------------------

---------------Statistics for RNA loci ():-------------------

# loci: 1
ave_length: 196.00

1-loci L43967_167180_175806-Mycoplasma 7457 7653 (197) 2 RNA -39.20 26.19

Fig8: This is the output of the Perl script phase_count_fast.pl, which extracts the RNA loci (those with an RNA
score larger than a cutoff set with option -u; here -u is left at its default of 5). The script identified 1
independent locus in Mycoplasma genitalium that scores as RNA above 5 bits among the 2574
windows from the 386 blastn alignments. The coordinates of each locus are listed in the
following form:
num-loci name_seq(seq_length)loc_from loc_to(loc_length)number_wind
type_loc COD_sc RNA_sc
Therefore,
1-loci L43967_167180_175806-Mycoplasma 7457 7653 (197) 2 RNA -39.20 26.19
means that the first M.genitalium locus corresponds to the intergenic sequence named
L43967_167180_175806-Mycoplasma. The locus has a length of 197 nucleotides and covers the
region from 7457 to 7653. Two different windows contributed to this RNA locus; the
average sigmoidal score for the coding model is -39.20 bits, while the average sigmoidal score for the
RNA model is 26.19 bits.
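
A locus line of this kind is easy to read programmatically. The following Python sketch is a hypothetical parser written only for the line layout shown above; it is not part of the QRNA suite, and the field names are chosen for illustration. It also checks that the reported length agrees with the coordinates.

```python
import re
from typing import NamedTuple


class Locus(NamedTuple):
    index: int
    name: str
    start: int
    end: int
    length: int
    n_windows: int
    kind: str
    cod_score: float
    rna_score: float


# Pattern written for the layout of the locus line shown in Fig8.
LOCUS_RE = re.compile(
    r"^(\d+)-loci\s+(\S+)\s+(\d+)\s+(\d+)\s+\((\d+)\)\s+(\d+)\s+(\S+)"
    r"\s+(-?\d+(?:\.\d+)?)\s+(-?\d+(?:\.\d+)?)\s*$"
)


def parse_locus(line):
    """Parse one locus line; raise ValueError if it does not fit the layout."""
    match = LOCUS_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not a locus line: {line!r}")
    idx, name, start, end, length, n_win, kind, cod, rna = match.groups()
    return Locus(int(idx), name, int(start), int(end), int(length),
                 int(n_win), kind, float(cod), float(rna))


locus = parse_locus(
    "1-loci L43967_167180_175806-Mycoplasma 7457 7653 (197) 2 RNA -39.20 26.19"
)
# Coordinates are inclusive, so the length is end - start + 1.
assert locus.end - locus.start + 1 == locus.length
print(locus.name, locus.length, locus.kind, locus.rna_score)
```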

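Graph5: REPRESENTATION OF THE PUTATIVE ncRNAs PREDICTED BY QRNA
(Bar chart; y-axis: No. of ncRNAs, 0 to 60; x-axis: M.pen, M.myc, M.gal, M.pul, M.pne, M.gen; bar labels in the figure: 52, 40, 39, 12, 4 and 1, with the assignment of labels to species not recoverable from the extracted text)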

Graph6: GRAPH SHOWING THE LENGTH RANGE OF NON-CODING RNAs.
(Vertical bars represent the spread of scores and horizontal bar represent the average; y-axis: Length (nt), 0 to 350; x-axis: M.pen, M.myc, M.gal, M.pul, M.pne, M.gen)
Fig9: Analysis of the BLASTN alignments between the M.genitalium intergenic sequences and the intergenic sequence database of
M.gallisepticum, M.mycoides, M.penetrans, M.pneumoniae and M.pulmonis. Alignments have been grouped by percentage identity.
Each figure is a histogram of the number of alignments binned in each percentage identity interval. The green
histogram shows the total number of windows analyzed. The blue histogram shows the windows that score as RNA or coding
sequence above a cutoff of 0 bits.

a) Figure showing the number of sequences scored as coding regions in the windows analyzed.
b) Figure showing the number of sequences scored as RNAs in the windows analyzed.

Fig10: Analysis of the BLASTN alignments between the M.genitalium intergenic sequences and the intergenic sequence database of
M.gallisepticum, M.mycoides, M.penetrans, M.pneumoniae and M.pulmonis. Alignments have been grouped by percentage identity.
Each figure shows the scores of all the alignments as a function of the percentage identity of the alignments. “*” represents the
average of the RNA or coding sequence scores. The error bars correspond to one standard deviation.

a) Figure showing the average of the scores scored as coding regions in the windows analyzed.
b) Figure showing the average of the scores scored as RNAs in the windows analyzed.

Fig 9a: Figure showing the number of sequences scored as coding regions in the windows analyzed.
(Plot title: genblast.qrna.COD.id--sigmoidal LOD; x-axis: % ID, 100 to 50; y-axis: NUMBER OF WINDOWS // qrna 2.0.1; <len> = 303 +/- 198; ID=[100:0]; total_counts [361]; real COD-phase_counts_above: 0 [16//361])

Fig 9b: Figure showing the number of sequences scored as RNAs in the windows analyzed.
(Plot title: genblast.qrna.RNA.id--sigmoidal LOD; x-axis: % ID, 100 to 50; y-axis: NUMBER OF WINDOWS // qrna 2.0.1; <len> = 303 +/- 198; ID=[100:0]; total_counts [361]; real RNA-phase_counts_above: 0 [28//361])

Fig 10a: Figure showing the average of the scores scored as coding regions in the windows analyzed.
(Plot title: genblast.qrna.COD.id--sigmoidal LOD; x-axis: % ID, 100 to 50; y-axis: COD sigmoidal LODSCORE // qrna 2.0.1; <len> = 303 +/- 198; ID=[100:0]; ave COD lodscore above: 0 [16//361])

Fig 10b: Figure showing the average of the scores scored as RNAs in the windows analyzed.
(Plot title: genblast.qrna.RNA.id--sigmoidal LOD; x-axis: % ID, 100 to 50; y-axis: RNA sigmoidal LODSCORE // qrna 2.0.1; <len> = 303 +/- 198; ID=[100:0]; ave RNA lodscore above: 0 [28//361])

DISCUSSION

The intergenic regions in prokaryotes are small; however, they have long
been shown to play a significant role in these organisms. The percentage of intergenic regions
in the Mycoplasma genomes varied from 9.2% in M.genitalium (smallest) to 18% in M.mycoides
(largest). The number of intergenic regions ranged from 122 (lowest, in
M.genitalium) to 643 (highest, in M.mycoides). The average length of the intergenic regions ranged from
234 nucleotides (in M.penetrans) to 441 nucleotides (in M.genitalium). This indicates that the average
intergenic region is longer in a smaller genome than in a larger
genome. This could be due to the large number of short interspersed regions
(intergenic regions of only a few nucleotides) in M.penetrans, which lowers the
average length.

QRNA was run with the option of shuffling the sequences. This estimates the
false positives that could arise with the given sequence composition and length. Earlier,
similar ncRNA predictions in E.coli showed 85% true positives (Rivas and Eddy 2001). The
loci predicted in the present study are regions of conserved secondary structure that include
ncRNAs; they need not be individual ncRNAs alone.
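
One common way to build such a shuffled control is to permute the columns of a pairwise alignment. This keeps the base composition, the gap count and the percentage identity unchanged while destroying any positional pattern such as compensatory base pairing. The sketch below illustrates that idea only. Whether the shuffling option of qrna works in exactly this way is an assumption, and the two sequences are made-up examples.

```python
import random


def shuffle_columns(aligned_x, aligned_y, rng):
    """Permute alignment columns, keeping each X/Y pair together."""
    if len(aligned_x) != len(aligned_y):
        raise ValueError("aligned sequences must have the same length")
    columns = list(zip(aligned_x, aligned_y))
    rng.shuffle(columns)
    shuffled_x = "".join(x for x, _ in columns)
    shuffled_y = "".join(y for _, y in columns)
    return shuffled_x, shuffled_y


def percent_identity(aligned_x, aligned_y):
    """Percentage of columns holding the same non-gap residue."""
    same = sum(1 for x, y in zip(aligned_x, aligned_y) if x == y and x != "-")
    return 100.0 * same / len(aligned_x)


# Hypothetical toy alignment; "-" marks a gap.
x = "GGCTAGC-TAGCC"
y = "GGCAAGCTTAGCC"

rng = random.Random(2004)  # fixed seed so the run is repeatable
sx, sy = shuffle_columns(x, y, rng)
print(sx)
print(sy)
print(percent_identity(x, y) == percent_identity(sx, sy))
```

Windows of the shuffled alignment that still score as RNA give an empirical estimate of how often the scoring calls a structure by chance at that composition and identity.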

To assess the significance of the prediction, the predicted loci were searched for
similarity against the already known and biochemically characterized ncRNAs obtained from the
ncRNA database at http://biobases.ibch.poznan.pl/nc (updated till 2002).

The putative non-coding RNAs were searched against the known Mycoplasma ncRNA
data (only two ncRNAs have been characterized, in Mycoplasma capricolum). The results
indicated that one of the putative ncRNAs from the current study showed a good percentage of
identity (60%) with one of the two biochemically characterized Mycoplasma ncRNAs, viz. the
Mc_MCS4 ncRNA from Mycoplasma capricolum. Mc_MCS4 has already been
shown to have extensive similarity with the eukaryotic U6 snRNA as well. This strengthens the case
that our candidate ncRNA is a functional entity. Since the number of ncRNAs in the
biochemically determined database was small, the database was expanded to include other
prokaryotic ncRNAs.

The results indicated that a stretch of nucleotides in the putative ncRNA
showed significant similarity to the MicF RNAs from E.coli, S.typhi and K.pneumoniae. Since MicF
is known to regulate the expression of OmpF, and the stretch of nucleotides showing similarity
is conserved across all these species, the putative ncRNA stretch may be
a MicF counterpart in Mycoplasma. Another ncRNA showing significant similarity to the E.coli
OxyS RNA was also noticed. OxyS RNA is known to modulate gene expression in response to
hydrogen peroxide, a common chemical produced by mammals in response to infection. This
suggests that a defense mechanism may be operating in Mycoplasma.

The database was further expanded to include eukaryotic ncRNAs, comprising
the characterized miRNAs, development-regulating RNAs and protein-function-modifying
RNAs. The putative ncRNAs were found to have more than 60% identity with a number of
miRNAs from mouse, human, A.thaliana and C.elegans. Fig 11a shows a blastn hit with
71% identity against one of the putative ncRNAs from M.mycoides. This indicates that the
putative ncRNA has a conserved secondary structure similar to the well-characterized stem-loop
region of the C.briggsae miRNA. Fig 11b shows a blastn hit with an identity of 63% between a putative
ncRNA from the same M.mycoides and a characterized development-regulating RNA
of Homo sapiens.

>cbr-mir-268 MI0000541 Caenorhabditis briggsae miR-268 stem-loop
Length = 79

Minus Strand HSPs:

Score = 95 (20.3 bits), Expect = 0.22, P = 0.19

Identities = 33/46 (71%), Positives = 33/46 (71%), Strand = Minus / Plus

Query: 64 CAAAC-CTCTAAACTT-CTAAGAACTTCTTCTTCTTCTTCTTCTTC 21
          || || | || | ||  || |   |||||| || ||||||||||||
Sbjct: 34 CAGACACACTCA-CTGACTCACTGCTTCTTGTTTTTCTTCTTCTTC 78

Fig 11a: A blastn hit with 71% identity obtained for one of the putative ncRNAs from M.mycoides. This
indicates that the putative ncRNA has a conserved secondary structure similar to the well-characterized
stem-loop region of the C.briggsae miRNA.
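
The identity figure of this hit can be recomputed directly from the two aligned strings. The short Python sketch below counts the columns in which both sequences carry the same residue and divides by the total number of alignment columns, gaps included. The function name is illustrative, and rounding the percentage down is an assumption that happens to reproduce the reported 71%.

```python
def alignment_identity(aligned_query, aligned_subject, gap_chars="-."):
    """Return (identical columns, total columns) for a gapped pairwise alignment."""
    if len(aligned_query) != len(aligned_subject):
        raise ValueError("aligned sequences must have the same length")
    matches = sum(
        1
        for q, s in zip(aligned_query, aligned_subject)
        if q == s and q not in gap_chars
    )
    return matches, len(aligned_query)


# Aligned strings copied from the hit against cbr-mir-268 shown above.
query = "CAAAC-CTCTAAACTT-CTAAGAACTTCTTCTTCTTCTTCTTCTTC"
sbjct = "CAGACACACTCA-CTGACTCACTGCTTCTTGTTTTTCTTCTTCTTC"

matches, columns = alignment_identity(query, sbjct)
percent = 100 * matches // columns  # integer division rounds down
print(f"Identities = {matches}/{columns} ({percent}%)")
# prints: Identities = 33/46 (71%)
```

Note that 12 of the 33 identical columns fall in the terminal CTT repeat, which is worth keeping in mind when judging a short hit with an Expect value of 0.22.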

Significant hits were also found with development-regulating ncRNAs, including
those from Homo sapiens.

>Hs_NTT
Length = 17,572

Plus Strand HSPs:

Score = 116 (23.5 bits), Expect = 0.025, P = 0.024

Identities = 60/94 (63%), Positives = 60/94 (63%), Strand = Plus / Plus

Query:   11 TATTTAATATTTATAATTGCTATTTAGCATCTTAAAA-AAGA-CG-TCTTT-AAA-TATA 65
            ||  |||| |  || |||  |  | || |  |||| | |||  |  ||||  ||| ||||
Sbjct: 5336 TACATAAT-TAGATCATTTATTCTAAGTAAATTAAGAGAAGCTCTATCTTCCAAAATATA 5394

Query:   66 GATAGTTATACTAATTAGAAAATAGTTAAT-AAG 98
            ||||  | ||  ||| |||| |   ||||| |||
Sbjct: 5395 GATATCTCTAGCAAT-AGAAGAGTTTTAATTAAG 5427

Fig 11b: A sample hit with an identity of about 63% between a putative ncRNA from the same M.mycoides and
a characterized development-regulating RNA of Homo sapiens.

These results indicate that the ncRNAs are conserved across other kingdoms of
life. Since ncRNAs are generally conserved across a wide spectrum of organisms, they may
play varied roles in different cellular processes, though these roles are yet to be proved
biochemically (which remains a challenging task).

Because the very existence and expression profile of ncRNAs are not predictable, their
functional analysis remains challenging. Given a set of predicted ncRNAs, the task can be handled
with a reduced burden.

REFERENCES

1. Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped
BLAST and PSI-BLAST: a new generation of protein database search programs.
Nucleic Acids Research 1997, 25:3389
2. Argaman L, Hershberg R, Vogel J, Bejerano G, Wagner EG, Margalit H, Altuvia S:
Novel small RNA-encoding genes in the intergenic regions of Escherichia coli. Current
Biology 2001, 11:941
3. Badger JH, Olsen GJ: CRITICA: Coding Region Identification Tool Invoking
Comparative Analysis. Molecular Biology and Evolution 1999, 16:512
4. Caprara MG, Nilsen TW: RNA: versatility in form and function. Nature Structural
Biology 2000, 7:831
5. Rivas E, Eddy SR: Secondary structure alone is generally not statistically
significant for the detection of non-coding RNAs. Bioinformatics 2000, 16:583
6. Rivas E, Eddy SR: QRNA: A non-coding RNA genefinder using comparative
genome sequence analysis (ftp://ftp.genetics.wustl.edu/pub/eddy/software/qrna.tar.z) 2001
7. Rivas E, Klein RJ, Jones TA, Eddy SR: Computational
identification of non-coding RNAs in Escherichia coli by comparative genomics.
Current Biology 2001, 11:1369
8. Rivas E, Eddy SR: Non-coding RNA gene detection using comparative
sequence analysis. BMC Bioinformatics 2001, 2:8
9. Erdmann VA, Barciszewska MZ, Szymanski M, Hochberg A, de Groot N, Barciszewski J:
The non-coding RNAs as riboregulators. Nucleic Acids Research 2001, 29:189
10. Gish W: WU-BLAST 2.0 (http://blast.wustl.edu/) 2003
11. Huttenhofer A, Kiefmann M, Meier-Ewert S, O’Brien J, Lehrach H, Bachellerie JP,
Brosius J: RNomics: an experimental approach that identifies 201 candidates for
novel, small, non-messenger RNAs in mouse. EMBO Journal 2001, 20:2943
12. Lowe TM, Eddy SR: tRNAscan-SE: a program for improved detection of transfer
RNA genes in genomic sequence. Nucleic Acids Research 1997, 25:955
13. Lowe TM, Eddy SR: A computational screen for methylation guide snoRNAs in yeast.
Science 1999, 283:1168
14. Szymanski M, Barciszewski J: Beyond the proteome: non-coding regulatory
RNAs. Genome Biology 2002, 3:0005.1
15. Mattick JS: Non-coding RNAs: the architects of eukaryotic complexity. EMBO
Reports 2001, 2:986
16. Olivas WM, Muhlrad D, Parker R: Analysis of the yeast genome: identification of new
non-coding and small ORF-containing RNAs. Nucleic Acids Research 1997, 25:4619
17. Eddy SR: Non-coding RNA genes. Current Opinion in Genetics and Development
1999, 9:695
18. Eddy SR: Non-coding RNA genes and the modern RNA world. Nature Reviews
Genetics 2001, 2:919
19. Schattner P: Searching for RNA genes using base-composition statistics. Nucleic
Acids Research 2002, 30:2076
20. Wassarman KM, Zhang A, Storz G: Small RNAs in Escherichia coli. Trends in
Microbiology 1999, 7:37
21. Zwieb C, Wower I, Wower J: Comparative sequence analysis of tmRNA. Nucleic Acids
Research 1999, 27:2063
