Practical Lab Exercise for Intro Bioinf II (2)

Uploaded by

Tuana Özden

Name: Tuana Özden Student ID: 2101549

Introduction to Bioinformatics
Practical Assignment II

Exercise on Public Databases and Molecular Phylogeny

1. Introduction
2. Identification of the Unknown Sequences
3. Gene Expression
4. Genomics Information
5. Protein Domain Architecture
6. Protein Motif and Modification analysis (Post-translational modification)
7. Protein Secondary Structure
8. Tertiary or 3D structures
9. Protein-Protein Interactions
10. PubMed Bibliography Search
11. Molecular Phylogeny on IRG gene and proteins

1. Introduction
The growing volume of biological and sequencing data enables scientists to retrieve
large amounts of information relevant to their research from public repositories
(databases). In particular, the increasing number of high-throughput analysis
techniques expands these databases almost daily. Here, we demonstrate how to
acquire a set of properties that are useful in gene, genome, and protein analysis.

2. Identification of unknown sequence


Imagine that you are working in a human genetics laboratory, trying to decipher the
key genes involved in spinal muscular atrophy (SMA) in humans. As a result of your
analysis, you have found a gene fragment. Try to identify the DNA sequence in ZZZ.fa
(even or odd, depending on your student ID). This is a partial DNA fragment, and you
would like to identify the gene it originates from. Use the UCSC Genome Browser BLAT
search, available at “https://genome.ucsc.edu”.
Note: You can also use NCBI BLAST (https://blast.ncbi.nlm.nih.gov). However, for the
other steps you will need to use the UCSC Genome Browser.

You can open the file ZZZ.fa, then copy and paste the sequence into the BLAT search box.
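The copy-and-paste step can also be prepared in a script. The sketch below reads the first record of a FASTA file as one continuous sequence and builds a BLAT query URL. The example record is hypothetical, and the `hgBlat` parameter names (`userSeq`, `type`, `db`, `output`) are assumptions taken from the UCSC BLAT help pages, so check them against the current documentation before relying on them.

```python
from urllib.parse import urlencode


def read_fasta_sequence(fasta_text: str) -> str:
    """Return the sequence of the first FASTA record as one uppercase string."""
    sequence_lines = []
    seen_header = False
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if seen_header:
                break  # a second header starts the next record, so stop here
            seen_header = True
            continue
        sequence_lines.append(line)
    return "".join(sequence_lines).upper()


def blat_url(sequence: str, assembly: str = "hg38") -> str:
    """Build a BLAT query URL (parameter names are assumptions, see above)."""
    query = urlencode(
        {"userSeq": sequence, "type": "DNA", "db": assembly, "output": "json"}
    )
    return "https://genome.ucsc.edu/cgi-bin/hgBlat?" + query


# Hypothetical record standing in for the content of ZZZ.fa.
example = ">hypothetical_fragment\nacgtac\nGGTTAA\n"
sequence = read_fasta_sequence(example)
print(sequence)
print(blat_url(sequence))
```

Pasting the printed sequence into the BLAT search box gives the same result as the manual route described above.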

Look at the best hit for your DNA query. This should be your target DNA. What is it?
Take a closer look by clicking on the “browser” link. Zoom out in the browser. Can you
see the gene name? What is the gene name? Click on the gene name. Read the
description and the instructions. This is called the information page. On this page you
can find almost all the information about the gene and protein.

Please write down the gene name, chromosome, and strand: Survival of motor neuron 2,
centromeric (SMN2), chromosome 5, + strand

Description of the gene: This gene is part of a 500 kb inverted duplication on chromosome
5q13. This duplicated region contains at least four genes and repetitive elements which
make it prone to rearrangements and deletions. The repetitiveness and complexity of the
sequence have also caused difficulty in determining the organization of this genomic region.
The telomeric and centromeric copies of this gene are nearly identical and encode the same
protein. While mutations in the telomeric copy are associated with spinal muscular atrophy,
mutations in this gene, the centromeric copy, do not lead to disease. This gene may be a
modifier of disease caused by mutation in the telomeric copy. The critical sequence
difference between the two genes is a single nucleotide in exon 7, which is thought to be an
exon splice enhancer. Note that the nine exons of both the telomeric and centromeric copies
are designated historically as exon 1, 2a, 2b, and 3-8. It is thought that gene conversion
events may involve the two genes, leading to varying copy numbers of each gene. The full
length protein encoded by this gene localizes to both the cytoplasm and the nucleus. Within
the nucleus, the protein localizes to subnuclear bodies called gems which are found near
coiled bodies containing high concentrations of small ribonucleoproteins (snRNPs). This
protein forms heteromeric complexes with proteins such as SIP1 and GEMIN4, and also
interacts with several proteins known to be involved in the biogenesis of snRNPs, such as
hnRNP U protein and the small nucleolar RNA binding protein. Four transcript variants
encoding distinct isoforms have been described.
Accession Number (RefSeq) and Protein ID: NM_017411, Q16637

3. Gene Expression
Although you can see the expression data directly on the information page, it is always
better to check the expression profile on a dedicated web portal. Please use the GTEx
Portal, available at “https://www.gtexportal.org/home/”.

Enter the gene name or gene ID to start a search and see in which tissue the gene is
most highly expressed.

Where is the gene mostly transcribed?

Cervix-Ectocervix

What could be the reason for this high expression in a specific tissue or cell type?
One possible explanation is that the muscles around the cervix contract during
childbirth, so the gene may be highly expressed in the cervix to support this activity.

4. Genomics Information
Please now go to GenBank (“https://www.ncbi.nlm.nih.gov/genbank/”) and enter the
accession number from the RefSeq database (above). The RefSeq record opens in the
GenBank data format.

This is an example. Please use the information above to get the GenBank data
for your gene of interest.

Please click on FASTA to get the nucleotide data in FASTA format. Copy the sequence,
go back to the UCSC Genome Browser (same as above), and BLAT it against the human
genome (hg38). Click on “browser” to get to the information page. If you cannot
see the information, you can enable each track (Conservation, expression, etc.)
at the bottom of the page and refresh.

What is the total length of the transcript on the chromosome?


-27,927

How many exons?

-9
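The two answers above (length on the chromosome and exon count) both follow from the exon coordinates shown in the browser. As a minimal sketch, assuming the exons are available as 1-based inclusive (start, end) pairs, the span and the exon count can be computed as shown below. The coordinates are hypothetical and are not the real SMN2 exons.

```python
# Hypothetical exon coordinates (start, end), 1-based and inclusive.
# These are NOT the real SMN2 exons; they only illustrate the arithmetic.
exons = [(1000, 1150), (3000, 3120), (9000, 9400)]


def genomic_span(exon_list):
    """Length on the chromosome, from the first exon start to the last exon end."""
    start = min(s for s, _ in exon_list)
    end = max(e for _, e in exon_list)
    return end - start + 1


def spliced_length(exon_list):
    """Length of the mature transcript: the sum of the exon lengths only."""
    return sum(e - s + 1 for s, e in exon_list)


print("exons:", len(exons))
print("span on chromosome:", genomic_span(exons))
print("spliced length:", spliced_length(exons))
```

The span includes the introns, which is why it is much larger than the spliced transcript length.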

Are the exons conserved across vertebrates or invertebrates?


-Vertebrates

Also check the RepeatMasker content of the region at the bottom of the information page.
Does the region contain any repetitive elements? If yes, please click on each repetitive
element and write down its name.
1-Repeat L2
2-AluJb
3-AluYm1
4-Repeat (AAAT)n
5-AluJo
6-Repeat L2
7-FRAM
8-Repeat L2
9-AluSx1
10-Repeat L2
11-AluSx4
12-AluJb
13-Repeat L2
14-MSTA
15-Repeat L2
16-AluJb
17-Repeat L2
18-AluY
19-Repeat L2
20-MLT1F
21-L1ME1
22-AluJb
23-L1ME1
24-MER57B2
25-L1ME1
26-AluJo
27-L1ME1
28-LTR19C
29-L1ME1
30-MER4B
31-L1ME1
32-L1ME1
33-AluSq
34-L1ME1
35-L1ME1
36-AluSz
37-MIR
38-AluJb
39-AluSx
40-L1MB2
41-AluSp
42-L1MB2
43-AluJb
44-AluSg
45-AluJb
46-(TTGTT)n
47-AluSg
48-(TTTTG)n
49-(ATT)n
50-AluJo
51-AluJr
52-GA-rich
53-(TAA)n
54-AluSx
55-LTR104_Mam
56-AluJo
57-(AGTTTG)n
58-AluSg
59-L1ME3D
60-MIRb
61-AluY
62-AluSx1
63-(CCA)n
64-AluSq
65-(TG)n
66-FLAM-C
67-AluY
68-AluYc3
69-AluSz
70-L1MB8
71-AluYj4
72-L1MB8
73-AluY
74-AluSp
75-AluSp
76-AluSp
77-AluY
78-AluSp
79-AluJr
80-AluSx1
81-AluJr
82-L1MC5a
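A long list like the one above is easier to summarize as a tally per repeat family. The sketch below groups names by their prefix only, which is a rough assumption (for example, FRAM ends up in "other"); the class and family fields on each RepeatMasker detail page are the authoritative source. For brevity it uses only the first ten entries of the list.

```python
from collections import Counter


def repeat_family(name: str) -> str:
    """Assign a repeat name to a coarse family using its prefix (an assumption)."""
    if name.startswith("Alu"):
        return "Alu"
    if name.startswith("L1"):
        return "L1"
    if name.startswith("L2"):
        return "L2"
    if name.startswith("(") and name.endswith(")n"):
        return "simple repeat"
    return "other"


# First ten entries of the list above, without the numbering.
first_ten = ["L2", "AluJb", "AluYm1", "(AAAT)n", "AluJo",
             "L2", "FRAM", "L2", "AluSx1", "L2"]

tally = Counter(repeat_family(name) for name in first_ten)
for family, count in tally.most_common():
    print(family, count)
```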

5. Protein Domain Architecture


You can also check the domain architecture of your protein by using SMART
(“http://smart.embl-heidelberg.de”). There are two modes; they are similar but differ
in their underlying protein databases.

Use the Normal mode for now and select all the options.

Put your protein sequence into the search box and select human as the species
if asked.

How many different domains or motifs were discovered? What are the names of the
confidently predicted domains, repeats, motifs, and features?
(Please copy and paste the screenshot (whole browser including the date and time))

6. Protein Motif and Modification analysis (Post-translational
modification):

Since we have identified the known domains of our protein, we can now check whether
any predicted modifications or motifs can be found in it. Please go to the
MotifScan software at “https://myhits.isb-sib.ch/cgi-bin/motif_scan”, paste your
protein sequence into the search box, and select all the options.

Which kinds of modifications did you find?

(Please copy and paste the screenshot (whole browser including the date and time))

7. Protein Secondary Structure
It is possible to predict the secondary structure of a protein from its sequence. This is
especially important if the 3D structure of the protein is not available and you want to
use the protein for targeted functional analysis. For example, you should check the
secondary and, if possible, 3D structure before designing target peptides for
antibody generation against your target protein. PSIPRED is a program that predicts
secondary structure from the amino acid sequence. You can find it at
“http://bioinf.cs.ucl.ac.uk/psipred/”

Before processing, the program may report an invalid-character error because it does
not accept spaces or other non-sequence characters. You can easily clean your
protein or DNA sequence at
“http://www.cellbiol.com/scripts/cleaner/dna_protein_sequence_cleaner.php”
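The same cleaning can be done locally. The sketch below is a minimal stand-in for the online cleaner, not a reimplementation of it: it drops FASTA header lines and keeps only the 20 standard amino acid letters. Discarding ambiguity codes such as X, B, and Z is a simplifying assumption, and the example input is hypothetical.

```python
# The 20 standard amino acids. Ambiguity codes (X, B, Z, ...) are dropped,
# which is a simplifying assumption of this sketch.
VALID_AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def clean_protein(raw: str) -> str:
    """Remove headers, digits, whitespace, and any other non-residue characters."""
    lines = [ln for ln in raw.splitlines() if not ln.startswith(">")]
    joined = "".join(lines).upper()
    return "".join(ch for ch in joined if ch in VALID_AMINO_ACIDS)


# Hypothetical input with a header, position numbers, spaces, and a stop symbol.
raw = ">hypothetical protein\n  1 maMSSg GSG\n 11 ggvpe*\n"
print(clean_protein(raw))
```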

Predict the secondary structure of your protein. Which secondary structure elements
are predicted and which one is the most frequent in your protein ? Depending on the
options you have selected, this process may take a while. Go on with the next
exercises and come back to this point, when the calculation is finished.
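To answer which element is the most frequent, the per-residue prediction can be counted once the results are back. This sketch assumes the prediction is available as a three-state string with H for helix, E for strand, and C for coil; the string used here is hypothetical.

```python
from collections import Counter


def structure_composition(states: str) -> dict:
    """Return the fraction of residues assigned to each secondary structure state."""
    if not states:
        raise ValueError("empty prediction string")
    counts = Counter(states)
    total = sum(counts.values())
    return {state: counts[state] / total for state in counts}


# Hypothetical three-state prediction (H = helix, E = strand, C = coil).
prediction = "CCCHHHHHHHHCCEEEEECC"
composition = structure_composition(prediction)
most_frequent = max(composition, key=composition.get)

for state, fraction in sorted(composition.items()):
    print(state, f"{fraction:.0%}")
print("most frequent:", most_frequent)
```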
With PredictProtein (https://predictprotein.org), you can try to determine many
properties of a protein, including secondary structure, tertiary structure, topology,
and DNA-binding domains.

(Please copy and paste the screenshot (whole browser including the date and time))

8. Tertiary, or 3D structures
For many proteins the three-dimensional structure has been determined experimentally,
e.g. by X-ray crystallography or NMR spectroscopy. The Protein Data Bank (PDB) stores
this information. Go to “https://www.rcsb.org” and look for your protein. You can type
the name of the protein (above) or its PDB ID in the search box. Please remember that
not all protein structures have been determined. Can you find your protein there?

(Please copy and paste the screenshot (whole browser including the date and time))

9. Protein-Protein Interactions
Proteins interact with other proteins in the living cell to perform various tasks. The
BioGRID and GeneMANIA databases collect described interactions between proteins
from published studies and provide them with annotations. Please go to GeneMANIA
(“https://genemania.org”) and search for your protein. You can also enter a list of
proteins to see how many of them interact or are co-expressed.

(Please copy and paste the screenshot (whole browser including the date and time))

10. PubMed Bibliography Search
The information we are using is based on publications; all these databases use
publications as their source. Therefore, it is always best to search the original
publications to answer questions such as “Who discovered this gene?” or
“What other functions might it have that are not indicated in the databases?”

Please go to PubMed (https://pubmed.ncbi.nlm.nih.gov), type the name of the gene
that you identified above, and find the publications and labs that are actively
doing research on that gene.
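The same search can be expressed as a URL for the NCBI E-utilities, which is convenient when the query has to be repeated. The sketch below only builds the URL and does not send it. The `[Title/Abstract]` field tag and the `sort=pub_date` value are assumptions based on the E-utilities documentation and should be checked there before use.

```python
from urllib.parse import urlencode

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def pubmed_search_url(term: str, max_results: int = 3) -> str:
    """Build an ESearch URL for PubMed (parameter values are assumptions)."""
    params = {
        "db": "pubmed",
        "term": term,
        "retmax": max_results,
        "retmode": "json",
        "sort": "pub_date",  # assumed value for newest-first ordering
    }
    return ESEARCH + "?" + urlencode(params)


print(pubmed_search_url("SMN2[Title/Abstract]"))
```

Opening the printed URL in a browser should return a list of PubMed IDs, which can then be looked up on the PubMed website.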

Please write down the titles and authors of the most recent publications about the
gene that you have found.
1. Comley LH, Kline RA, Thomson AK, Woschitz V, Landeros EV, Osman EY,
Lorson CL, Murray LM. Motor Unit Recovery Following Smn Restoration in
Mouse Models of Spinal Muscular Atrophy. Hum Mol Genet. 2022 May
12:ddac097. doi: 10.1093/hmg/ddac097. Epub ahead of print. PMID:
35551393.
2. Du LL, Sun JJ, Chen ZH, Shao YX, Wu LC. NOVA1 promotes SMN2 exon 7
splicing by binding the UCAC motif and increases SMN protein expression.
Neural Regen Res. 2022 Nov;17(11):2530-2536. doi: 10.4103/1673-5374.339005.
PMID: 35535907.
3. Nagy ZF, Pál M, Salamon A, Kafui Esi Zodanu G, Füstös D, Klivényi P,
Széll M. Re-analysis of the Hungarian amyotrophic lateral sclerosis
population and evaluation of novel ALS genetic risk variants. Neurobiol
Aging. 2022 Apr 9;116:1-11. doi: 10.1016/j.neurobiolaging.2022.04.002.
Epub ahead of print. PMID: 35525134.

11. Please download the protein sequences for IRGM from Itslearning and
perform a multiple alignment (ClustalW). Please remember that you need
to edit the sequence titles (remove the extra information) before doing the alignment.
(Please copy and paste the screenshot (whole browser including the date and time))

You can also directly download the alignment file.
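Editing the sequence titles by hand is error-prone when there are many sequences. As a minimal sketch, assuming that the first whitespace-delimited word of each title is a usable identifier, the titles can be shortened as shown below. The example records are hypothetical.

```python
def simplify_fasta_headers(fasta_text: str) -> str:
    """Keep only the first word of every FASTA title line."""
    out = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            tokens = line[1:].split()
            # Fall back to a placeholder name if the title is empty.
            out.append(">" + (tokens[0] if tokens else "unnamed"))
        elif line.strip():
            out.append(line.strip())
    return "\n".join(out) + "\n"


# Hypothetical records; real titles usually carry accession and species details.
records = (
    ">seq1_human hypothetical description [Homo sapiens]\n"
    "MEAMNV\n"
    ">seq2_mouse hypothetical description [Mus musculus]\n"
    "MEEAVE\n"
)
print(simplify_fasta_headers(records), end="")
```

Short, unique titles also keep the labels readable in the alignment and in any tree built from it.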

12. Please download the protein and nucleotide sequences for IRGM from
Itslearning and perform the phylogenetic analysis for both the nucleotide and
the protein sequences using MEGA11. Copy and paste the screenshots of
both phylogenetic trees below (the date and time should be visible).
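Distance-based tree building starts from pairwise distances between aligned sequences. The sketch below computes the simplest such measure, the p-distance (the proportion of differing sites, skipping positions with a gap). It is only an illustration of the idea; it is not a statement of which model or method MEGA11 applies, and the aligned sequences are hypothetical.

```python
from itertools import combinations


def p_distance(a: str, b: str) -> float:
    """Proportion of differing sites between two aligned sequences, ignoring gaps."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    differences = 0
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            continue  # skip positions where either sequence has a gap
        compared += 1
        if x != y:
            differences += 1
    if compared == 0:
        raise ValueError("no comparable positions")
    return differences / compared


# Hypothetical aligned nucleotide sequences.
alignment = {
    "human": "ATGGCCA-TG",
    "mouse": "ATGGTCAATG",
    "zebrafish": "ATGACTA-TA",
}

for name_a, name_b in combinations(alignment, 2):
    distance = p_distance(alignment[name_a], alignment[name_b])
    print(f"{name_a} vs {name_b}: {distance:.3f}")
```

Smaller distances correspond to sequences that are expected to cluster together in the resulting tree.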

Protein Sequence

Nucleotide Sequence

13. Conclusion and Discussion about the IRGM gene
Please write your comments about the IRGM gene: what its function is
and whether it is involved in Crohn's disease, with proper citations
from a PubMed literature search. (Please also download the file called
“using a reference” to learn how to cite articles within the text.)

Please list the 3 most recent publications about the IRGM gene in humans.
1-Wang LL, Jin XH, Cai MY, Li HG, Chen JW, Wang FW, Wang CY, Hu
WW, Liu F, Xie D. Corrigendum to "AGBL2 promotes cancer cell
growth through IRGM-regulated autophagy and enhanced Aurora
A activity in hepatocellular carcinoma" [Canc. Lett. 414 (2018) 71-80].
Cancer Lett. 2022 May 4:215700. doi:
10.1016/j.canlet.2022.215700. Epub ahead of print. Erratum for:
Cancer Lett. 2018 Feb 1;414:71-80. PMID: 35525812.

2-Olivieri G, Ceccarelli F, Perricone C, Ciccacci C, Pirone C,
Natalucci F, Spinelli FR, Alessandri C, Borgiani P, Conti F. Fever in
systemic lupus erythematosus: associated clinical features and
genetic factors. Clin Exp Rheumatol. 2022 Mar 23. doi:
10.55563/clinexprheumatol/7x37pf. Epub ahead of print. PMID:
35349414.
3-Liang C, Fan J, Liang C, Guo J. Identification and Validation of a
Pyroptosis-Related Prognostic Model for Gastric Cancer. Front
Genet. 2022 Feb 25;12:699503. doi: 10.3389/fgene.2021.699503.
PMID: 35280928; PMCID: PMC8916103.

Jena et al. (2020) state that the pathogenesis of autoimmune disorders is
intimately linked to the activation of the type I interferon response.
Immunity Related GTPase M (IRGM) deficiency has also been linked to
autoimmune disorders; however, the mechanism of action is uncertain. One
of the disorders associated with IRGM is Crohn's disease (CD), a
gastrointestinal inflammatory disease found in the human population that
can cause lesions from the mouth to the anus (Feuerstein and Cheifetz, 2017).

References

1-Jena KK, Mehto S, Nath P, Chauhan NR, Sahu R, Dhar K, Das SK, Kolapalli
SP, Murmu KC, Jain A, Krishna S, Sahoo BS, Chattopadhyay S, Rusten TE,
Prasad P, Chauhan S, Chauhan S. Autoimmunity gene IRGM suppresses
cGAS-STING and RIG-I-MAVS signaling to control interferon response.
EMBO Rep. 2020 Sep 3;21(9):e50051. doi: 10.15252/embr.202050051.
Epub 2020 Jul 27. PMID: 32715615; PMCID: PMC7507369.
2-Feuerstein JD, Cheifetz AS. Crohn Disease: Epidemiology, Diagnosis, and
Management. Mayo Clin Proc. 2017 Jul;92(7):1088-1103. doi:
10.1016/j.mayocp.2017.04.010. Epub 2017 Jun 7. PMID: 28601423.
