Introduction to Python for Bioinformatics

An introduction to: Python
José Héctor Gálvez, M.Sc.
Image source: www.katie-scott.com

Scripting languages
• Scripting languages are programming
languages that are interpreted
instead of compiled.
• They are generally considered high-level and
are usually easier to read and learn.
• Examples:
• Bash (shell scripting)
• R (statistical scripting)
• Perl (general-purpose scripting)
• Python (general-purpose scripting)

• Python is a popular, open-source, multi-platform,
general-purpose scripting language.
• Many extensions and libraries for scientific
computing.
• Current supported versions: 2.7 and 3.5.
Install Python on your computer!
• Official Python distribution:
https://www.python.org/downloads/
• Jupyter (formerly iPython):
https://www.continuum.io/downloads

Learning Goals
1. Understand strings to print and manipulate text
2. Use the open() function to read and write files
3. Understand lists and use loops to go through them
4. Create your own functions
5. Use conditional tests to add more functionality to
scripts

Leaky pipes - A formatting problem
Blergh… All my files are messed up!
They are in the wrong format!
The program I want to use won’t open them!
⎯ Frustrated bioinformatician
• We often require code to parse the output of
one program and produce another file as input
for a specific software.
Parse:
To analyze a text to extract useful information from it.
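As an illustration of the kind of reformatting problem described above, here is a minimal sketch. It assumes a hypothetical program whose output has one tab-separated name and sequence per line, and a downstream tool that expects FASTA. The input format and the function name table_to_fasta are made up for this example, and it uses loops, functions, and conditions that are only introduced later in these slides.

```python
# Hypothetical output of some program: one "name<TAB>sequence" record per line.
raw_output = "seq1\tATGTGA\nseq2\tATGCATATTTGA\n"

def table_to_fasta(text):
    """Convert tab-separated name/sequence lines into FASTA-formatted text."""
    fasta_lines = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        name, seq = line.split("\t")
        fasta_lines.append(">" + name)   # FASTA header line
        fasta_lines.append(seq)          # sequence line
    return "\n".join(fasta_lines) + "\n"

fasta_text = table_to_fasta(raw_output)
print(fasta_text)
```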

Objective 1:
Text in
Python

Handling text in Python
Printing text to the terminal:
>>> print("Hello world")
Hello world
• Python interpreter prompt: >>>
• Input: print("Hello world")
• Function: print()
• Argument: "Hello world"
• Output: Hello world

What happens if we use single quotes?
>>> print('Hello world')
Hello world
We get the same result!!!
• In Python, single quotes '' and double
quotes "" are interchangeable.
But, don’t mix them!
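A short sketch of why having two quote styles is handy: one style can hold the other inside a string. The variable names and the example text are illustrative only.

```python
single = 'Hello world'
double = "Hello world"
print(single == double)   # both spellings create the same string

# One quote style can be used to hold the other inside a string.
label = "5' UTR"
quote = 'The "start codon" is ATG'
print(label)
print(quote)
```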

What happens if we mix quotes?
>>> print('Hello world")
  File "<stdin>", line 1
    print('Hello world")
                       ^
SyntaxError: EOL while scanning single-quoted string
Whoops!

Error messages give us important clues:
• File and line containing error: File "<stdin>", line 1
• Best guess as to where error is found: the ^ marker
• Error type and explanation: SyntaxError: EOL while scanning single-quoted string

We can save strings as variables:
>>> #My first variable!
>>> dna_seq1 = "ATGTGA"
• A line starting with # is a comment.
• We use the = symbol to assign a variable.
• We can re-assign variables as many times
as we want.
That’s why they’re called variables!

>>> dna_seq1 = "ATGTAA"
>>> print(dna_seq1)
ATGTAA
• Once assigned, we can use the
variable name instead of its content.
• Variable names can have letters, numbers,
and underscores.
• They can’t start with numbers.
• They are case-sensitive.
Name your variables carefully!

Any value between quotes is called a string:
>>> type(dna_seq1)
<type 'str'>
• Strings ('str') are a type of object.
• Other types include integers ('int'),
floats ('float'), lists ('list'), etc.
• Strings are mainly used to manipulate text
within Python.
Understanding how to use strings is crucial
for bioinformatics!

String operations
Concatenation
>>> start_codon = 'ATG'
>>> stop_codon = 'TGA'
>>> coding_seq = 'CATATT'
>>> full_seq = start_codon + coding_seq + stop_codon
>>> print(full_seq)
ATGCATATTTGA
• To combine strings, we use the + operator

String operations
String length
>>> len(full_seq)
12
>>> #In a script, len() on its own prints nothing,
>>> #so we store its return value in a variable
>>> full_length = len(full_seq)
>>> print(full_length)
12
>>> type(full_length)
<type 'int'>
• To find the length of a string we can use
the len() function.
• Its return value is an integer (number).

String operations
Turning objects into strings
>>> print("The length of our seq is "
... + full_length)
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
TypeError: cannot concatenate 'str' and 'int' objects
• It is not possible to concatenate objects of
different types.

String operations
Turning objects into strings
>>> print("The length of our seq is "
... + str(full_length))
The length of our seq is 12
• The str() function turns any object into a
string.
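As a small sketch of the conversion step above, the following compares explicit str() conversion with the built-in str.format() method, which does the conversion itself. It is written as a script rather than an interpreter session, and the variable names message_concat and message_format are illustrative.

```python
full_seq = "ATGCATATTTGA"
full_length = len(full_seq)

# Explicit conversion with str(), then concatenation.
message_concat = "The length of our seq is " + str(full_length)

# str.format() converts the value for us and inserts it at the {} placeholder.
message_format = "The length of our seq is {}".format(full_length)

print(message_concat)
print(message_format)
```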

String operations
Substrings
>>> #Let’s print only the coding sequence
>>> print(full_seq[3:9])
CATATT
• To understand how we did it we need to
know how strings are numbered:
A T G C A T A T T T G A
0 1 2 3 4 5 6 7 8 9 10 11
Python always starts counting from zero!!!

String operations
Substrings
• How to create a substring:
>>> print(full_seq[3:9])
CATATT
A T G |C A T A T T |T G A
0 1 2 [3 4 5 6 7 8 ]9 10 11
The first number is included (start inclusive).
The second number is excluded (end exclusive).
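The slicing rules above (start inclusive, end exclusive) can be sketched as follows. Leaving out the start or end index, and using a negative index, are standard Python slicing features not shown on the slides. The final loop, which uses a for loop ahead of its introduction, steps through the sequence three letters at a time.

```python
full_seq = "ATGCATATTTGA"

start_codon = full_seq[:3]    # indices 0, 1, 2 (start defaults to 0)
coding_seq = full_seq[3:9]    # indices 3 to 8
stop_codon = full_seq[-3:]    # last three characters

print(start_codon, coding_seq, stop_codon)

# Step through the sequence three letters at a time to list its codons.
codons = []
for i in range(0, len(full_seq), 3):
    codons.append(full_seq[i:i + 3])
print(codons)
```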

String operations
Substrings
>>> #We can also print just one letter
>>> print(full_seq[11])
A
• Each character in the string can be accessed
using its position (index) number:
0 1 2 3 4 5 6 7 8 9 10 11

String operations
Methods
>>> lower_seq = full_seq.lower()
>>> print(lower_seq)
atgcatatttga
• A method is similar to a function, but it is
associated with a specific object type.
• We call them after a variable of the right type,
using a ‘.’ (period) to separate them.
• In this case, the method .lower() is called
on strings to convert all uppercase
characters into lowercase.
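A few more string methods work the same way as .lower(). This sketch uses .upper(), .count(), and .replace(), all standard Python string methods. The variable names are illustrative, and each method returns a new value rather than changing the original string.

```python
dna = "atgcatatttga"

upper_dna = dna.upper()              # new string, all uppercase
a_count = upper_dna.count("A")       # how many times "A" occurs
rna = upper_dna.replace("T", "U")    # new string with every T swapped for U

print(upper_dna, a_count, rna)
print(dna)                           # the original string is unchanged
```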

Objective 2:
Files in Python

Opening files
The open() function is used to open files:
>>> my_file = open("BV164695.1.seq","r")
>>> print(my_file)
<open file 'BV164695.1.seq', mode 'r' at 0x109de84b0>
• It returns a file object.
• This object is different from other types of
objects.
• We rarely interact with it directly.
• We mostly interact with it through
methods.

Opening files
The open() function is used to open files:
• The first argument is the path to the file.
• This path should be relative to our working
directory.*
• The second argument is the mode in which
we are opening the file.
• We separate arguments using a comma.
Don’t forget the quotes!

Opening files
Files can be opened in three modes:
• Read ( "r" ): Permits access to the content
of a file, but can’t modify it (default).
• Write ( "w" ): Enables the user to overwrite the
contents of a file.
• Append ( "a" ): Enables the user to add
content to a file, without erasing previous
content.
Don’t confuse write and append,
you could lose a lot of data!

Opening files
The .read() method extracts file content:
>>> file_content = my_file.read()
>>> print(type(my_file),
... type(file_content))
(<type 'file'>, <type 'str'>)
• Returns the full contents of a file as a string.
• Takes no arguments.
Remember: The .read() method can
only be used on file objects in read mode!

Opening files
The .write() method writes content into a file:
>>> out_file = open("test_out.txt","w")
>>> hello_world = "Hello world!"
>>> out_file.write(hello_world)
• Writes content into file objects in “w” or “a”
modes.
• Argument must be a string.
The .write() method can
only be used on file objects in write or append mode!
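To make the difference between write and append mode concrete, here is a minimal sketch. It works on a throwaway file in a temporary directory (the file name modes_demo.txt is made up) and uses the .close() method, covered on the next slide, so that each change is saved before the file is reopened.

```python
import os
import tempfile

# A throwaway file in a temporary directory (hypothetical name).
path = os.path.join(tempfile.mkdtemp(), "modes_demo.txt")

out_file = open(path, "w")      # "w" creates the file, or empties an existing one
out_file.write("first line\n")
out_file.close()

out_file = open(path, "a")      # "a" keeps existing content and adds to the end
out_file.write("second line\n")
out_file.close()

in_file = open(path, "r")
after_append = in_file.read()
in_file.close()

out_file = open(path, "w")      # opening in "w" again discards both lines
out_file.write("only line\n")
out_file.close()

in_file = open(path, "r")
after_overwrite = in_file.read()
in_file.close()

print(after_append)
print(after_overwrite)
```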

Closing files
The .close() method flushes and closes a file:
>>> print(out_file)
<open file 'test_out.txt', mode 'w' at 0x103f53540>
>>> out_file.close()
>>> print(out_file)
<closed file 'test_out.txt', mode 'w' at 0x103f53540>
• Flushing files saves the changes and lets
other programs use them.
It is always good practice to close files after using them!
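One common way to make sure files are always closed is Python's with statement, which is not covered in these slides. It closes the file automatically when the indented block ends. The sketch below writes to a temporary location so that it does not touch any real file.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "test_out.txt")  # temporary location

# The file is closed automatically when the indented block ends.
with open(path, "w") as out_file:
    out_file.write("Hello world!")

print(out_file.closed)   # the file object reports that it is closed

with open(path, "r") as in_file:
    file_content = in_file.read()

print(file_content)
```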

Objective 3:
Lists and loops

Using lists
A list is an object containing several elements:
>>> nucleic_ac = ["DNA","mRNA","tRNA"]
>>> print(type(nucleic_ac))
<type 'list'>
• A list is created using brackets [ ].
• The elements are separated by commas.
• List elements can be of any object type.

Using lists
It is possible to mix object types within lists:
>>> number_one = ["one", 1, 1.0]
>>> numbers_123 = [["one", 1, 1.0],
... ["two", 2, 2.0],["three", 3, 3.0]]
We can even make lists of lists!

Using lists
Elements are called using their index:
>>> number_one = ["one", 1, 1.0]
>>> numbers_123 = [["one", 1, 1.0],
... ["two", 2, 2.0],["three", 3, 3.0]]
>>> print(number_one[1],
... type(number_one[1]))
(1, <type 'int'>)
>>> print(number_one[2],
... type(number_one[2]))
(1.0, <type 'float'>)
>>> print(numbers_123[0],
... type(numbers_123[0]))
(['one', 1, 1.0], <type 'list'>)
Don’t forget to start counting from zero!

Using lists
Elements can be substituted using their index:
>>> numbers_123 = [["one", 1, 1.0],
... ["two", 2, 2.0],["three", 3, 3.0]]
>>> numbers_123[0] = ["zero", 0, 0.0]
>>> print(numbers_123)
[['zero', 0, 0.0], ['two', 2, 2.0],
['three', 3, 3.0]]

Using lists
The .append() method adds elements to lists:
>>> number_one = ["one", 1, 1.0]
>>> number_one.append("I")
>>> print(number_one)
['one', 1, 1.0, 'I']
• Takes only one argument.
• Doesn’t return anything, it modifies the
actual list.
• It only adds an element to the end of a list.
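The points above can be checked with a short sketch. Because .append() changes the list in place and returns nothing, assigning its result to a variable gives None. The .extend() method, a standard list method not covered on the slides, adds several elements at once. The extra list values are illustrative.

```python
number_one = ["one", 1, 1.0]

result = number_one.append("I")   # the list is changed in place
print(result)                     # append() itself returns None
print(number_one)

# To add several elements at once, extend() takes a list of them.
number_one.extend(["uno", "eins"])
print(len(number_one))
```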

Using lists
Sublists can also be created using indices:
>>> number_one = ["one", 1, 1.0,"I"]
>>> number_1 = number_one[1:3]
>>> print(number_1, type(number_1))
([1, 1.0], <type 'list'>)
• Works like string slicing (first index inclusive,
last index exclusive).

Using loops
Loops make it easier to act on list elements:
>>> nucleic_ac = ["DNA","mRNA","tRNA"]
>>> for string in nucleic_ac:
...     print(string + " is a nucleic acid")
...
DNA is a nucleic acid
mRNA is a nucleic acid
tRNA is a nucleic acid

Using loops
Loops have the following structure:
>>> for acid in nucleic_ac:
...     print(acid + " is a nucleic acid")
...
• Loop statement:
for ____ in ____ :
Don’t forget the colon!
• Element name
• Same rules as variable naming.
• It takes the value of each element in turn, and
keeps the last value after the loop ends.
Choose appropriate names to avoid confusion.

Using loops
• Iterable object
• The loop elements will depend on the
type of object.

Using loops
Some basic iterable object types:
Object type              Iterable element
List                     List element
String                   Individual characters
Open file in 'r' mode    Individual line in the file
Dictionary               Keys (insertion order in Python 3.7+)
Set                      Set element (in arbitrary order)
The variety of iterable objects makes loops a
very powerful tool in Python!
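A minimal sketch of looping over three of the iterable types listed above. Dictionaries are not otherwise covered in these slides, and the codon_table below is a tiny illustrative subset, not a complete genetic code.

```python
# List: one element per pass.
for acid in ["DNA", "mRNA", "tRNA"]:
    print(acid)

# String: one character per pass.
bases = []
for base in "ATG":
    bases.append(base)
print(bases)

# Dictionary: one key per pass; the value is looked up with the key.
codon_table = {"ATG": "Met", "TGA": "Stop"}
keys_seen = []
for codon in codon_table:
    keys_seen.append(codon)
    print(codon + " -> " + codon_table[codon])
```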

Using loops
• The body of the loop is defined by indentation
(a tab or, more commonly, four spaces).
• It can be as long as necessary, but all lines
must be indented the same way.

Using loops
>>> for acid in nucleic_ac:
...     print("I like " + acid)
...
I like DNA
I like mRNA
I like tRNA

Objective 4:
Functions

Creating functions
It is possible to create our own functions:
>>> def gc_content(seq):
...     length = len(seq)
...     G_content = seq.count("G")
...     C_content = seq.count("C")
...     GC_content = (G_content + C_content) / float(length)
...     return GC_content
...

Creating functions
Function definitions have this structure:
• The definition statement
def ___________:
• The function name
• Same naming rules as variables
• The argument(s) of our function
• Same naming rules as variables
• This part is optional
• The body of the function is defined by indentation
• It can be as long as necessary, but all lines
must be indented the same way.
• The return statement (optional)
• Can return one or more objects
• Marks the end of a function

Calling functions
Once defined, we can call a function:
>>> test_seq = "ACTGATCGATCG"
>>> gc_test = gc_content(test_seq)
>>> print(gc_test, type(gc_test))
(0.5, <type 'float'>)
>>> print(GC_content)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'GC_content' is not defined
Variables within the function are not defined outside
of that function!
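The scope rule above can be sketched as a script. The function here is a condensed variant of gc_content with a single local variable, gc_count (a name chosen for this example). The try/except block, which is not covered in these slides, catches the NameError so the script keeps running.

```python
def gc_content(seq):
    length = len(seq)
    gc_count = seq.count("G") + seq.count("C")   # local variable
    return gc_count / float(length)

gc_test = gc_content("ACTGATCGATCG")
print(gc_test)

# gc_count only existed while the function was running.
try:
    print(gc_count)
except NameError as error:
    scope_error = str(error)
    print("NameError: " + scope_error)
```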

Other function options
Let’s improve our function:
>>> test_seq = "ACTGATCGATCG"
>>> print(gc_content(test_seq))
0.5
>>> test_seq = "ACTGATCGATCGC"
>>> print(gc_content(test_seq))
0.538461538462
I don’t want that many numbers!

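The round() function lets us round the result:
>>> def gc_content(seq):
...     length = len(seq)
...     G_content = seq.count("G")
...     C_content = seq.count("C")
...     GC_content = (G_content + C_content) / float(length)
...     return round(GC_content,2)
...
>>> print(gc_content(test_seq))
0.54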

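A second argument gives more flexibility:
>>> def gc_content(seq,sig_fig):
...     length = len(seq)
...     G_content = seq.count("G")
...     C_content = seq.count("C")
...     GC_content = (G_content + C_content) / float(length)
...     return round(GC_content,sig_fig)
...
>>> print(gc_content(test_seq,2))
0.54
>>> print(gc_content(test_seq,3))
0.538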

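We can call a function with keyword arguments:
>>> def gc_content(seq,sig_fig):
...     length = len(seq)
...     G_content = seq.count("G")
...     C_content = seq.count("C")
...     GC_content = (G_content + C_content) / float(length)
...     return round(GC_content,sig_fig)
...
>>> print(gc_content(seq='ACGC',sig_fig=1))
0.8
>>> print(gc_content(sig_fig=1,seq='ACGC'))
0.8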

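We can give our functions default values:
>>> def gc_content(seq,sig_fig=2):
...     length = len(seq)
...     G_content = seq.count("G")
...     C_content = seq.count("C")
...     GC_content = (G_content + C_content) / float(length)
...     return round(GC_content,sig_fig)
...
>>> print(gc_content(test_seq))
0.54
>>> print(gc_content(test_seq,sig_fig=3))
0.538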
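Putting the pieces together, a script version of the function might look like the sketch below. The docstring, the .upper() call that accepts lowercase input, and the merged gc_count variable are additions for this example rather than part of the slides.

```python
def gc_content(seq, sig_fig=2):
    """Return the GC fraction of seq, rounded to sig_fig decimal places."""
    seq = seq.upper()                       # accept lowercase input too
    gc_count = seq.count("G") + seq.count("C")
    return round(gc_count / float(len(seq)), sig_fig)

test_seq = "ACTGATCGATCGC"
default_result = gc_content(test_seq)               # uses sig_fig=2
keyword_result = gc_content(test_seq, sig_fig=3)    # keyword argument overrides it
lower_result = gc_content("acgt", 1)                # positional second argument

print(default_result, keyword_result, lower_result)
```

Arguments with default values must come after those without them in the def line.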

Objective 5:
Conditional
tests

Conditions
Conditions are pieces of code that can only
produce one of two answers:
- True
- False
When required, Python tests (or evaluates) the
condition and produces the result.
>>> print( 3 == 5 )
False
>>> print( 3 < 5 )
True
>>> print( 3 >= 5 )
False
These are not strings!

Conditions
The following symbols are used to construct
conditions:
Symbol    Meaning
==        Equals
> <       Greater than, less than
>= <=     Greater than or equal to, less than or equal to
!=        Not equal
in        Is a value in a list
is        Are the same object*
Remember to use two equals signs
when writing conditions!
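The last two rows of the table are the least familiar, so here is a small sketch of in and is. That in also finds substrings inside strings is standard Python behavior not mentioned on the slide. The list names are illustrative.

```python
bases = ["A", "C", "G", "T"]

u_in_bases = "U" in bases          # membership test on a list
at_in_seq = "AT" in "GATTACA"      # 'in' also finds substrings within strings

list_a = ["A", "C"]
list_b = ["A", "C"]
list_c = list_a

same_contents = list_a == list_b   # equal contents
same_object = list_a is list_b     # but two separate objects
same_name = list_a is list_c       # two names for the same object

print(u_in_bases, at_in_seq, same_contents, same_object, same_name)
```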

Conditions
Let’s evaluate more conditions:
>>> print( len("ATGC") > 5 )
False
>>> print( "ATGCGATT".count("A") != 0 )
True
>>> print( "U" in ["A","C","G","T"] )
False
>>> print( "A" in ["A","C","G","T"] )
True
>>> print( len(["A","C","G","T"]) == 4 )
True
>>> print( "ATGCGATT".isupper())
True
>>> print( "ATGCGATT".islower())
False

Conditional tests
An if statement only executes if the condition
evaluates as True:
>>> test_seq = 'ATTGCATGGTATCTACGG'
>>> if len(test_seq) < 10:
...     print(test_seq)
...
>>>
>>> test_seq = 'ATTGCATGG'
>>> if len(test_seq) < 10:
...     print(test_seq)
...
ATTGCATGG
• If statements have similar structure to loops

Conditional tests
An if statement only executes if the condition
evaluates as True:
>>> seq_list = ['ATTGCATGGTATCTACGG',
... 'ATCGCA','ATTTTCA','ATTCATCGAT']
>>> for seq in seq_list:
...     if len(seq) < 10:
...         print(seq)
...
ATCGCA
ATTTTCA
When nesting commands,
be careful with the tabs !

Conditional tests
An else statement only executes when the if
statement(s) preceding it evaluate as False:
>>> seq_list = ['ATTGCATGGTATCTACGG',
... 'ATCGCA','ATTTTCA','ATTCATCGAT']
>>> for seq in seq_list:
...     if len(seq) < 10:
...         print(seq)
...     else:
...         print(str(len(seq)) + ' base seq')
...
18 base seq
ATCGCA
ATTTTCA
10 base seq
Remember: else statements
never have conditions!

Conditional tests
To create if/else blocks with multiple
conditions, we use elif statements:
>>> for seq in seq_list:
...     if len(seq) < 10:
...         print(seq)
...     elif len(seq) == 10:
...         print(seq[:5] + '...')
...     else:
...         print(str(len(seq)) + ' base seq')
...
18 base seq
ATCGCA
ATTTTCA
ATTCA...
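The if/elif/else block above can also be wrapped in a function, which makes it easier to reuse. This is a sketch under that assumption; the function name describe and the list labels are made up for the example.

```python
def describe(seq):
    """Return a short label for seq based on its length."""
    if len(seq) < 10:
        return seq
    elif len(seq) == 10:
        return seq[:5] + "..."
    else:
        return str(len(seq)) + " base seq"

seq_list = ["ATTGCATGGTATCTACGG", "ATCGCA", "ATTTTCA", "ATTCATCGAT"]
labels = []
for seq in seq_list:
    labels.append(describe(seq))
print(labels)
```

Only the first branch whose condition is True runs; the rest are skipped.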

Boolean operators
Boolean operators let us group several
conditions into a single one:
>>> seq_list = ['ATTGCATGGTATCTACGG','AT',
... 'ATCGCA','ATTCATCGAT']
>>> for seq in seq_list:
...     if len(seq) < 3 or len(seq) > 15:
...         print(str(len(seq)) + ' base seq')
...     else:
...         print(seq)
...
18 base seq
2 base seq
ATCGCA
ATTCATCGAT

Boolean operators
There are three boolean operators in Python:
Boolean operator    Boolean operation    Result
and                 False and False      False
                    True and True        True
                    True and False       False
or                  False or False       False
                    True or True         True
                    True or False        True
not                 not True             False
                    not False            True
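The table above can be reproduced with two nested loops. This is only a sketch to check the rows by running them; the list names and_results and or_results are illustrative.

```python
values = [False, True]
and_results = []
or_results = []
for a in values:
    for b in values:
        and_results.append(a and b)
        or_results.append(a or b)
        print(a, "and", b, "=", a and b, "|", a, "or", b, "=", a or b)

not_results = [not True, not False]
print(not_results)
```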

True/False functions
Functions can return True or False:
>>> def is_long(seq,min_len=10):
...     if len(seq) > min_len:
...         return True
...     else:
...         return False
...
>>> seq_list = ['ATTGCATGGTATCTACGG','AT',
... 'ATCGCA','ATTCATCGAT']
>>> for seq in seq_list:
...     if is_long(seq):
...         print('Long sequence')
...     else:
...         print('Short sequence')
...
Long sequence
Short sequence
Short sequence
Short sequence

>>> for seq in seq_list:
...     if is_long(seq,5):
...         print('Long sequence')
...     else:
...         print('Short sequence')
...
Long sequence
Short sequence
Long sequence
Long sequence
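Because a condition already evaluates to True or False, a function like is_long can return the condition directly. The following sketch is an equivalent, shorter variant and is not how the slides write it. The list long_seqs is an illustrative name.

```python
def is_long(seq, min_len=10):
    """Return True when seq has more than min_len characters."""
    return len(seq) > min_len

seq_list = ["ATTGCATGGTATCTACGG", "AT", "ATCGCA", "ATTCATCGAT"]

# Keep only the sequences longer than 5 bases.
long_seqs = []
for seq in seq_list:
    if is_long(seq, 5):
        long_seqs.append(seq)
print(long_seqs)
```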

Conclusion
• Python is a very powerful language that is
currently used for many things:
• Bioinformatics tool development
• Pipeline deployment
• Big Data analysis
• Scientific computing
• Web development (Django)
The best way to learn to code
is through practice and
by reading other developers’ code!

References & Further Reading
• Official Python documentation:
https://www.python.org/doc/
• “Python for Biologists” by Dr. Martin Jones
www.pythonforbiologists.com
• E-books with biological focus
• CodeSkulptor: http://www.codeskulptor.org/
• Codecademy Python course:
https://www.codecademy.com/learn/python
• Jupyter project: http://jupyter.org/index.html
