
Bioinformatics with R Cookbook

Paurush Praveen Sinha











Chapter No. 1
"Starting Bioinformatics with R"
In this package, you will find:
A Biography of the author of the book
A preview chapter from the book, Chapter No. 1 "Starting Bioinformatics with R"
A synopsis of the book's content
Information on where to buy this book




About the Author
Paurush Praveen Sinha has been working with R for the past seven years. An engineer
by training, he got into the world of bioinformatics and R when he started working as a
research assistant at the Fraunhofer Institute for Algorithms and Scientific Computing
(SCAI), Germany. Later, during his doctorate, he developed and applied various machine
learning approaches with the extensive use of R to analyze and infer from biological data.
Besides R, he has experience in various other programming languages, which include
Java, C, and MATLAB. During his experience with R, he contributed to several existing
R packages and is working on the release of some new packages that focus on machine
learning and bioinformatics. In late 2013, he joined the Microsoft Research-University of
Trento COSBI in Italy as a researcher. He uses R as the backend engine for developing
various utilities and machine learning methods to address problems in bioinformatics.
Successful work is a fruitful culmination of efforts by many people. I
would like to hereby express my sincere gratitude to everyone who has
played a role in making this effort a successful one. First and foremost, I
wish to thank David Chiu and Chris Beeley for reviewing the book.
Their feedback, in terms of criticism and comments, was significant in
bringing improvements to the book and its content. I sincerely thank
Kevin Colaco and Ruchita Bhansali at Packt Publishing for their effort as
editors. Their cooperation was instrumental in bringing out the book. I
appreciate and acknowledge Binny K. Babu and the rest of the team at
Packt Publishing, who have been very professional, understanding, and
helpful throughout the project. Finally, I would like to thank my parents,
brother, and sister for their encouragement and appreciation and the pride
they take in my work, despite not being sure of what I'm doing. I
thank them all. I dedicate the work to Yashi, Jayita, and Ahaan.



Bioinformatics with R Cookbook
In recent years, there have been significant advances in genomics and molecular biology
techniques, giving rise to a data boom in the field. Interpreting this huge volume of data in a
systematic manner is a challenging task and requires the development of new
computational tools, thus bringing an exciting, new perspective to areas such as statistical
data analysis, data mining, and machine learning. R, which has been a favorite tool of
statisticians, has become a widely used software tool in the bioinformatics community.
This is mainly due to its flexibility, data handling and modeling capabilities, and most
importantly, due to it being free of cost.
R is a free and robust statistical programming environment. It is a powerful tool for
statistics, statistical programming, and visualizations; it is prominently used for statistical
analysis. It evolved from S, developed by John Chambers at Bell Labs, which is also the
birthplace of many programming languages, including C. Ross Ihaka and Robert
Gentleman developed R in the early 1990s.
Roughly around the same time, bioinformatics was emerging as a scientific discipline
because of the advent of technological innovations such as sequencing, high throughput
screening, and microarrays that revolutionized biology. These techniques could generate
the entire genomic sequence of organisms; microarrays could measure thousands of
mRNAs, and so on. All this brought a paradigm shift in biology from a small-data
discipline to a big-data discipline, a shift that continues to date. The challenges posed
by this surge in data initially compelled researchers to adopt whatever tools were
available at their disposal. At this time, R was in its early days and was popular among
statisticians. However, given the need for and the competence of R during the late 90s
(and the following decades), it started gaining popularity in the field of computational
biology and bioinformatics.
The structure of the R environment is a base program that provides basic programming
functionalities. These functionalities can be extended with smaller specialized program
modules called packages or libraries. This modular structure empowers R to unify most
of the data analysis tasks in one program. Furthermore, as it is a command-line
environment, the prerequisite programming skill is minimal; nevertheless, it requires
some programming experience.
This book presents various data analysis operations for bioinformatics and computational
biology using R. With this book in hand, we will solve many interesting problems related
to the analysis of biological data coming from different experiments. In almost every
chapter, we have interesting visualizations that can be used to present the results.
Now, let's look at a conceptual roadmap of the book's organization.




What This Book Covers
Chapter 1, Starting Bioinformatics with R, marks the beginning of the book with some
groundwork in R. The major topics include package installation, data handling, and
manipulations. The chapter is further extended with some recipes for a literature search,
which is usually the first step in any (especially biomedical) research.
Chapter 2, Introduction to Bioconductor, presents some recipes to solve basic
bioinformatics problems, especially the ones related to metadata in biology, with the
packages available in Bioconductor. The chapter solves the issues related to ID
conversions and functional enrichment of genes and proteins.
Chapter 3, Sequence Analysis with R, mainly deals with the sequence data in terms of
characters. The recipes cover the retrieval of sequence data, sequence alignment, and
pattern search in the sequences.
Chapter 4, Protein Structure Analysis with R, illustrates how to work with proteins at
sequential and structural levels. Here, we cover important aspects and methods of protein
bioinformatics, such as sequence and structure analysis. The recipes include protein
sequence analysis, domain annotations, protein structural property analysis, and so on.
Chapter 5, Analyzing Microarray Data with R, starts with recipes to read and load the
microarray data, followed by its preprocessing, filtering, mining, and functional
enrichment. Finally, we introduce a co-expression network as a way to map relations
among genes in this chapter.
Chapter 6, Analyzing GWAS Data, talks about analyzing the GWAS data in order to
make biological inferences. The chapter also covers multiple association analyses as well
as CNV data.
Chapter 7, Analyzing Mass Spectrometry Data, deals with various aspects of analyzing
the mass spectrometry data. Issues related to reading different data formats, followed by
analysis and quantifications, have been included in this chapter.
Chapter 8, Analyzing NGS Data, illustrates the analysis of various types of next-generation
sequencing (NGS) data. The recipes in this chapter deal with NGS data processing, RNAseq,
ChipSeq, and methylation data.
Chapter 9, Machine Learning in Bioinformatics, discusses recipes related to machine
learning in bioinformatics. We address clustering, classification, and
Bayesian learning in this chapter to draw inferences from biological data.
Appendix A, Useful Operators and Functions in R, contains some useful general
functions in R to perform various generic and non-generic operations.
Appendix B, Useful R Packages, contains a list and description of some interesting
libraries that contain utilities for different types of analysis and visualizations.



1
Starting Bioinformatics with R
In this chapter, we will cover the following recipes:
Getting started and installing libraries
Reading and writing data
Filtering and subsetting data
Basic statistical operations on data
Generating probability distributions
Performing statistical tests on data
Visualizing data
Working with PubMed in R
Retrieving data from BioMart
Introduction
Recent developments in molecular biology, such as high throughput array technology or
sequencing technology, are leading to an exponential increase in the volume of data that
is being generated. Bioinformatics aims to gain insight into biological functioning and the
organization of living systems from this data. The enormous data generated needs
robust statistical handling, which in turn requires a sound computational statistics tool
and environment. R provides just that kind of environment. It is a free tool with a large
community and leverages the analysis of data via its huge package libraries that support
various analysis operations.


Before we start dealing with bioinformatics, this chapter lays the groundwork for upcoming
chapters. We first make sure that you know how to install R, followed by a few sections
on the basics of R that will refresh the knowledge of R programming that we assume you
already have. This part of the book will mostly introduce you to certain functions in R that
will be useful in the upcoming chapters, without getting into the technical details. The latter
part of the chapter (the last two recipes) will introduce bioinformatics with respect to
literature searching and data retrieval in the biomedical arena.
Here, we will also discuss the technical details of the R programs used.
Getting started and installing libraries
Libraries in R are packages that have functions written to serve specific purposes; these
include reading specific file formats in the case of a microarray data file or fetching data from
certain databases, for example, GenBank (a sequence database). You must have these
libraries installed in the system as well as loaded in the R session in order to be able to use
them. They can be downloaded and installed from a specific repository or directly from a
local path. Two of the most popular repositories of R packages are the Comprehensive R Archive
Network (CRAN) and Bioconductor. CRAN maintains and hosts identical, up-to-date versions
of code and documentation for R on its mirror sites. We can use the install.packages
function to install a package from CRAN, which has many mirror locations. Bioconductor is
another repository of R packages and associated tools, with a focus on the analysis
of high-throughput data. A detailed description of how to work with Bioconductor
(http://www.bioconductor.org) is covered in the next chapter.
This recipe aims to explain the steps involved in installing packages/libraries from these
repositories as well as from local files.
Getting ready
To get started, the prerequisites are as follows:
You need an R application installed on your computer. For more details on the R
program and its installation, visit http://cran.r-project.org.
You need an Internet connection to install packages/libraries from web repositories
such as CRAN and Bioconductor.


How to do it
The initialization of R depends on the operating system you are using. On Windows and Mac
OS platforms, just clicking on the program starts an R session, like any other application for
these systems. However, for Linux, R can be started by typing R into the terminal (for all
Linux distributions, namely, Ubuntu, SUSE, Debian, and Red Hat). Note that calling R via its
terminal or command line is also possible in Windows and Mac systems.
This book will mostly use Linux as the operating system; nevertheless, the differences will
be explained whenever required. The same commands can be used for all the platforms,
but the Linux-based R lacks the default graphical user interface (GUI) of R. At this point, it
is worth mentioning some of the code editors and integrated development environments
(IDEs) that can be used to work with R. Some popular IDEs for R include RStudio
(http://www.rstudio.com) and the Eclipse IDE (http://www.eclipse.org) with the StatET
package. To learn more about the StatET package, visit http://www.walware.de/goto/statet.
Some commonly used code editors are Emacs, Kate, Notepad++, and so on. The R
GUI in Windows and Mac has its own code editor that meets all the requirements.
Windows and Mac OS GUIs make installing packages pretty straightforward. Just follow the
ensuing steps:
1. From the Packages menu in the toolbar, select Install package(s)....
2. If this is the first time that you are installing a package during this session, R will ask
you to pick a mirror. Selecting the geographically nearest mirror usually gives a
faster download.
3. Click on the name of the package that you want to install and then on the OK button.
R downloads and installs the selected packages.
By default, R fetches packages from CRAN. However, you can
change this if necessary just by choosing Select repositories...
from the Packages menu. You need to switch the repository
when the desired package is available in a different repository.
Remember that a change in the repository is different from a
change in the mirror; a mirror is the same repository at a
different location.
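The same switch can be made from the R prompt on any platform; here is a minimal sketch (the mirror URL is just an example, and the menus shown by these helpers vary with your R version):
> setRepositories()   # interactively tick or untick repositories such as CRAN and BioC software
> chooseCRANmirror()  # pick a (geographically close) CRAN mirror for this session
> options(repos = c(CRAN = "http://cran.r-project.org"))  # or set the mirror non-interactively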


The following screenshot shows how to set up a repository for a package installation
in the R GUI for Windows:
4. Install an R package in one of the following ways:
From a terminal, install it with the following simple command:
> install.packages("package_name")
From a local directory, install it by setting the repository to null as follows:
> install.packages("path/to/mypackage.tar.gz", repos = NULL,
type="source")
Another way to install packages in Unix (Linux) is without entering R (from the
source) itself. This can be achieved by entering the following command in the
shell terminal:
R CMD INSTALL path/to/mypackage.tar.gz


5. To check the installed libraries/packages in R, type the following command:
> library()
6. To quit an R session, type in q() at the R prompt, and the session will ask whether
you want to save the session as a workspace image or not or whether you want to
cancel the quit command. Accordingly, you need to type in y, n, or c. In a Windows
or Mac OS, you can directly close the R program like any other application.
> q()
Save workspace image [y/n/c]: n
Downloading the example code
You can download the example code files for all Packt books
that you have purchased from your account at http://www.packtpub.com.
If you purchased this book from elsewhere, you can visit
http://www.packtpub.com/support and register to have the files
e-mailed directly to you.
How it works...
An R session can run as a GUI on a Windows or Mac OS platform (as shown in the following
screenshot). In the case of Linux, the R session starts in the same terminal. Nevertheless,
you can run R within the terminal in Windows as well as Mac OS:
The R GUI in Mac OS showing the command window (right), editor (top left), and plot window (bottom left)


The install.packages command asks the user to choose a mirror (usually the nearest) for
the repository. It also checks for the dependencies required for the package being installed,
provided we set the dependencies argument to TRUE. Then, it downloads the binaries
(Windows and Mac OS) for the package (and the dependencies, if required). This is followed
by its installation. The function also checks the compatibility of the package with R, as on
occasions, the library cannot be loaded or installed due to an incorrect version or missing
dependencies. In such cases, the installed packages are revoked. Installing from the source
is required in cases where you have to compile the binaries for your own machine, for
instance, for your specific R version. The availability of binaries for a package makes
installation easier for new users. The filenames of the package binaries have a .tgz/.zip
extension. The value of repos can be set to the address of any specific remote source. On
Windows, the function also offers a GUI that graphically and interactively shows the list of
binary versions of the packages available for your R version. Nevertheless, the command-line
installation is also functional on the Windows version of R.
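As a quick illustration of the arguments discussed above (the package name and repository URL are only examples):
> install.packages("ggplot2", dependencies = TRUE)  # pull a CRAN package with its dependencies
> install.packages("ggplot2", repos = "http://cran.r-project.org")  # point repos at a specific remote source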
There's more...
A few libraries are loaded by default when an R session starts. To load a library in R, run the
following command:
> library(package_name)
Loading a package imports all the functions of this specific package into the R session. The
default packages in the session can be viewed using the following getOption command:
> getOption("defaultPackages")
The currently loaded libraries in a session can be seen with the following command:
> print(.packages())
An alternative for this is sessionInfo(), which provides version details as well.
All the installed packages can be displayed by running the library function as follows:
> library()
Besides all this, R has a comprehensive built-in help system. You can get help from R in a
number of ways. The Windows and Mac OS platforms offer help as a separate HTML page
(as shown in the following screenshot) and Linux offers similar help text in the running
terminal. The following is a list of options that can be used to seek help in R:
> help.start()
> help(sum) # Accesses help file for function sum
> ?sum # Searches the help files for function sum
> example(sum) # demonstrates the function with an example
> help.search("sum") # uses the argument character to search help files


All of the previous functions provide help in a unique way. The help.start command is the
general command used to start the hypertext version of the R documentation. All the help files
related to a package can be checked with the following command:
> help(package="package_name")
The following screenshot shows an HTML help page for the sum function in R:
Reading and writing data
Before we start with analyzing any data, we must load it into our R workspace. This can
be done directly either by loading an external R object (typical file extensions are .rda or
.RData, but it is not limited to these extensions), an internal R object from a package, or a
TXT, CSV, or Excel file. This recipe explains the methods that can be used to read data in
the table or .csv format into an R session and/or write similar files from it.


Getting ready
We will use the iris dataset for this recipe, which is available with the R base packages.
The dataset bears quantified features of the morphologic variation of three
related species of Iris flowers.
How to do it
Perform the following steps to read and write functions in R:
1. Load internal R data (already available with a package or base R) using the following
data function:
> data(iris)
2. To learn more about iris data, check the help function in R using the following function:
> ?iris
3. Load external R data (conventionally saved as .rda or .RData, but not limited to
this) with the following load function:
> load(file="mydata.RData")
4. To save a data object, say, D, you can use the save function as follows:
> save(D, file="myData.RData")
5. To read tabular data in the form of a .csv file with read.csv or read.table,
type the following command:
> mydata <- read.table("file.dat", header = TRUE, sep="\t",
row.names = 1)
> mydata <- read.csv("mydata.csv")
6. It is also possible to read an Excel file in R. You can achieve this with various
packages such as xlsx and gdata. The xlsx package requires a Java setup, while
gdata is relatively simple. However, the xlsx package offers more functionalities,
such as read permissions for different sheets in a workbook and the newer versions
of Excel files. For this example, we will use the xlsx package. Use the read.xlsx
function to read an Excel file as follows:
> install.packages("xlsx", dependencies=TRUE)
> library(xlsx)
> mydata <- read.xlsx("mydata.xlsx", sheetIndex = 1)


7. To write these data frames or table objects into a CSV or table file, use the write.csv
or write.table function as follows:
> write.table(x, file = "myexcel.xls", append = FALSE, quote =
TRUE, sep = " ")
> write.csv(x, file = "mydata.csv")
How it works
The read.csv or write.csv commands take the filename in the current working
directory (if a complete path has not been specified) and, based on the separators (usually
the sep argument), import the data frames (or export them in the case of write commands).
To find out the current working directory, use the getwd() command. In order to change it
to your desired directory, use the setwd function as follows:
> setwd("path/to/desired/directory")
The second argument, header, indicates whether or not the first row is a set of labels by taking
the Boolean values TRUE or FALSE. The read.csv function may not work in the case of
incomplete tables with the default fill argument. To overcome such issues, set the fill
argument to TRUE. To learn more about optional arguments, take a look at the help
section of the read.table function. Both the functions (read.table and read.csv) can
use the headers (usually the first row) as column names and specify certain column numbers
as row names.
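A short sketch tying these arguments together (the directory, filename, and column layout are hypothetical):
> getwd()  # shows the directory where relative paths are resolved
> setwd("~/projects/bioinfo")  # change the working directory
> mydata <- read.table("expression.txt", header = TRUE, sep = "\t",
row.names = 1, fill = TRUE)  # first row as labels, first column as row names, pad short rows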
There's more
To get further information about the loaded dataset, use the class function for the dataset
to get the type of dataset (object class). The data or object type in R can be of numerous
types. This is beyond the scope of the book. It is expected that the reader is acquainted with
these terms. Here, in the case of the iris data, the type is a data frame with 150 rows and
five columns (type the dim command with iris as the argument). A data frame class is
like a matrix but can accommodate objects of different types, such as character, numeric,
and factor, within it. You can take a look at the first or last few rows using the head or tail
functions (there are six rows by default) respectively, as follows:
> class(iris)
> dim(iris)
> head(iris)
> tail(iris)


The following WriteXLS package allows us to write an object into an Excel file for the
x data object:
> install.packages("WriteXLS")
> library(WriteXLS)
> WriteXLS(x, ExcelFileName = "R.xls")
The package also allows us to write a list of data frames into the different sheets of an
Excel file. The WriteXLS function uses Perl in the background to carry out tasks. The sheet
argument can be set within the function and assigned the sheet number where you want
to write the data.
The save function in R is a standard way to save an object. However, the saveRDS
function offers an advantage as it doesn't save both the object and its name; it just saves a
representation of the object. As a result, the saved object can be loaded into a named object
within R that will be different from the name it had when it was originally serialized. Let's take
a look at the following example:
> saveRDS(myObj, "myObj.rds")
> myObj2 <- readRDS("myObj.rds")
> ls()
[1] "myObj" "myObj2"
Another package named data.table can be used to perform data reading at a faster speed,
which is especially suited for larger data. To know more about the package, visit the CRAN
page for the package at http://cran.r-project.org/web/packages/data.table/index.html.
The foreign package (http://cran.r-project.org/web/packages/foreign/index.html)
is available to read/write data for other programs such as SPSS and SAS.
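As a small illustration of the faster reader from data.table (the filename is hypothetical; fread guesses the separator and header automatically):
> install.packages("data.table")
> library(data.table)
> DT <- fread("big_expression_matrix.csv")  # returns a data.table, which is also a data.frame
> class(DT)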
Filtering and subsetting data
The data that we read in our previous recipes exists in R as data frames. Data frames are
the primary structures of tabular data in R. By a tabular structure, we mean the row-column
format. The data we store in the columns of a data frame can be of various types, such as
numeric or factor. In this recipe, we will talk about some simple operations on data to extract
parts of these data frames, add a new chunk, or filter a part that satisfies certain conditions.


Getting ready
The following items are needed for this recipe:
A data frame loaded to be modified or filtered in the R session (in our case,
the iris data)
Another set of data to be added to item 1 or a set of filters to extract data from
item 1
How to do it
Perform the following steps to filter and create a subset from a data frame:
1. Load the iris data as explained in the earlier recipe.
2. To extract the names of the species and corresponding sepal dimensions
(length and width), take a look at the structure of the data as follows:
> str(iris)
'data.frame': 150 obs. of 5 variables:
$ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6
$ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4
$ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4
$ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3
$ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1
1 1 1 1 1 1 1 1 ...
3. To extract the relevant data to the myiris object, use the data.frame function that
creates a data frame with the defined columns as follows:
> myiris <- data.frame(Sepal.Length=iris$Sepal.Length,
Sepal.Width=iris$Sepal.Width, Species=iris$Species)
4. Alternatively, extract the relevant columns or remove the irrelevant ones
(however, this style of subsetting should be avoided):
> myiris <- iris[,c(1,2,5)]
5. Instead of the two previous methods, you can also use the removal approach to
extract the data as follows:
> myiris <- iris[,-c(3,4)]
6. You can add to the data by adding a new column with cbind or a new row through
rbind (the rnorm function generates a random sample from a normal distribution
and will be discussed in detail in the next recipe):
> Stalk.Length <- c(rnorm(30,1,0.1), rnorm(30,1.3,0.1),
rnorm(30,1.5,0.1), rnorm(30,1.8,0.1), rnorm(30,2,0.1))
> myiris <- cbind(iris, Stalk.Length)


7. Alternatively, you can do it in one step as follows:
> myiris$Stalk.Length <- c(rnorm(30,1,0.1), rnorm(30,1.3,0.1),
rnorm(30,1.5,0.1), rnorm(30,1.8,0.1), rnorm(30,2,0.1))
8. Check the new data frame using the following commands:
> dim(myiris)
[1] 150 6
> colnames(myiris)# get column names for the data frame myiris
[1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width"
"Species" "Stalk.Length"
9. Use rbind as depicted:
> newdat <- data.frame(Sepal.Length=10.1, Sepal.Width=0.5,
Petal.Length=2.5, Petal.Width=0.9, Species="myspecies")
> myiris <- rbind(iris, newdat)
> dim(myiris)
[1] 151 5
> myiris[151,]
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
151 10.1 0.5 2.5 0.9 myspecies
10. Extract a part from the data frame, which meets certain conditions, in one of the
following ways:
One of the conditions is as follows:
> mynew.iris <- subset(myiris, Sepal.Length == 10.1)
An alternative condition is as follows:
> mynew.iris <- myiris[myiris$Sepal.Length == 10.1, ]
> mynew.iris
    Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
151         10.1         0.5          2.5         0.9 myspecies
> mynew.iris <- subset(iris, Species == "setosa")
11. Check the first row of the extracted data as follows:
> mynew.iris[1,]
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
You can use any comparison operator and even combine more than one
condition with logical operators such as & (AND), | (OR), and ! (NOT), if required.


How it works
These functions use R indexing with named columns (the $ sign) or index numbers. The $
sign placed after the data, followed by the column name, specifies the data in that column. The
R indexing system for data frames is very simple, just like in other scripting languages, and is
represented as [rows, columns]. You can supply several indices for rows and columns using
the c function, as implemented in the preceding examples. A minus sign on the indices for rows/
columns removes these parts of the data. The rbind function used earlier combines the data
along the rows (row-wise), whereas cbind does the same along the columns (column-wise).
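A compact sketch of these indexing idioms on the iris data (the chosen rows and columns are arbitrary):
> iris$Sepal.Length[1:3]  # a column by name, first three values
> iris[1:3, c(1, 5)]  # rows 1 to 3, columns 1 and 5
> iris[1:3, c("Sepal.Length", "Species")]  # the same rows, with columns by name
> head(iris[, -c(3, 4)])  # drop the petal columns with a minus sign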
There's more
Another way to select a part of the data is to use the %in% operator with the data frame, as follows:
> mylength <- c(4,5,6,7,7.2)
> mynew.iris <- myiris[myiris[,1] %in% mylength,]
This selects all the rows from the data that meet the defined condition. The condition here
means that the value in column 1 of myiris matches any value in the
mylength vector. The extracted rows are then assigned to a new object, mynew.iris.
Basic statistical operations on data
R, being a statistical programming environment, has a number of built-in functionalities to
perform statistics on data. Nevertheless, some specific functionalities are either available in
packages or can easily be written. This section will introduce some basic built-in and useful
in-package options.
Getting ready
The only prerequisite for this recipe is the dataset that you want to work with. We use our iris
data in most of the recipes in this chapter.
How to do it
The steps to perform a basic statistical operation on the data are listed here as follows:
1. R facilitates the computing of various kinds of statistical parameters, such as the mean
and standard deviation, with simple functions. These can be applied on individual vectors
or on an entire data frame as follows:
> summary(iris) # Shows a summary for each column for table data
Sepal.Length Sepal.Width Petal.Length Petal.Width


Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
Median :5.800 Median :3.000 Median :4.350 Median :1.300
Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
Species
setosa :50
versicolor:50
virginica :50
> mean(iris[,1])
[1] 5.843333
> sd(iris[,1])
[1] 0.8280661
2. The cor function allows for the computing of the correlation between two vectors
as follows:
> cor(iris[,1], iris[,2])
[1] -0.1175698
> cor(iris[,1], iris[,3])
[1] 0.8717538
3. To get the covariance for the data matrix, simply use the cov function as follows:
> Cov.mat <- cov(iris[,1:4])
> Cov.mat
Sepal.Length Sepal.Width Petal.Length Petal.Width
Sepal.Length 0.6856935 -0.0424340 1.2743154 0.5162707
Sepal.Width -0.0424340 0.1899794 -0.3296564 -0.1216394
Petal.Length 1.2743154 -0.3296564 3.1162779 1.2956094
Petal.Width 0.5162707 -0.1216394 1.2956094 0.5810063


How it works
Most of the functions we saw in this recipe are part of basic R or generic functions. The
summary function in R provides the summaries of the input depending on the class of the
input. The function invokes various functions depending on the class of the input object.
The returned value also depends on the input object. For instance, if the input is a vector
that consists of numeric data, it will present the mean, median, minimum, maximum, and
quartiles for the data, whereas if the input is tabular (numeric) data, it will give similar
computations for each column. We will use the summary function in upcoming chapters
for different types of input objects.
The functions accept the data as input and simply compute all these statistical scores on
them, displaying them as vector, list, or data frame depending on the input and the function.
For most of these functions, we have the possibility of using the na.rm argument. This
empowers the user to work with missing data. If we have missing values (called NA in R) in our
data, we can set the na.rm argument to TRUE, and the computation will be done only based
on non-NA values. Take a look at the following chunk for an example:
> a <- c(1:4, NA, 6)
> mean(a) # returns NA
[1] NA
> mean(a, na.rm=TRUE)
[1] 3.2
We see here that in the case of missing values, the mean function returns NA by default as it
does not know how to handle the missing value. Setting na.rm to TRUE actually computes the
mean of the five non-missing numbers (1, 2, 3, 4, and 6) instead of all six values (1, 2, 3, 4,
NA, and 6), returning 3.2.
To compute the correlation between the sepal length and sepal width in our iris data, we
simply use the cor function with the two columns (sepal length and sepal width) as the
arguments for the function. We can compute different types of correlation coefficients,
namely Pearson, Spearman, and Kendall, by specifying the appropriate value for the method
argument in the function. For more details, refer to the help page (?cor) for the function.
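For instance, the following computes rank-based correlations on the same pair of columns (the choice of columns is only illustrative):
> cor(iris[,1], iris[,3], method = "spearman")  # Spearman's rank correlation
> cor(iris[,1], iris[,3], method = "kendall")  # Kendall's tau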


Generating probability distributions
Before we talk about anything in this section, try the ?Distributions command in your R
terminal (console). You will see that a help page listing different probability distributions
opens up. These are part of the base package of R. You can generate all these distributions
without the aid of additional packages. Some interesting distributions are listed in the
following table. Other distributions, for example, the multivariate normal distribution (MVN),
can be generated by the use of external packages (the MASS package for MVN). Most of
these functions follow the same syntax, so if you get used to one, the others can be used
in a similar fashion.
In addition to this simple process, you can generate different aspects of a distribution just
by adding some prefixes.
How to do it
The following are the steps to generate probability distributions:
1. To generate 100 instances of normally distributed data with a mean equal to 1 and
standard deviation equal to 0.1, use the following command:
> n.data <- rnorm(n=100, mean=1, sd=0.1)
2. Plot the histogram to observe the distribution as follows:
> hist(n.data)
3. Check the density of the distribution and observe the shape by typing the
following command:
> plot(density(n.data))
Do you see the bell shape in this plot?
4. To identify the corresponding parameters for the other prefixes, use the help file, for
example:
> ?pnorm
The following table depicts the functions that deal with various statistical distributions in R
(R Base packages only):
Distribution Probability Quantile Density Random
Beta pbeta qbeta dbeta rbeta
Binomial pbinom qbinom dbinom rbinom
Cauchy pcauchy qcauchy dcauchy rcauchy
Chi-Square pchisq qchisq dchisq rchisq


Exponential pexp qexp dexp rexp
F pf qf df rf
Gamma pgamma qgamma dgamma rgamma
Geometric pgeom qgeom dgeom rgeom
Hypergeometric phyper qhyper dhyper rhyper
Logistic plogis qlogis dlogis rlogis
Log Normal plnorm qlnorm dlnorm rlnorm
Negative Binomial pnbinom qnbinom dnbinom rnbinom
Normal pnorm qnorm dnorm rnorm
Poisson ppois qpois dpois rpois
Student t pt qt dt rt
Studentized Range ptukey qtukey dtukey rtukey
Uniform punif qunif dunif runif
How it works
The rnorm function has three arguments: n (the number of instances you want to generate),
the desired mean of the distribution, and the desired standard deviation (sd) of the
distribution. The command thus generates a vector of length n, whose mean and standard
deviation are as defined by you. If you look closely at the functions described in the table,
you can figure out a pattern. The prefixes p, q, d, and r are added to every distribution name
to generate probabilities, quantiles, densities, and random samples, respectively.
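Applied to the normal distribution, for example, the four prefixes work as follows (the chosen quantiles and probabilities are arbitrary):
> rnorm(5, mean = 0, sd = 1)  # r: five random draws
> dnorm(0, mean = 0, sd = 1)  # d: density at x = 0 (about 0.399)
> pnorm(1.96, mean = 0, sd = 1)  # p: cumulative probability up to 1.96 (about 0.975)
> qnorm(0.975, mean = 0, sd = 1)  # q: quantile for probability 0.975 (about 1.96)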
There's more
To learn more about statistical distributions, visit the Wolfram page at
http://www.wolframalpha.com/examples/StatisticalDistributions.html.
Performing statistical tests on data
Statistical tests are performed to assess the significance of results in research or application
and assist in making quantitative decisions. The idea is to determine whether there is
enough evidence to reject a conjecture about the results. In-built functions in R allow
several such tests on data. The choice of test depends on the data and the question being
asked. To illustrate, when we need to compare a group against a hypothetical value and our
measurements follow the Gaussian distribution, we can use a one-sample t-test. However, if
we have two paired groups (both measurements that follow the Gaussian distribution) being
compared, we can use a paired t-test. R has built-in functions to carry out such tests, and in
this recipe, we will try out some of these.


How to do it
Use the following steps to perform a statistical test on your data:
1. To do a t-test, load your data (in our case, it is the sleep data) as follows:
> data(sleep)
2. To perform the two-sided, unpaired t-test on the first and second columns (the values
for the two conditions), type the following commands:
> test <- t.test(sleep[,1]~sleep[,2])
> test
Welch Two Sample t-test
data: sleep[, 1] by sleep[, 2]
t = -1.8608, df = 17.776, p-value = 0.07939
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-3.3654832 0.2054832
sample estimates:
mean in group 1 mean in group 2
0.75 2.33
3. Create a contingency table as follows:
> cont <- matrix(c(14, 33, 7, 3), ncol = 2)
> cont
[,1] [,2]
[1,] 14 7
[2,] 33 3
4. Name the dimensions so that the table represents two types of cars, namely, sedan and
convertible (columns), and two genders, male and female (rows), with the counts of owners
of each car type in the cells. Thus, you have the following output:
> colnames(cont) <- c("Sedan", "Convertible")
> rownames(cont) <- c("Male", "Female")
> cont
Sedan Convertible
Male 14 7
Female 33 3


5. In order to test whether car type and gender are associated, carry out a Chi-square test
based on this contingency table as follows:
> test <- chisq.test(as.table(cont))
> test
Pearson's Chi-squared test with Yates' continuity correction
data: as.table(cont)
X-squared = 4.1324, df = 1, p-value = 0.04207
6. For a Wilcoxon signed-rank test, first create a pair of vectors containing the observations
to be tested, x and y, as shown in the following commands:
> x <- c(1.83, 0.50, 1.62, 2.48, 1.68, 1.88, 1.55, 3.06, 1.30)
> y <- c(0.878, 0.647, 0.598, 2.05, 1.06, 1.29, 1.06, 3.14, 1.29)
7. This is followed by the command that runs the Wilcoxon signed-rank test, as follows:
> test <- wilcox.test(x, y, paired = TRUE, alternative =
"greater")
8. To look at the contents of the test object, check its structure as follows and look at
the specific values of its components:
> str(test)
> test$p.value
How it works
The t-test (in our case, a two-sample t-test) computes how far the calculated difference in
means may deviate from zero by chance. Here, we use the sleep data that already exists in R.
This sleep data shows the effect of two drugs in terms of the increase in hours of sleep
compared to control on 10 patients. The result is a list that consists of nine
elements, such as the p-value, confidence interval, method, and mean estimates.
Chi-square statistics investigate whether the distributions of categorical variables differ
from one another. It is commonly used to compare observed data with the data that we
would expect to obtain according to a specific hypothesis. In this recipe, we considered the
scenario that one gender has a different preference for a car type, which comes out to be true
at a p-value cutoff of 0.05. We can also check the expected values for the Chi-square test
with chisq.test(as.table(cont))$expected.
The Wilcoxon test is used to compare two related samples or repeated measurements on
a single sample, to assess if their population mean ranks differ. It can be used to compare
the results of two methods. Let x and y be the performance results of two methods, and our
alternative hypothesis is that x is shifted to the right of y (greater). The p-value returned by the
test facilitates the acceptance or rejection of the null hypothesis.
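The one-sample case mentioned in the introduction of this recipe follows the same pattern; a small sketch against a hypothetical reference value of 3 is as follows:
> x <- rnorm(20, mean = 3.2, sd = 0.5)  # simulated measurements
> t.test(x, mu = 3)  # two-sided test of whether the true mean differs from 3
> t.test(x, mu = 3, alternative = "greater")  # one-sided version of the same test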


There's more
There are certain other tests, such as the permutation test, Kolmogorov-Smirnov test, and
so on, that can be done with R using different functions for appropriate datasets. A few more
tests will be discussed in later chapters. To learn more about statistical tests, you can refer to
a brief tutorial at http://udel.edu/~mcdonald/statbigchart.html.
Visualizing data
Data is more intuitive to comprehend if visualized in a graphical format rather than in the form
of a table, matrix, text, or numbers. For example, if we want to visualize how the sepal length
of the Iris flower varies with the petal length, we can plot them along the y and x axes,
respectively, and visualize the trend or even the correlation (scatter plot). In this recipe, we
look at some common ways of visualizing data in R using the R base graphics functions.
These plotting functions can be manipulated in
many ways, but discussing them is beyond the scope of this book. To get to know more about
all the possible arguments, refer to the corresponding help files.
Getting ready
The only item we need ready here is the dataset (in this recipe, we use the iris dataset).
How to do it
The following are the steps for some basic graph visualizations in R:
1. To create a scatter plot, start with your iris dataset. What you want to see is the
variation of the sepal length and petal length. You need a plot of the sepal length
(column 1) along the y axis and the petal length (column 3) along the x axis, as
shown in the following commands:
> sl <- iris[,1]
> pl <- iris[,3]
> plot(x=pl, y=sl, xlab="Petal length", ylab="Sepal length",
col="black", main="Variation of sepal length with petal length")
Or alternatively, we can use the following command:
> with(iris, plot(x = Petal.Length, y = Sepal.Length))
2. To create a boxplot for the data, use the boxplot function in the following way:
> boxplot(Sepal.Length~Species, data=iris, ylab="sepal length",
xlab="Species", main="Sepal length for different species")


3. Plotting a line diagram, however, is the same as plotting a scatter plot; just introduce
another argument, type, and set it to 'l'. However, we use a different,
self-created dataset to depict this as follows:
> genex <- c(rnorm(100, 1, 0.1), rnorm(100, 2, 0.1), rnorm(50, 3,
0.1))
> plot(x=genex, type='l', main="line diagram")
Plotting in R: (A) Scatter plot, (B) Boxplot, (C) Line diagram, and (D) Histogram


4. Histograms can be used to visualize the density of the data and the frequency of every
bin/category. Plotting histograms in R is pretty simple; use the following commands:
> x <- rnorm(1000, 3, 0.02)
> hist(x)
How it works
The plot function extracts the relevant data from the original dataset with the column
numbers (sl and pl, respectively, for the sepal length and petal length) and then plots a
scatter plot. The plot function then plots the sepal length along the y axis and the petal
length along the x axis. The axis labels can be assigned with the argument for xlab and
ylab, respectively, and the plot can be given a title with the main argument. The plot (in the
A section of the previous screenshot) thus shows that the two variables follow a more or less
positive correlation.
Scatter plots are not useful if one has to look for a trend, that is, how a value evolves
along the indices, which may, for instance, represent time in a dynamic process (for example,
the expression of a gene over time or along the concentration of a drug). A line diagram is a
better way to show this. Here, we first generate a set of 250 artificial values; their indices
are the values on the x scale. For these values, we assume a normal distribution,
as we saw in the previous section. This is then plotted (as shown in the C section of the
previous screenshot). It is possible to add more lines to the same plot using the lines
function as follows:
> lines(density(x), col="red")
A boxplot can be an interesting visualization if we want to compare two categories or groups
in terms of their attributes that are measured in terms of numbers. They depict groups of
numerical data through their quartiles. To illustrate, let us consider the iris data again. We
have the name of the species in this data (column 5). Now, we want to compare the sepal
length of these species with each other, such as which one has the longest sepal and how the
sepal length varies within and between species. The data table has all this information, but it
is not readily observable.
The first argument of the boxplot function is a formula that defines what to plot and what to
plot against. It is given in terms of the column names of the data frame, which is the second
argument. Other arguments are the same as in other plot functions. The resulting plot (as
shown in the B section of the previous screenshot) shows three boxes along the x axis for the
three species in our data. Each of these boxes depicts the range, quartiles, and median of the
corresponding sepal lengths.
The histogram (the D section of the previous screenshot) describes the distribution of data.
As we see, the data is normally distributed with a mean of 3; therefore, the plot displays a bell
shape with a peak around 3. To see the bell shape, try the plot(density(x)) function.


There's more
You can use the plot function for an entire data frame (try doing this for the iris dataset with
plot(iris)). You will observe a set of pair-wise plots arranged like a matrix. Besides this, there are
many other packages available in R for different high-quality plots, such as ggplot2 and plotrix.
They will be discussed in the next chapters when needed. This section was just an attempt to
introduce the simple plot functions in R.
Working with PubMed in R
Research begins with a survey of the related work in the field. This can be achieved by
looking into the available literature. PubMed is a service that provides the option to look
into the literature. The service is provided by the NCBI-Entrez databases (shown in the
following screenshot) and is available at https://www.ncbi.nlm.nih.gov. R provides an
interface to look into various aspects of the literature via PubMed. This section provides a
protocol to handle this sort of interface. This recipe allows searching, storing, mining,
and quantitative meta-analysis within the R program itself, without the need to visit the
PubMed page every time, thus aiding in analysis automation. The following screenshot
shows the PubMed web page for queries and retrieval:


Getting ready
It's time to get practical with what we have learned so far. For all the sessions throughout this
book, we will use the Linux terminal. Let's start at the point where research begins: getting into
the bibliographic data. The RISmed package, written and maintained by Kovalchik, facilitates
the analysis of NCBI database content. The following are the requirements to work
with PubMed in R:
An Internet connection to access the PubMed system
An RISmed package installed and loaded in the R session; this can be done easily
with the following chunk of code:
> install.packages("RISmed")
> library(RISmed)
To look into the various functionalities, you can use the following help function of R:
> help(package="RISmed")
How to do it
The following steps illustrate how to search and retrieve literature from PubMed using R:
1. Load the default data in RISmed. The default data available with the package is for
myeloma. Start by loading this data as follows:
> data(myeloma)
2. Now, find the myeloma object that was loaded in your R workspace with the ls()
command as follows (note that you might see some other objects as well besides
the myeloma object):
> ls()
[1] "myeloma"
3. To see the contents of the myeloma object, use the following command:
> str(myeloma)
4. Take a look at each element of the data using RISmed, which has the following
specific functions:
> AbstractText(myeloma)
> Author(myeloma)
> ArticleTitle(myeloma)
> Title(myeloma)
> PMID(myeloma)


5. Create your customized query. What we had until now for RISmed was based on a
precompiled data object from the package. It will be interesting to discuss how we can
create a similar object (for instance, cancer) with a query of our choice. The function
that facilitates the data retrieval and the creation of a RISmed class object is as follows:
> cancer <- EUtilsSummary("cancer[ti]", type="esearch",
db="pubmed")
> class(cancer)
How it works
Before we go deep into the functioning of the package, it's important to know about
E-utilities. RISmed uses E-utilities to retrieve data from the Entrez system. In this chapter,
however, our focus is on bibliographic data. E-utilities provide an interface to the Entrez
query and database system. It covers a range of data, including bibliographic, sequences,
and structural data managed by the NCBI. Its functioning is very simple; it sends the query
through a URL to the Entrez system and retrieves the results for the query. This enables
the use of any programming language, such as Perl, Python, or C++, to fetch the XML
response and interpret it. There are several tools that act as a part of E-utilities (for details,
visit http://www.ncbi.nlm.nih.gov/books/NBK25497/). However, for us, Efetch is
interesting. It responds to an input of unique IDs (for example, PMIDs) for a database (in our
case, PubMed) with corresponding data records. The Efetch utility is a data retrieval utility that
fetches data from the Entrez databases in the requested format, with the input in the form of
corresponding IDs. The RISmed package uses it to retrieve literature data.
The first argument of the EUtilsSummary function is the query term, and in the square
brackets, we have the field of the query term (in our case, [ti] is for title, [au] is for author,
and so on). The second argument is the E-utility type and the third one refers to the database
(in our case, PubMed). The myeloma object of the RISmed class has information about the
query that was made to fetch the object from PubMed, the PMIDs of the search hits, and
the details of the articles, such as the year of publication, author, titles, abstract and journal
details, and associated mesh terms.
All these commands of the package used with/for myeloma return a list of length equal to
the number of hits in the data object (in our case, the myeloma object). Note that the Title
function returns the title of the journal or publisher and not the title of the article, which can
be seen with ArticleTitle.
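A slightly richer query sketch, assuming that the additional RISmed arguments for limiting the number and date range of hits (retmax, mindate, and maxdate) behave as described in the package documentation:
> q <- EUtilsSummary("myeloma[ti] AND Smith[au]", type="esearch",
db="pubmed", retmax=100, mindate=2010, maxdate=2013)
> QueryCount(q)  # the number of records matching the query
> fetched <- EUtilsGet(q)
> ArticleTitle(fetched)[1:5]  # titles of the first five retrieved articles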
Now, let's take a look at the structure of the cancer object that we created:
> str(cancer) # As in August 2013
Formal class 'EUtilsSummary' [package "RISmed"] with 5 slots
..@ count : num 575447
..@ retmax : num 1000
..@ retstart : num 0


..@ id : chr [1:1000] "23901442" "23901427" "23901357" "23901352" ...
..@ querytranslation: chr "cancer[ti]"
> cancer@id[1:10]
[1] "23905205" "23905156" "23905117" "23905066" "23905042" "23905012"
[7] "23904955" "23904921" "23904880" "23904859"
The cancer object consists of five slots: count, retmax, retstart, id, and
querytranslation. These variables are stored as slots of the cancer
object. Therefore, in case we need to get the PMIDs of the retrieval, we can do so by
getting the values of the id slot of the cancer object with the following code:
> cancer@id
One important point that should be noted is that this query retrieved only the first 1000 hits
out of 575,447 (1000 is the default value for retmax). Furthermore, one should follow the
usage policies to avoid overloading the E-utilities server. For further details, read the policy at
http://www.ncbi.nlm.nih.gov/books/NBK25497/.
Now, we are left with the creation of a RISmed object (the same as the myeloma object).
To get this, we use the following EUtilsGet function:
> cancer.ris <- EUtilsGet(cancer, type="efetch", db="pubmed")
> class(cancer.ris)
This new object, cancer.ris, can be used to acquire further details as explained earlier.
For more operations on PubMed, refer to the help le of the RISmed package.
A drawback of the RISmed package is that, in some cases, due to the incorrect parsing of
text, the values returned could be inaccurate. A more detailed explanation of this package
can be found by seeking help for the package as described earlier. To get to know more about
the RISmed package, refer to the CRAN package home page at
http://cran.r-project.org/web/packages/RISmed/RISmed.pdf.
Some interesting applications of the RISmed package are available on the R Psychologist
page at http://rpsychologist.com/an-r-script-to-automatically-look-at-pubmed-citation-counts-by-year-of-publication/.


Retrieving data from BioMart
So far, we discussed bibliographic data retrieval from PubMed. R also allows the handling of
other kinds of data. Here, we introduce another R package called biomaRt, developed by
Durinck and co-workers. It provides an interface to a collection of databases implementing
the BioMart suite (http://www.biomart.org). It enables the retrieval of data from a
range of BioMart databases, such as Ensembl (genes and genomes), UniProt (information on
proteins), HGNC (gene nomenclature), Gramene (plant functional genomics), and WormBase
(information on C. elegans and other nematodes).
Getting ready
The following are the prerequisites:
Install and load the biomaRt library
Prepare the data IDs or names you want to retrieve (usually gene names; we use
BRCA1 for a demo in this recipe), such as the ID or the chromosomal location
How to do it
Retrieving the gene ID from HGNC involves the following steps, where we first set the mart
(data source), followed by the retrieval of genes from this mart:
1. Before you start using biomaRt, install the package and load it into the R session.
The package can directly be installed from Bioconductor with the following script.
We discuss more about Bioconductor in the next chapter; for the time being, take
a look at the following installation:
> source("http://bioconductor.org/biocLite.R")
> biocLite("biomaRt")
> library(biomaRt)
2. Select the appropriate mart for retrieval by defining the right database for your query.
Here, you will look for human Ensembl genes; hence, run the useMart function
as follows:
> mart <- useMart(biomart = "ensembl", dataset =
"hsapiens_gene_ensembl")
3. Now, you will get the list of genes from the ensembl data, which you opted for earlier,
as follows:
> my_results <- getBM(attributes = c("hgnc_symbol"), mart = mart)


4. You can then sample a few genes, say 50, from your retrieved genes as follows:
> N <- 50
> mysample <- sample(my_results$hgnc_symbol,N)
> head(mysample)
[1] "AHSG" "PLXNA4" "SYT12" "COX6CP8" "RFK" "POLR2LP"
5. The biomaRt package can also be used to retrieve sequences from the databases
for a gene, namely "BRCA1", as shown in the following commands:
> seq <- getSequence(id="BRCA1", type="hgnc_symbol",
seqType="peptide", mart = mart)
> show(seq)
6. To retrieve a sequence by specifying the chromosome position, the range of the
position (upstream and downstream from a site) can be used as well, as follows:
> seq2 <- getSequence(id="ENST00000520540",
type='ensembl_transcript_id', seqType='gene_flank',
upstream = 30, mart = mart)
7. To see the sequence, use the show function as follows:
> show(seq2)
gene_flank ensembl_transcript_id
1 AATGAAAAGAGGTCTGCCCGAGCGTGCGAC ENST00000520540
How it works
The source function in R loads a new set of functions from the source file into the R session;
do not confuse it with a package. Furthermore, during installation, R might ask you to update
the Bioconductor libraries that were already installed. You can choose these libraries as per
your requirements.
The biomaRt package works with the BioMart databases as described earlier. It first
selects the mart of interest (that is why we have to select our mart for a specific query).
Then, this mart is used to search for the query on the BioMart database. The results are
then returned and formatted as the return value. The package thus provides an interface
for the BioMart system.
Thus, the biomaRt package can search the database and fetch a variety of biological data.
Although the data can be downloaded in a conventional way from its respective database,
biomaRt can be used to bring the element of automation into your workflow for bulk and
batch processing.
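To find the valid names to plug into useMart and getBM, the package ships with discovery helpers; a brief sketch is as follows (the output is not shown here):
> library(biomaRt)
> listMarts()  # the available BioMart databases
> head(listDatasets(useMart("ensembl")))  # datasets within the ensembl mart
> mart <- useMart("ensembl", dataset = "hsapiens_gene_ensembl")
> head(listAttributes(mart))  # fields that getBM can return
> head(listFilters(mart))  # fields that can restrict a query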


There's more
The biomaRt package can also be used to convert one gene ID to other types of IDs. Here,
we illustrate the conversion of RefSeq IDs to gene symbols with the following chunk of code:
> mart <- useMart(biomart = "ensembl", dataset = "hsapiens_gene_ensembl")
> geneList <- read.csv("mylist.csv")
> results <- getBM(attributes = c("refseq_mrna", "hgnc_symbol"), filters
= "refseq_mrna", values = geneList[,2], mart = mart)
> results
refseq_mrna hgnc_symbol
1 NM_000546 TP53
2 NM_001271003 TFPI2
3 NM_004402 DFFB
4 NM_005359 SMAD4
5 NM_018198 DNAJC11
6 NM_023018 NADK
7 NM_033467 MMEL1
8 NM_178545 TMEM52
Though biomaRt enables ID conversion for biological entities and covers most of our needs,
in this book, we also use some other packages that are handier for this task; they will be
illustrated in the next chapter.
See also
The BioMart home page at http://www.biomart.org to know more about BioMart
The BioMart: driving a paradigm change in biological data management article by
Arek Kasprzyk (http://database.oxfordjournals.org/content/2011/bar049.full)
The BioMart and Bioconductor: a powerful link between biological databases and
microarray data analysis article by Durinck and others
(http://bioinformatics.oxfordjournals.org/content/21/16/3439.long), which discusses the
details of the biomaRt package



Where to buy this book
You can buy Bioinformatics with R Cookbook from the Packt Publishing website:
http://www.packtpub.com/bioinformatics-with-r-cookbook/book.
Free shipping to the US, UK, Europe and selected Asian countries. For more information, please
read our shipping policy.
Alternatively, you can buy the book from Amazon, BN.com, Computer Manuals and
most internet book retailers.



















