
Using big data for genomic medicine

Fodor, Andreea
[Email address]

1 Introduction

In recent times, bioinformatics has begun adopting a distributed approach based on the MapReduce programming model, mainly because of the significant quantity of data generated by modern sequencing techniques. Despite this, using MapReduce and related Big Data technologies and platforms (for example Apache Hadoop and Spark) often fails to generate satisfying results, falling short in both efficiency and effectiveness. In this paper I will elaborate on how the development of distributed and Big Data management technologies has impacted the study of large datasets of sequences. Furthermore, I will show how tuning the configuration parameters can be imperative in order to obtain improved performance, with emphasis on large quantities of data.

1.1 What is the problem

The growth of the available datasets has become too fast for the existing algorithms that are typically used for the study of biological sequences. Because of this, a distributed computational approach to processing big data is becoming more and more popular among researchers as a way to solve issues related to large amounts of data.

Genetic disease diagnosis and risk factor prediction have received a tremendous amount of help from genome analysis. While one exome amounts to as much as 15 GB of data encoded as a FASTQ file, a whole genome can reach a size of up to 1 TB. Advanced variant discovery and interpretation processes may take up to 10 hours to process one exome, depending on the type of analysis. As whole-genome sequencing becomes economically accessible to the general population, personalized medicine will require proportionally scalable variant analysis solutions.

For some types of variations, the variant discovery process is a pipeline in which data flows through a series of thoroughly studied stages, from the initial reads coming off the sequencing machine to a set of variants interpreted by a clinician.

In his research paper “Multiple comparative metagenomics using multiset k-mer counting”, Gaëtan Benoit raises the concern that the fundamental domain of alignment-free linguistic and informational analysis of genomic and proteomic sequences has received little attention in this context. It is therefore important to note that the collection of k-mer statistics (i.e., how many times each sequence of length k over a finite alphabet appears in a set of biological sequences, at a genomic scale) is an essential function that stands at the center of this domain.
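
To give a sense of scale: over the four-letter DNA alphabet there are 4^k distinct k-mers (for example, 4^9 = 262,144 possible 9-mers), and a single sequence of length L contains L - k + 1 overlapping k-mers.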

1.2 Why big data is necessary

Building improved models that generate higher-precision results is possible thanks to big data. Companies have developed innovative approaches to market themselves and increase their sales. Big data also plays a part in the way human resources are managed or the way disasters are responded to, among many other applications that prove its importance in influencing decisions.

Due to advances in genome sequencing technology, the life sciences industry is experiencing an unprecedented growth in biomedical big data. This biomedical data is extracted and studied, helping in the creation of personalized medicine and other related applications.

Being one of the fastest-growing types of big data, genomic data allows researchers and doctors to prescribe personalized medicine. Between 100 million and 2 billion human genomes could be sequenced by the year 2025, and this sequence data would demand between 2 and 40 exabytes of storage. In comparison, all of YouTube only requires 1 to 2 exabytes a year. An exabyte is 10 to the power of 18 bytes, so 40 exabytes is a 4 followed by 19 zeros.

The major downside is the increased cost of analysis for such massive volumes, which could take up to 10 thousand trillion CPU hours in total.

Personalized medicine has allowed patients, regardless of the specific type or stage of cancer, to go from receiving the same treatment to having unique prescriptions based on their genomic information. By investing in big data research, companies enable the development of better ways of studying large-scale data in order to find solutions tailored for each individual patient, thus proving its effectiveness.

A big challenge in biomedical big data applications, as in many other fields, is how to integrate many types of data sources to gain further insight into the problem.

1.3 What software is needed

For this research we will use Microsoft Azure. HDInsight is an Apache Hadoop implementation that runs in globally distributed Microsoft datacenters. It is a service that allows you to easily build a Hadoop cluster in minutes when you need it, and tear it down after you run your MapReduce jobs. There are a couple of key value propositions of HDInsight. The first is that it is 100 percent Apache-based, not a special Microsoft version, meaning that as Hadoop evolves, Microsoft will embrace the newer versions. Moreover, Microsoft is a major contributor to the Hadoop/Apache project and has contributed a great deal of its query optimization know-how to the query tooling, Hive.

The second compelling aspect of HDInsight is that it works seamlessly with Windows Azure Blobs, a mechanism for storing large amounts of unstructured data that can be accessed from anywhere in the world via HTTP or HTTPS. HDInsight also makes it possible to persist the metadata of table definitions in SQL Server, so that when the cluster is shut down, you do not have to re-create your data models from scratch.

Hadoop is a data processing tool that has revolutionized the world of computer science and is one of the most widely used technologies in the big data domain.

MapReduce, Google’s solution for processing big data, is also used. Google was the first to truly experience the “big data tsunami” while indexing a huge amount of webpage data in a short amount of time. Hadoop’s MapReduce is a software framework written in Java, created to run over a cluster of machines in a distributed way.

The GFS (Google File System) runs on a cluster of thousands of computers, where data is split into smaller pieces and distributed. A parallelized programming API called MapReduce is used to scatter the computations to the location of the data (Map) and to gather the results at the end (Reduce).

Hadoop is used by leading companies such as Facebook and Twitter. It is an open-source implementation of Google’s solution, made up of MapReduce and the Hadoop Distributed File System. It relies on a strategy of co-locating data and processing in order to increase performance.

A private infrastructure can be used to run Hadoop clusters, although public services such as Amazon Elastic MapReduce have recently grown in popularity. These allow users to efficiently manipulate significant data sets and to apply various techniques to analyze the data, such as data mining, machine learning and statistical analysis.

A downside of this approach is that programming a Hadoop cluster is no easy task, since developing parallelized programs requires a deep understanding of Java.

Using big data platforms and analytics in the genomics domain of life sciences has the potential to transform and save lives, making this type of research an ideal use case for big data technologies: they can have a significantly positive impact on humankind.

Fig. 1 depicts the breadth and depth of Hadoop support in the Windows Azure platform.

To connect to the newly created cluster, I have used PuTTY, a free and open-source terminal emulator, serial console and network file transfer application. It supports several network protocols, including SCP, SSH, Telnet, rlogin, and raw socket connections, and it can also connect to a serial port. The connection type for this project is SSH.
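
For an HDInsight cluster, the SSH session is opened against the cluster's SSH endpoint; assuming placeholder user and cluster names, the connection could be started from the command line roughly as:

putty -ssh sshuser@mycluster-ssh.azurehdinsight.net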

1.4 What are the benefits of modeling the system with big data technology?

An important tool in processing genomics-related big data is cloud computing. This technology offers a scalable and cost-efficient solution. The NIST (National Institute of Standards and Technology) has described cloud computing as ‘‘a pay-per-use model for enabling convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction’’. While concepts like distributed systems, grid computing and parallelized programming no longer qualify as new, virtualization technology is one of the first enablers of the cloud.

Virtualization has allowed the cloud business model to evolve, enabling widespread rollout. A single physical machine can now host several virtual machines, thus creating the conditions for maximum hardware utilization. A virtual machine is an emulation of a computer system: virtual machines are based on computer architectures and provide the functionality of a physical computer, and their implementations may involve specialized hardware, software, or a combination of both.

A Hypervisor, a virtualization management layer, translates the requests from the VM to the
underlying hardware (CPU, memory, hard disks and network connectivity).

2 The methodology

I have mentioned the concept of a “k-mer” in a previous section. A k-mer is a substring of length K (K > 0), and counting the occurrences of all such substrings is a central step in many analyses of DNA sequence data. Counting K-mers for a DNA sequence means finding the frequencies of all K-mers across the entire sequence. In bioinformatics, K-mer counting is used for genome and transcriptome assembly, metagenomic sequencing, and error correction of sequence reads. Although simple in principle, K-mer counting is a big data challenge, since a single DNA sample can contain several billion DNA sequences. The K-mer counting problem has been defined by the Schatz lab as follows:

– Map: input -> (key, value) pairs
– Shuffle: group together pairs with the same key
– Reduce: (key, list of values) -> output
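
As an illustration of this formulation (and not the Spark-based code used later in this paper), a minimal Hadoop mapper and reducer for K-mer counting might be sketched as follows; the class names and the "kmer.length" configuration key are assumptions, and the job driver that wires them together is omitted:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class KmerMapReduce {

    // Map: for each input line (a DNA sequence), emit (k-mer, 1) for every
    // overlapping substring of length k
    public static class KmerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);

        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            int k = ctx.getConfiguration().getInt("kmer.length", 3);  // assumed config key
            String seq = line.toString();
            for (int i = 0; i <= seq.length() - k; i++) {
                ctx.write(new Text(seq.substring(i, i + k)), ONE);
            }
        }
    }

    // Reduce: after the shuffle has grouped the pairs by k-mer, sum the ones
    // to obtain the frequency of each k-mer
    public static class KmerReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text kmer, Iterable<IntWritable> counts, Context ctx)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable c : counts) {
                sum += c.get();
            }
            ctx.write(kmer, new IntWritable(sum));
        }
    }
}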

2.1 What is the system meant to do?

The system is meant to discover all K-mers for a given K > 0 and to find the top N K-mers for a given N > 0.

Fig. 2. Map, Shuffle & Reduce All Run in Parallel

2.2 Research questions and how we address these

Decomposing a sequence into its k-mers for analysis allows this set of fixed-size chunks to be analysed rather than the full sequence, which can be more efficient. K-mers are very useful in sequence matching (string matching with n-grams has a rich history), and set operations on them are faster and easier, with many readily available algorithms and techniques to work with them. A simple example: to check whether a sequence S comes from organism A or from organism B, assuming the genomes of A and B are known and sufficiently different, we can check whether S contains more k-mers present in A or in B; indeed, there are many tools that do just that.

Basically, using k-mers reduces much of bioinformatics to counting and to comparing whether things are present or not.
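
A toy illustration of the A-versus-B check described above (the sequences below are made up, and real tools use far more sophisticated k-mer indexes):

import java.util.HashSet;
import java.util.Set;

public class KmerMembership {

    // collect the set of all overlapping k-mers of a sequence
    static Set<String> kmers(String seq, int k) {
        Set<String> out = new HashSet<>();
        for (int i = 0; i <= seq.length() - k; i++) {
            out.add(seq.substring(i, i + k));
        }
        return out;
    }

    public static void main(String[] args) {
        int k = 3;
        Set<String> genomeA = kmers("ACGTACGTGGCC", k);  // toy "genome" of organism A
        Set<String> genomeB = kmers("TTAGGCATTAGC", k);  // toy "genome" of organism B
        String s = "ACGTACG";                            // query sequence S

        long sharedWithA = kmers(s, k).stream().filter(genomeA::contains).count();
        long sharedWithB = kmers(s, k).stream().filter(genomeB::contains).count();
        System.out.println("k-mers of S found in A: " + sharedWithA + ", in B: " + sharedWithB);
    }
}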
The question that I will try to answer in this example is: What are the top 10 most frequently
occurring 9-mers in E. coli?
I have accessed sample data to test this k-mer counting solution from http://bit.ly/e_coli_genome.


2.3 The architecture

In order to implement an algorithm for k-mer counting, I have created a Hadoop cluster using
Microsoft Azure.
This cluster uses 2 master nodes and 2 worker nodes, for a total of 4 nodes.

Fig. 3. Azure HDInsight Cluster configuration

2.4 Diagrams, pseudocode and workflow

The workflow of the k-mer MapReduce is illustrated below:

– FASTQ file
– Filter redundant records -> DNA sequences
– map(): emit (key = k-mer, value = 1)
– reduce(): produce (key = k-mer, value = frequency)
– mapPartition(): find the local top N within each partition (run in parallel)
– Find the final top N from the local results
– Top N k-mers

Fig. 4. Workflow

The pseudocode for this implementation is as follows:

Fig. 5. K-mer high-level solution in Spark
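
The figure is a screenshot and is not reproduced here. As a stand-in, below is a minimal sketch of such a driver, assuming the Spark 2.x Java API and consistent with the steps described in Sections 3 and 4; the class name, the filtering predicate and the simplified Top N step are illustrative assumptions, not the exact code behind the figure.

import java.util.ArrayList;
import java.util.List;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class KmerCountDriver {
    public static void main(String[] args) {
        String fastqPath = args[0];               // input FASTQ file stored in HDFS
        final int k = Integer.parseInt(args[1]);  // K > 0
        final int n = Integer.parseInt(args[2]);  // top N, N > 0

        JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("kmer-count"));

        // Step 5: read the FASTQ file; every line becomes one String record
        JavaRDD<String> records = sc.textFile(fastqPath);

        // Step 6: keep only lines that look like DNA sequences
        // (drop read-name, '+' separator and comment lines)
        JavaRDD<String> sequences = records.filter(
                line -> !line.isEmpty()
                        && !line.startsWith("@")
                        && !line.startsWith("+")
                        && !line.startsWith(";")
                        && line.matches("[ACGTNacgtn]+"));

        // Step 7: emit (k-mer, 1) for every overlapping window of length k
        JavaPairRDD<String, Integer> kmerOnes = sequences.flatMapToPair(seq -> {
            List<Tuple2<String, Integer>> pairs = new ArrayList<>();
            for (int i = 0; i <= seq.length() - k; i++) {
                pairs.add(new Tuple2<>(seq.substring(i, i + k), 1));
            }
            return pairs.iterator();
        });

        // Step 8: sum the 1s to obtain the frequency of each k-mer
        JavaPairRDD<String, Integer> kmerCounts = kmerOnes.reduceByKey(Integer::sum);

        // Top N, simplified here by sorting on frequency; the per-partition
        // Top N pattern of Fig. 4 is sketched separately in Section 3
        JavaPairRDD<Integer, String> byFrequency = kmerCounts.mapToPair(
                t -> new Tuple2<>(t._2(), t._1()));          // swap to (frequency, k-mer)
        List<Tuple2<Integer, String>> topN = byFrequency.sortByKey(false).take(n);

        topN.forEach(t -> System.out.println(t._1() + " " + t._2()));
        sc.stop();
    }
}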

3 Implementation

Conceptually, K-mer counting using MapReduce is similar to the “word count” program, but since there are no spaces in the human genome, we will count overlapping K-mers instead of discrete words.
If the genome sequence is CACACACAGT and K=3, then we are counting 3-mers, and the map() function (see example below) will output the following key-value pairs:

CAC 1
ACA 1
CAC 1
ACA 1
CAC 1
ACA 1
CAG 1
AGT 1

The sort and shuffle phase will sort the output of map() so that the same keys are grouped
together as follows:

ACA 1
ACA 1
ACA 1
CAC 1
CAC 1
CAC 1
CAG 1
AGT 1

Finally, the reduce() function will output:

ACA 3
CAC 3
CAG 1
AGT 1

Our Spark solution is implemented in a single Java driver class, thanks to the high abstraction level of the Spark API. The Spark solution reads FASTQ files as input and converts them to a JavaRDD. Next, we filter out the redundant records and keep only the sequence line out of every four records. At this point, we have only the proper sequences from the input files. Then we generate the K-mers and find their frequencies. Finally, we apply the Top N design pattern in descending order.

Assume that we want to discover all K-mers (for a given K > 0) and the top N (for a given N > 0) for a set of FASTQ files. Since the FASTQ file format is very well defined, first we create a JavaRDD for the given FASTQ file. Next, we remove the records that are not sequences (those similar to lines 1, 3, and 4 of the aforementioned input data). This filtering is implemented by the JavaRDD.filter() function. Once we have only sequences, we create (K, 1) pairs, where K is a K-mer. Then, we find the frequency of each K-mer. Finally, we can find the top N K-mers for N > 0. Finding the top N is simple: we assume that the (K2, V2) pairs are partitioned (K2 is a K-mer and V2 is the frequency of K2), and we map each partition into its local top N. Once we have a top N list (comprising one top N from each partition), we can do the final reduction to find the final top N.
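
Continuing from the kmerCounts pair RDD in the sketch of Section 2.4, this per-partition Top N step could look roughly as follows (a sketch only: as written, two k-mers with the same frequency would overwrite each other in the TreeMap, which a real implementation would have to handle):

        // additional imports needed for this fragment:
        // java.util.Collections, java.util.Map, java.util.TreeMap

        // each partition keeps only its local top N as a (frequency -> k-mer) map
        JavaRDD<TreeMap<Integer, String>> localTopN = kmerCounts.mapPartitions(iter -> {
            TreeMap<Integer, String> top = new TreeMap<>();
            while (iter.hasNext()) {
                Tuple2<String, Integer> kv = iter.next();
                top.put(kv._2(), kv._1());
                if (top.size() > n) {
                    top.remove(top.firstKey());   // evict the lowest frequency
                }
            }
            return Collections.singletonList(top).iterator();
        });

        // final reduction on the driver: merge the per-partition maps into one top N
        TreeMap<Integer, String> finalTopN = new TreeMap<>();
        for (TreeMap<Integer, String> partial : localTopN.collect()) {
            for (Map.Entry<Integer, String> e : partial.entrySet()) {
                finalTopN.put(e.getKey(), e.getValue());
                if (finalTopN.size() > n) {
                    finalTopN.remove(finalTopN.firstKey());
                }
            }
        }
        // finalTopN now holds the overall top N, ordered by ascending frequency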
We have three inputs to our Spark program:

• The FASTQ file stored in HDFS

• K > 0 to find K-mers

• N > 0 to find top N K-mers
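
With these inputs, a hypothetical submission of such a driver to the cluster (the jar, class and path names are placeholders matching the sketch in Section 2.4) could look like:

spark-submit --class KmerCountDriver --master yarn kmer-count.jar /data/e_coli.fastq 9 10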

4 Validation

Sequencing errors cause significant difficulties for read analysis and genome assembly. The correction or elimination of erroneous reads is made possible by the redundancy due to high sequencing coverage. The solution is to monitor the number of occurrences of all k-mers within the reads to see whether they conform to the expected coverage. For the task of k-mer counting, efficient hashing techniques have been developed using parallel algorithms or Bloom filters. However, their lack of scalability hinders indexing all 27-mers of typical human sequencing data sets.

Based on the high-level code presented in section 2.4 (pseudocode), I am attaching screenshots of a few of the steps.
Step 5 reads the FASTQ file and creates a JavaRDD object, where each record of the FASTQ file is represented as a String object.

Fig. 6. Output for step 5 in the pseudocode

Step 6 uses a very powerful Spark API, JavaRDD.filter(), to filter out the redundant records before applying the map() function. From a FASTQ file, we keep only the records that represent DNA sequences.

Fig. 7. Output for step 6 in the pseudocode
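
Because FASTQ quality lines can occasionally look like sequence data, a more defensive variant of this step (a hypothetical alternative, continuing from the records RDD of the Section 2.4 sketch, not the code behind Fig. 7) keeps exactly the second line of every four-line record:

        // keep only line 2 of every 4-line FASTQ record (0-based line index % 4 == 1)
        JavaRDD<String> dnaLines = records.zipWithIndex()
                .filter(t -> t._2() % 4 == 1)
                .map(t -> t._1());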

Step 7 implements the mapper for K-mers by using the JavaRDD.flatMapToPair() function. The
mapper accepts a sequence and K (the size of K-mers) and then generates all (kmer, 1) pairs.

Fig. 8. Output for step 7 in the pseudocode

Step 8 implements the reducer for K-mers by using the JavaPairRDD.reduceByKey() function. The reducer accepts a key (a K-mer) and its values (the frequencies of that K-mer) and then generates a final count for each K-mer.

Fig. 9. Output for step 8 in the pseudocode

The final top 5 list (in descending order) is presented here:

=== top 5 ===
13 GGG
12 TGG
11 CCC
9 GGC
8 TTC

5 Conclusion
