
Using big data for genomic medicine

Fodor, Andreea
[Email address]

1 Introduction

In recent times, bioinformatics has begun adopting a distributed approach based on the MapReduce programming model, mainly because of the significant quantity of data generated by modern sequencing techniques. Despite this, using MapReduce and related Big Data technologies and platforms (for example Apache Hadoop and Spark) often fails to generate satisfying results, falling short in both efficiency and effectiveness. In this paper I will elaborate on how the development of distributed and Big Data management technologies has impacted the study of large datasets of sequences. Furthermore, I will show how tuning the configuration parameters can be imperative in order to obtain improved performance, with emphasis on large quantities of data.

1.1 What is the problem

The growth of the available datasets has become too fast for the existing algorithms that are typically used for the study of biological sequences. Because of this, a distributed computational approach to processing big data is becoming more and more popular among researchers as a way to solve issues related to large amounts of data.

Genetic disease diagnosis and risk factor prediction have received a tremendous amount of help from genome analysis. While one exome amounts to as much as 15 GB of data encoded as a FASTQ file, a whole genome can reach a size of up to 1 TB. Advanced variant discovery and interpretation processes may take up to 10 hours to process one exome, depending on the type of analysis. As whole-genome sequencing becomes economically accessible to the general population, personalized medicine will require proportionally scalable variant analysis solutions.

For some types of variations, the variant discovery process is a pipeline in which data flows through a series of thoroughly studied stages, from the initial reads coming off the sequencing machine to a set of variants interpreted by a clinician.

In his research paper “Multiple comparative metagenomics using multiset k-mer counting”, Gaëtan Benoit raises the concern that the fundamental domain of alignment-free linguistic and informational analysis of genomic and proteomic sequences has received little attention in this context. It is therefore important to note that the collection of k-mer statistics (i.e., how many times each sequence of length k over a finite alphabet appears in a set of biological sequences, at a genomic scale) is an essential function that stands at the center of this domain.
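
To give a sense of scale: over the four-letter DNA alphabet there are 4^k distinct k-mers (for example, 4^9 = 262,144 possible 9-mers), and a single sequence of length L contains L - k + 1 overlapping k-mers.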

1.2 Why big data is necessary

Building improved models that generate higher-precision results is possible thanks to big data. Companies have developed innovative approaches to market themselves and increase their sales. Big data also plays a part in the way human resources are managed or the way disasters are responded to, among many other applications that prove its importance in influencing decisions.

Due to advances in genome sequencing technology, the life sciences industry is experiencing an unprecedented growth in biomedical big data. This biomedical data is extracted and studied, helping in the creation of personalized medicine and other related applications.

Being one of the fastest-growing types of big data, genomic data allows researchers and doctors to prescribe personalized medicine. Between 100 million and 2 billion human genomes could be sequenced by the year 2025, and this sequence data would demand between 2 and 40 exabytes of storage. In comparison, all of YouTube only requires 1 to 2 exabytes a year. An exabyte is 10 to the power of 18 bytes, so 40 exabytes is a 4 followed by 19 zeros.

The major downside is the increased cost of analysis for such massive volumes, which could take up to 10 thousand trillion CPU hours in total.

Personalized medicine has allowed patients, regardless of the specific type or stage of cancer, to go from receiving the same treatment to having unique prescriptions based on their genomic information. By investing in big data research, companies enable the development of better ways of studying large-scale data in order to find solutions tailored for each individual patient, thus proving its effectiveness.

A big challenge in biomedical big data applications, as in many other fields, is how to integrate many types of data sources to gain further insight into the problem.

1.3 What software is needed

For this research we will use Microsoft Azure. HDInsight is an Apache Hadoop implementation that runs in globally distributed Microsoft datacenters. It is a service that allows you to easily build a Hadoop cluster in minutes when you need it, and tear it down after you run your MapReduce jobs. There are a couple of key value propositions of HDInsight. The first is that it is 100 percent Apache-based, not a special Microsoft version, meaning that as Hadoop evolves, Microsoft will embrace the newer versions. Moreover, Microsoft is a major contributor to the Hadoop/Apache project and has contributed a great deal of its query optimization know-how to the query tooling, Hive.

The second compelling aspect of HDInsight is that it works seamlessly with Windows Azure Blobs, a mechanism for storing large amounts of unstructured data that can be accessed from anywhere in the world via HTTP or HTTPS. HDInsight also makes it possible to persist the metadata of table definitions in SQL Server, so that when the cluster is shut down, you do not have to re-create your data models from scratch.

Hadoop is a data processing tool that has revolutionized the world of computer science and is one of the most widely used technologies in the big data domain.

MapReduce, Google’s solution for processing big data, is also used. Google was the first to truly experience the “big data tsunami” while indexing a huge amount of webpage data in a short amount of time. Hadoop’s MapReduce is a software framework written in Java, created to run over a cluster of machines in a distributed way.

The GFS (Google File System) runs on a cluster of thousands of computers, where data is split into smaller pieces and distributed. A parallelized programming API called MapReduce is used to scatter the computations to the location of the data (Map) and to gather the results at the end (Reduce).

Hadoop is used by leading companies such as Facebook and Twitter. It is an open-source implementation of Google’s solution, made up of MapReduce and the Hadoop Distributed File System. It relies on a strategy of co-locating data and processing in order to increase performance.

A private infrastructure can be used to run Hadoop clusters, although public services such as Amazon Elastic MapReduce have recently grown in popularity. These allow users to efficiently manipulate significant data sets and to apply various techniques to analyze the data, such as data mining, machine learning and statistical analysis.

A downside of this approach is that programming a Hadoop cluster is no easy task, since developing parallelized programs requires a deep understanding of Java.

Using big data platforms and analytics in the genomics domain of life sciences has the potential to transform and save lives, making this type of research an ideal use case for big data technologies: they can have a significantly positive impact on humankind.

Fig. 1 depicts the breadth and depth of Hadoop support in the Windows Azure platform.

To connect to the newly created cluster, I have used PuTTY, a free and open-source terminal emulator, serial console and network file transfer application. It supports several network protocols, including SCP, SSH, Telnet, rlogin, and raw socket connections, and it can also connect to a serial port. The connection type for this project is SSH.
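
For an HDInsight cluster, the SSH session is opened against the cluster's SSH endpoint; assuming placeholder user and cluster names, the connection could be started from the command line roughly as:

putty -ssh sshuser@mycluster-ssh.azurehdinsight.net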

1.4 What are the benefits of modeling the system with big data technology?

An important tool in processing genomics-related big data is cloud computing. This technology offers a scalable and cost-efficient solution. The NIST (National Institute of Standards and Technology) has described cloud computing as ‘‘a pay-per-use model for enabling convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction’’. While concepts like distributed systems, grid computing and parallelized programming no longer qualify as new, virtualization technology is one of the first enablers of the cloud.

Virtualization has allowed the cloud business model to evolve, enabling widespread rollout. A single physical machine can now host several virtual machines, thus creating the conditions for maximum hardware utilization. A virtual machine is an emulation of a computer system: virtual machines are based on computer architectures and provide the functionality of a physical computer, and their implementations may involve specialized hardware, software, or a combination of both.

A Hypervisor, a virtualization management layer, translates the requests from the VM to the
underlying hardware (CPU, memory, hard disks and network connectivity).

2 The methodology

I have mentioned the concept of a “k-mer” in a previous section. A k-mer is a substring of length K (K > 0), and counting the occurrences of all such substrings is a central step in many analyses of DNA sequence data. Counting K-mers for a DNA sequence means finding the frequencies of all K-mers across the entire sequence. In bioinformatics, K-mer counting is used for genome and transcriptome assembly, metagenomic sequencing, and error correction of sequence reads. Although simple in principle, K-mer counting is a big data challenge, since a single DNA sample can contain several billion DNA sequences. The K-mer counting problem has been defined by the Schatz lab as follows:

– Map: input -> (key, value) pairs
– Shuffle: group together pairs with the same key
– Reduce: (key, list of values) -> output
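
As an illustration of this formulation (and not the Spark-based code used later in this paper), a minimal Hadoop mapper and reducer for K-mer counting might be sketched as follows; the class names and the "kmer.length" configuration key are assumptions, and the job driver that wires them together is omitted:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class KmerMapReduce {

    // Map: for each input line (a DNA sequence), emit (k-mer, 1) for every
    // overlapping substring of length k
    public static class KmerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);

        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            int k = ctx.getConfiguration().getInt("kmer.length", 3);  // assumed config key
            String seq = line.toString();
            for (int i = 0; i <= seq.length() - k; i++) {
                ctx.write(new Text(seq.substring(i, i + k)), ONE);
            }
        }
    }

    // Reduce: after the shuffle has grouped the pairs by k-mer, sum the ones
    // to obtain the frequency of each k-mer
    public static class KmerReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text kmer, Iterable<IntWritable> counts, Context ctx)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable c : counts) {
                sum += c.get();
            }
            ctx.write(kmer, new IntWritable(sum));
        }
    }
}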

2.1 What is the system meant to do?

The system is meant to discover all K-mers for a given K > 0 and to find the top N K-mers for a given N > 0.

Fig. 2. Map, Shuffle & Reduce All Run in Parallel

2.2 Research questions and how we address these

Decomposing a sequence into its k-mers for analysis allows this set of fixed-size chunks to be analysed rather than the full sequence, which can be more efficient. K-mers are very useful in sequence matching (string matching with n-grams has a rich history), and set operations on them are faster and easier, with many readily available algorithms and techniques to work with them. A simple example: to check whether a sequence S comes from organism A or from organism B, assuming the genomes of A and B are known and sufficiently different, we can check whether S contains more k-mers present in A or in B; indeed, there are many tools that do just that.

Basically, using k-mers reduces much of bioinformatics to counting and to comparing whether things are present or not.
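
A toy illustration of the A-versus-B check described above (the sequences below are made up, and real tools use far more sophisticated k-mer indexes):

import java.util.HashSet;
import java.util.Set;

public class KmerMembership {

    // collect the set of all overlapping k-mers of a sequence
    static Set<String> kmers(String seq, int k) {
        Set<String> out = new HashSet<>();
        for (int i = 0; i <= seq.length() - k; i++) {
            out.add(seq.substring(i, i + k));
        }
        return out;
    }

    public static void main(String[] args) {
        int k = 3;
        Set<String> genomeA = kmers("ACGTACGTGGCC", k);  // toy "genome" of organism A
        Set<String> genomeB = kmers("TTAGGCATTAGC", k);  // toy "genome" of organism B
        String s = "ACGTACG";                            // query sequence S

        long sharedWithA = kmers(s, k).stream().filter(genomeA::contains).count();
        long sharedWithB = kmers(s, k).stream().filter(genomeB::contains).count();
        System.out.println("k-mers of S found in A: " + sharedWithA + ", in B: " + sharedWithB);
    }
}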
The question that I will try to answer in this example is: What are the top 10 most frequently
occurring 9-mers in E. coli?
I have accessed sample data to test this k-mer counting solution from http://bit.ly/e_coli_genome.


2.3 The architecture

In order to implement an algorithm for k-mer counting, I have created a Hadoop cluster using
Microsoft Azure.
This cluster uses 2 master nodes and 2 worker nodes, for a total of 4 nodes.

Fig. 3. Azure HDInsight Cluster configuration

2.4 Diagrams, pseudocode and workflow

The workflow of the k-mer MapReduce is illustrated below:

– FASTQ file
– Filter redundant records -> DNA sequences
– map(): emit (key = k-mer, value = 1)
– reduce(): produce (key = k-mer, value = frequency)
– mapPartition(): find the local top N within each partition (run in parallel)
– Find the final top N from the local results
– Top N k-mers

Fig. 4. Workflow

The pseudocode for this implementation is as follows:

Fig. 5. K-mer high-level solution in Spark
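
The figure is a screenshot and is not reproduced here. As a stand-in, below is a minimal sketch of such a driver, assuming the Spark 2.x Java API and consistent with the steps described in Sections 3 and 4; the class name, the filtering predicate and the simplified Top N step are illustrative assumptions, not the exact code behind the figure.

import java.util.ArrayList;
import java.util.List;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class KmerCountDriver {
    public static void main(String[] args) {
        String fastqPath = args[0];               // input FASTQ file stored in HDFS
        final int k = Integer.parseInt(args[1]);  // K > 0
        final int n = Integer.parseInt(args[2]);  // top N, N > 0

        JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("kmer-count"));

        // Step 5: read the FASTQ file; every line becomes one String record
        JavaRDD<String> records = sc.textFile(fastqPath);

        // Step 6: keep only lines that look like DNA sequences
        // (drop read-name, '+' separator and comment lines)
        JavaRDD<String> sequences = records.filter(
                line -> !line.isEmpty()
                        && !line.startsWith("@")
                        && !line.startsWith("+")
                        && !line.startsWith(";")
                        && line.matches("[ACGTNacgtn]+"));

        // Step 7: emit (k-mer, 1) for every overlapping window of length k
        JavaPairRDD<String, Integer> kmerOnes = sequences.flatMapToPair(seq -> {
            List<Tuple2<String, Integer>> pairs = new ArrayList<>();
            for (int i = 0; i <= seq.length() - k; i++) {
                pairs.add(new Tuple2<>(seq.substring(i, i + k), 1));
            }
            return pairs.iterator();
        });

        // Step 8: sum the 1s to obtain the frequency of each k-mer
        JavaPairRDD<String, Integer> kmerCounts = kmerOnes.reduceByKey(Integer::sum);

        // Top N, simplified here by sorting on frequency; the per-partition
        // Top N pattern of Fig. 4 is sketched separately in Section 3
        JavaPairRDD<Integer, String> byFrequency = kmerCounts.mapToPair(
                t -> new Tuple2<>(t._2(), t._1()));          // swap to (frequency, k-mer)
        List<Tuple2<Integer, String>> topN = byFrequency.sortByKey(false).take(n);

        topN.forEach(t -> System.out.println(t._1() + " " + t._2()));
        sc.stop();
    }
}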

3 Implementation

Conceptually, K-mer counting using MapReduce is similar to the “word count” program, but since there are no spaces in the human genome, we will count overlapping K-mers instead of discrete words.
If the genome sequence is CACACACAGT and K=3, then we are counting 3-mers, and the map() function (see example below) will output the following key-value pairs:

CAC 1
ACA 1
CAC 1
ACA 1
CAC 1
ACA 1
CAG 1
AGT 1

The sort and shuffle phase will sort the output of map() so that the same keys are grouped
together as follows:

ACA 1
ACA 1
ACA 1
CAC 1
CAC 1
CAC 1
CAG 1
AGT 1

Finally, the reduce() function will output:

ACA 3
CAC 3
CAG 1
AGT 1

Our Spark solution is implemented in a single Java driver class, thanks to the high abstraction level of the Spark API. The Spark solution reads FASTQ files as input and converts them to a JavaRDD. Next, we filter out the redundant records and keep only the sequence line out of every four records. At this point, we have only the proper sequences from the input files. Then we generate the K-mers and find their frequencies. Finally, we apply the Top N design pattern in descending order.

Assume that we want to discover all K-mers (for a given K > 0) and the top N (for a given N > 0) for a set of FASTQ files. Since the FASTQ file format is very well defined, first we create a JavaRDD for the given FASTQ file. Next, we remove the records that are not sequences (those similar to lines 1, 3, and 4 of the aforementioned input data). This filtering is implemented by the JavaRDD.filter() function. Once we have only sequences, we create (K, 1) pairs, where K is a K-mer. Then, we find the frequency of each K-mer. Finally, we can find the top N K-mers for N > 0. Finding the top N is simple: we assume that the (K2, V2) pairs are partitioned (K2 is a K-mer and V2 is the frequency of K2), and we map each partition into its local top N. Once we have a top N list (comprising one top N from each partition), we can do the final reduction to find the final top N.
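
Continuing from the kmerCounts pair RDD in the sketch of Section 2.4, this per-partition Top N step could look roughly as follows (a sketch only: as written, two k-mers with the same frequency would overwrite each other in the TreeMap, which a real implementation would have to handle):

        // additional imports needed for this fragment:
        // java.util.Collections, java.util.Map, java.util.TreeMap

        // each partition keeps only its local top N as a (frequency -> k-mer) map
        JavaRDD<TreeMap<Integer, String>> localTopN = kmerCounts.mapPartitions(iter -> {
            TreeMap<Integer, String> top = new TreeMap<>();
            while (iter.hasNext()) {
                Tuple2<String, Integer> kv = iter.next();
                top.put(kv._2(), kv._1());
                if (top.size() > n) {
                    top.remove(top.firstKey());   // evict the lowest frequency
                }
            }
            return Collections.singletonList(top).iterator();
        });

        // final reduction on the driver: merge the per-partition maps into one top N
        TreeMap<Integer, String> finalTopN = new TreeMap<>();
        for (TreeMap<Integer, String> partial : localTopN.collect()) {
            for (Map.Entry<Integer, String> e : partial.entrySet()) {
                finalTopN.put(e.getKey(), e.getValue());
                if (finalTopN.size() > n) {
                    finalTopN.remove(finalTopN.firstKey());
                }
            }
        }
        // finalTopN now holds the overall top N, ordered by ascending frequency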
We have three inputs to our Spark program:

• The FASTQ file stored in HDFS

• K > 0 to find K-mers

• N > 0 to find top N K-mers
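
With these inputs, a hypothetical submission of such a driver to the cluster (the jar, class and path names are placeholders matching the sketch in Section 2.4) could look like:

spark-submit --class KmerCountDriver --master yarn kmer-count.jar /data/e_coli.fastq 9 10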

4 Validation

Sequencing errors cause significant difficulties for read analysis and genome assembly. The correction or elimination of erroneous reads is made possible by the redundancy due to high sequencing coverage. The solution is to monitor the number of occurrences of all k-mers within the reads to see whether they conform to the expected coverage. For the task of k-mer counting, efficient hashing techniques have been developed using parallel algorithms or Bloom filters. However, their lack of scalability hinders indexing all 27-mers of typical human sequencing data sets.

Based on the high-level code presented in section 2.4 (pseudocode), I am attaching screenshots of a few of the steps.
Step 5 reads the FASTQ file and creates a JavaRDD object, where each record of the FASTQ file is represented as a String object.

Fig. 6. Output for step 5 in the pseudocode

Step 6 uses a very powerful Spark API, JavaRDD.filter(), to filter out the redundant records before applying the map() function. From a FASTQ file, we keep only the records that represent DNA sequences.

Fig. 7. Output for step 6 in the pseudocode
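
Because FASTQ quality lines can occasionally look like sequence data, a more defensive variant of this step (a hypothetical alternative, continuing from the records RDD of the Section 2.4 sketch, not the code behind Fig. 7) keeps exactly the second line of every four-line record:

        // keep only line 2 of every 4-line FASTQ record (0-based line index % 4 == 1)
        JavaRDD<String> dnaLines = records.zipWithIndex()
                .filter(t -> t._2() % 4 == 1)
                .map(t -> t._1());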

Step 7 implements the mapper for K-mers by using the JavaRDD.flatMapToPair() function. The
mapper accepts a sequence and K (the size of K-mers) and then generates all (kmer, 1) pairs.

Fig. 8. Output for step 7 in the pseudocode

Step 8 implements the reducer for K-mers by using the JavaPairRDD.reduceByKey() function. The reducer accepts a key (a K-mer) and its values (the frequencies of that K-mer) and then generates a final count for each K-mer.

Fig. 9. Output for step 8 in the pseudocode

The final top 5 list (in descending order) is presented here:

=== top 5 ===
13 GGG
12 TGG
11 CCC
9 GGC
8 TTC

5 Conclusion
