SlideShare a Scribd company logo
Indonesian Journal of Electrical Engineering and Computer Science
Vol. 25, No. 2, February 2022, pp. 1078~1086
ISSN: 2502-4752, DOI: 10.11591/ijeecs.v25.i2.pp1078-1086  1078
Journal homepage: http://ijeecs.iaescore.com
An efficient and robust parallel scheduler for bioinformatics
applications in a public cloud: A bigdata approach
Leena Ammanna1
, Jagadeeshgowda2
, Jagadeesh Pujari1
1
Department of Information Science and Engineering, SDM College of Engineering and Technology, Dharwad, India
2
Department of Computer Science and Engineering, Shri Krishna Institute of Technology, Bengaluru, India
Article Info ABSTRACT
Article history:
Received Jul 7, 2021
Revised Dec 21, 2021
Accepted Dec 29, 2021
In bioinformatics, genomic sequence alignment is a simple method for
handling and analysing data, and it is one of the most important applications
in determining the structure and function of protein sequences and nucleic
acids. The basic local alignment search tool (BLAST) algorithm, which is
one of the most frequently used local sequence alignment algorithms, is
covered in detail here. Currently, the NCBI's BLAST algorithm (stand-
alone) is unable to handle biological data in the terabytes. To address this
problem, a variety of schedulers have been proposed. Existing sequencing
approaches are based on the Hadoop MapReduce (MR) framework, which
enables a diverse set of applications and employs a serial execution strategy
that takes a long time and consumes a lot of computing resources. The
author, improves the BLAST algorithm based on the BLAST-BSPMR
algorithm to achieve the BLAST algorithm. To address the issue with
Hadoop's MapReduce framework, a customised MapReduce framework is
developed on the Azure cloud platform. The experiment findings indicate
that the suggested bulk synchronous parallel MapReduce-basic local
alignment search tool (BSPMR-BLAST) algorithm matches bioinformatics
genomic sequences more quickly than the existing Hadoop-BLAST method,
and that the proposed customised scheduler is extremely stable and scalable.
Keywords:
Bioinformatics genomic
sequence
Hadoop
Microsoft azure
Scheduler parallelization
This is an open access article under the CC BY-SA license.
Corresponding Author:
Jagadeeshgowda
Department of Computer Science and Engineering, Shri Krishna Institute of Technology
Bengaluru, India
Email: hodcs.skitblr@gmail.com
1. INTRODUCTION
According to research [1], [2], cloud computing has established itself as the future norm for data-
intensive computing. According to [3], [4] cloud computing platforms allow on-demand access to shared,
scalable, fault-tolerant, and reconfigurable computing resources with little administrative work and a low
cost. In comparison to freestanding private computer clusters [5], [6] demonstrate how cloud computing
platforms are a commonly desired and recognised method for running large data applications or high-
performance computations (HPC). The cloud's resource management, virtual computing platforms, and
elasticity all contribute to the ease with which data-intensive applications can be migrated to the cloud
paradigm. Cai et al. [7], Okur and Büyükkeçeci [8] note that the frameworks required to enable the most
efficient use of cloud resources at the lowest possible cost for data-intensive applications remain unresolved.
Google’s MapReduce framework discussed in [9] is a commonly used approach for performing distributed
computations on the cloud. White [10] and Hadoop [11] present that implementations of Hadoop are based on
the MapReduce architecture and are capable of supporting large-scale data applications. The MapReduce
model employs a two-stage execution mechanism. The initial stage entails splitting or dividing the data to be
Indonesian J Elec Eng & Comp Sci ISSN: 2502-4752 
An efficient and robust parallel scheduler for bioinformatics … (Leena Ammanna)
1079
processed into little pieces. Each small bit of data is associated with a mapper. The mapper
outputs ⟨𝐾𝑒𝑦|𝑉𝑎𝑙𝑢𝑒⟩ pairs that are arranged according to the 𝐾𝑒𝑦values. The Reduce workers are provided
with the sorted values using the ⟨𝐾𝑒𝑦|𝑆𝑜𝑟𝑡𝑒𝑑𝐿𝑖𝑠𝑡(𝑉𝑎𝑙𝑢𝑒)⟩ method. The Hadoop distributed file system is
used to store the results of the reduce workers hadoop distributed file system (HDFS). In most public cloud
setups, the Map and Reduce workers are virtual machines (VMs).
The researchers recognised the disadvantages of Hadoop MapReduce and examined several
improvement strategies. Zhu et al. [12] discusses the use of Hadoop with CUDA to increase processing
capacity and execution speed. Zhu et al. [12] proposes a research paradigm that facilitates appropriate use of
graphics processing units (GPU). Dahiphale et al. [13] presents a cloud-based MapReduce model. Pipelining
is used to improve execution performance and to enable elastic pricing. For scientific applications [14]
proposed a framework based on parallel tempering called replica exchange statistical temperature molecular
dynamics. Tang et al. [15] describe a unique dynamic Hadoop slot allocation technique for resolving the
Hadoop resource provisioning problem that is not optimal. The result reported in Tang et al. [15] demonstrates
that their technique significantly improves Hadoop's performance when handling many MapReduce jobs. The
most often utilised technique is to increase execution effectiveness through scheduling practises, as described
in [16]-[19].
Wang et al. [20] presented a method for computing "Imprecise Applications" using MR frameworks.
Reduce is run after the Map step in predictable MapReduce applications. In the case of inaccurate applications,
the Reduce stage might be initiated using the Map stage's partial findings. Applications such as word
frequency statistics and hot-word identification are considered imprecise. Executing inaccurate apps on well-
known machine learning frameworks like Hadoop introduces latency. Running erroneous apps might result in
additional expenditures on public cloud systems, where users are charged for all computations and storage
services consumed. To address this shortcoming incorporated the check phase into the MapReduce framework
in order to reduce costs and execution latencies [20].
In MR frameworks, iterative applications and certain graph-based applications performed poorly.
Google introduced the Pregel framework for cloud computations in [21] for similar applications. Pregel is built
on valiant's bulk synchronous parallel (BSP) computation model [22]. Kajdanowicz et al. [23] demonstrate
that the Pregel framework outperforms MapReduce when calculating graph-based applications. Apart from the
serial execution mechanism utilised in MR, several concerns remain unresolved, including node failure
management capabilities, scheduling strategies, and multiway join processes.
The authors provide a framework for parallel computation MapReduce (PMR) in public cloud
environments in this article. The MapReduce architecture of the BLAST-BSPMR enables parallel
computation to accelerate the execution of the BSP model. The BLAST-BSPMR implements Map and
Reduce workers by utilising Microsoft Azure VM computing environments. Multicore processors enable
parallel computation in these virtualized computing environments. The BLAST-BSPMR proposal leverages
this parallel execution feature to drastically lower the Map and Reduce worker nodes' computation times. In
comparison to other MapReduce frameworks like Hadoop, the BLAST-Reduce BSPMR's phase is initiated
when two or more worker nodes have completed their jobs. The BLAST-BSPMR function shown here is
used to calculate the results of bioinformatics BLASTx applications.
Several strategies and scheduler are proposed by researchers to improve the performance of the
MapReduce cloud computing framework proposed by Google in Dean et al. [9]. The work presented in Zhu,
et al. [12] endures the closest similarity to our work presented here. Using CUDA codes the Map and Reduce
worker tasks execute parallely on the GPU’s is achieved. The integration of Hadoop and GPU is achieved
using the Hadoop Streaming, Pipes, JCuda and JNI approaches. The experimental investigation given in Jie
Zhu et al. [12] demonstrates the effectiveness of the suggested approach on a private heterogeneous cloud
environment using the word count application. The execution efficiency of the JCuda approach over Hadoop
Streaming, Pipes and JNI is also proved. The major drawback of this approach is that such computational
models are not suited for public cloud environments as GPU based VM environments aregenerally not offered.
Public cloud environments like Amazon EC2 that offer such GPU based virtualized computing environments
provide it at very high costs.
Dahiphale et al. [13] described the drawbacks of the conventional MapReduce frameworks as
followsthe MapReduce framework adopts a sequential processing of the Map and Reduce stages. The
scalability of the MapReduce is limited. The MapReduce framework provides no support for flexible pricing
options. The MapReduce model provides no support for computing streaming data. To overcome these
drawbacks a pipelined model is presented to parallelize the execution of the Map and Reduce phase. The
MapReduce model proposed in [13] is realized on the Amazon public cloud. The spot instance offering of the
Amazon cloud allows flexible pricing. The experimental study given here demonstrates the efficiency of the
pipelining-based MapReduce model when compared to the conventional MapReduce model using the word
count application. The major drawback of the model proposed in [13] is that the locality optimization is not
 ISSN: 2502-4752
Indonesian J Elec Eng & Comp Sci, Vol. 25, No. 2, February 2022: 1078-1086
1080
considered and hosting of additional data dependent applications like Smith Waterman and other
bioinformatics application cannot be executed in a pipelined fashion.
The computation of certain data intensive applications like graph applications and iterative
applications on MapReduce frameworks exhibit high computation time and cost. To support such applications,
Google presented a proprietary cloud computing framework named Pregel that embraces the 𝐵𝑆𝑃 computing
model [21]. In Pregel, the graph computations are achieved using a set of super-steps. A super step is used to
execute the user defined application or function in a parallel fashion using the data item from the database.
Each data item from the database behaves as an agent. The Pregel system adopts vertex-centric execution
strategy. The computation of each data item has a graph like representation in 𝐵𝑆𝑃. The vertexes in the Pregel
deactivate post the computation operation and are reactivated only if additional data items are presented to
them. Once all the vertices are deactivated the computation is said to be complete. The local storage of the data
items in the nodes executing the computation poses a problem. In the case the data item is large then a spilling-
to-disk technique needs to be in [23].
The MapReduce framework is been adopted by the Hadoop framework in Nguyen et al. [24] for
computing on the cloud platform. The MapReduce paradigm employs a two-phase technique. The first phase
divides the input data into little bits of data to be processed. Each small bit of data is associated with a mapper.
The mapper outputs a<Key, Value> pair that is sorted according to the Key values. Thereducer takes these
sorted values as the <Key│SortedList (Value)>. These sorted values are stored in the Hadoop Distributed File
System. Schatz [25] and CloudBurst make use of the Hadoop MapReduce framework as the computing
platform. These alignment tools perform effectively for small base pair alignments requiring single-gap or un-
gapped alignment. However, when huge base pairs are considered, these aligners perform poorly. Because all
present cloud framework sequence aligners take the Hadoop framework into account, Hadoop suffers when
iterative applications are hosted on the cloud framework. When considering multiway joins, Hadoop performs
poorly. Since Hadoop framework executes sequentially, again there is degradation in the performance.
The 𝐵𝑆𝑃 based Pregel framework exhibits lower computation time for selective applications. In this paper
the 𝐵𝐿𝐴𝑆𝑇 − 𝐵𝑆𝑃𝑀𝑅 framework incorporates the MapReduce architecture and an execution strategy is
conducted in parallel fashion as observed in the 𝐵𝑆𝑃.
2. METHOD
A cloud platform, BLAST-BSPMR is proposed that allows to apply bioinformatics application like
gene sequencing. The BLAST-BSPMR uses MapReduce framework which works on a cloud computing
environment. Azure VM is the Azure infrastructure as service (IaaS) which is used to deploy persistent VMs.
Thus, the Map and Reduce worker nodes in BLAST-BSPMR are deployed on Azure VMs. Alignment of
genomic sequences is performed using the BLASTxmethod. Here in BLASTx, a DNA query is compared to
a protein database. The BLAST-BSPMR algorithm performs sequence alignment in two stages, namely the
Map and Reduce phases. SinceHadoop framework is sequential, only when the Map phase is completed the
Reduce phase gets executed. To prevent sequential execution, BLAST-BSPMR considers parallelizing the
Map and Reduce phases. The Map and Reduce functions are meant to be executed in parallel and to make
optimal use of the cores available in the worker VMs.
2.1. Basic local alignment search tool (BLAST) algorithm
Basic local alignment search tool (BLAST) is used to compare a query sequence with the database
sequence to find similar sequences in the database. The BLAST consists of various sequence searching
programmes (including BLASTn, tBLASTn, BLASTx, BLASTp, tBLASTp, and tBLASTx). Each
application varies in its functionalities, but the core of the algorithm for all of them is basically the same. The
BLAST algorithm consists of three steps. In the initial step, called as building the word list, the query
sequence is split intosmall words of fixed size (w-mers), often w being 11. Therefore, if the length of the
query sequence is n, then list of n-w+1 words are being constructed. This step is illustrated in Figure 1. The
second stage, referred to as the scan for hits, is where the BLAST algorithm looks for matches to the words
(referred to as hits) in the database sequence. The BLAST algorithm employs the 'two-hit' strategy. If two
non-overlapping word pairs are found to be within a distance D of one another, an extension is initiated. The
next stage, dubbed extend the hits, involves the BLAST algorithm expanding the results in order to find
longer related segment pairs. When the "two-hit" requirement is met, a gap-free extension in two directions is
performed to locate an alignment known as a high-scoring segment pair (HSP). Figure 2 gives an example of
gap-free extension. This is the seed-and-extend heuristic strategy for identifying high-scoring gene sequence
alignments between the genomic query sequence and the database's genomic sequences.
Indonesian J Elec Eng & Comp Sci ISSN: 2502-4752 
An efficient and robust parallel scheduler for bioinformatics … (Leena Ammanna)
1081
Figure 1. Build word list Figure 2. Build word list
2.2. BLASTon the proposed parallelized MapReduce framework
The genomic sequence database was initially split into trivial chunks of static size and scheduled
over the virtual Azure computing nodes in the current work. Numerous computing nodes can process the
chunks in parallel. The three core BLAST processes (building word lists, scanning, and extension) were then
reorganised and run concurrently. It takes longer in two stages, according to the BLAST algorithm: searching
for matching terms in the genome database and extending the seeds. Thus, the proposed BLAST-BSPMR
framework is employed to parallelize the BLAST algorithm, thereby increasing the calculation's overall
efficiency. The BLAST-BSPMR technique is implemented in three stages: data storage in Azure blob
storage, genomic sequence preparation, and genomic sequence parallelization. The first stage involves
transferring the genomic database and query sequence to Azure blob storage. The second level preprocesses
each computational worker node's genetic data. The third stage performs further accurate matches, extension,
and statistics on the preprocessed sequence seeds prior to giving them as input to the BLAST-BSPMR
algorithm. The sections that follow describe the suggested method's several stages.
2.2.1. Pre-processing of gene sequence data
As stated in the basic procedure of the NCBI BLAST algorithm, the first criterion the BLAST
algorithm is to generate a word list for the BLAST-BSPMR method. The purpose of identifying words with a
high correlation coefficient during the pre-processing of genomic sequence data is to simplify calculations or
computations concerning accurate/correct matches and seed expansion. This technique accelerates the third
phase of sequence alignment. Due to the fact that the research query data consists of nucleic acid sequences,
the lengths of the genomic query and database sequences are x and y, respectively, and the number of words
is x-10 and y-10. The following stages must be followed in the pre-processing of genomic sequence data:
Stages of mapping, sorting, shuffle, and reduction. The author generates a statistic of nucleobases (ATGC)
with the highest occurrence number using a Map scheme on each word in the Map stage, such as <T,4>,
where T indicates the most often occurring nucleobase and 4 denotes the number. The output of Map is
sorted by the frequency of each nucleobase during the sort stage. In the shuffle phase, the sort stage's results
are jumbled and merged. If the frequency of nucleobase T in the word𝑞𝑠𝑖(i = 0, 11, … . , 10y + 1, … . ) in the
genomic query sequence is num1 and the count of nucleobase T in the word𝑑𝑏𝑖(i = 0, 11, … . , 10y +
1, … . ) in the genomic database sequence is num2, 𝑞𝑠𝑖will combine with all𝑑𝑏𝑖, and rank them according to
the size of num1+num2. List of all𝑑𝑏𝑖combined with𝑞𝑠𝑖as items in the BLAST-BSPMR algorithm's word list
in the final stage of Reduce, based on the output of the shuffle stage. The following step is to save the results
to Azure Blob Storage for later analysis.
2.2.2. Alignment of gene sequences in parallel
The word list for the scanner is produced from preprocessed genomic sequencing data. The three
basic steps in the parallel alignment of genomic sequences using a word list and a scanner are exact word
matches, seed expansion, and statistical significance during seed expansion. To facilitate parallelization of
genomic sequence alignment, the generated word list and scanner should be distributed among a large
number of virtual processing nodes. The flow chart in Figure 3 depicts the parallelization of the BLAST-
BSPMR method with MapReduce.
 ISSN: 2502-4752
Indonesian J Elec Eng & Comp Sci, Vol. 25, No. 2, February 2022: 1078-1086
1082
Figure 3. The Parallelized BLAST-BSPMR model
2.2.3. The steps required in aligning the BLAST-BSPMR method in parallel
a) Each Azure virtual computer node will receive the list of words and scanner, and the gene sequence
willbe aligned using a BLAST-BSPMR MapReduce Job.
b) During the Map phase, each Map task reads the scanner and a list of words from local memory to build
accurate matches for items with a high correlation; items that satisfy the matching conditions are saved as
words in the following extension.
c) Following scanning, the restored bits from second step 2) will go through three processes in
eachcomputing node: they will initiate a simple match extension without space, they will use DP to find
amatch extension with vacancies, and they will finally obtain alignment results that satisfy the
givenconditions.
d) The results of the comparison will be arranged according to the score acquired during the shuffle phase.
e) The Reduce phase transfers the genomic sequence alignment findings to Azure Blob Storage in their
original format.
2.3. Proposed BLAST alignment on the parallelized MapReduce model BLAST-BSPMR
Let 𝒟ℬ denote a bioinformatics sequence of the genome database and 𝒬𝓈 the genomic query
sequence. The 𝐵𝐿𝐴𝑆𝑇 − 𝐵𝑆𝑃𝑀𝑅 is positioned on a public cloud platform namely azure consist of a master
node, Map and Reduce worker computing nodes. The master computing node of 𝐵𝐿𝐴𝑆𝑇 − 𝐵𝑆𝑃𝑀𝑅 initializes
𝒸𝓌 Map and Reduce worker computing nodes using the virtual computing nodes. Every virtual computing
node is presumed to 𝒞CPU cores available for task/job computation. Let 𝕋𝔼𝑉𝑀_𝑖𝑛𝑧represent the time spent
initialising the virtual platform. The sequence 𝒟ℬ is a bioinformatics database that is divided into𝒟ℬ′
tiny
chunks of sequence data with overlapping portions. A database's short bit of data sequence and the 𝒬𝓈 are
then transmitted as input key, value pairs to the Map computing nodes. The key value pairings that take into
account the reference small bit of data 𝒟ℬ′
are denoted as (𝑘𝑑, 𝑣𝑑), where 𝑘𝑑 is the key and 𝑣𝑑 which
contains the genomic data with overlapping offsets. The key-value pair corresponding to the genome query
sequence 𝒬𝓈 is indicated by (𝑘𝑞𝓈, 𝑣𝑞𝓈), where 𝑘𝑞𝓈 is the key and 𝑣𝑞𝓈 is the genome query sequence. Each
of the 𝒸𝑤 computing Map workers divides the query sequence 𝒬𝓈 into 𝒬𝓈′
little bits of data and stores them in
the available local memory storage. The sequence alignment is performed using the BLASTx algorithm in
parallel using the𝒞cores taking into account 𝒟ℬ′
and each and every 𝒬𝓈′
. The BLASTx algorithm is
parallelized to reduce the time required to complete genomic sequence alignments. Let 𝕋𝔼𝕄𝕊 denote the
average time taken by the 𝒸𝓌 Map compute nodes to complete a task. The mappers give alignment positions
between 𝒬𝓈 and database small bit of data of 𝒟ℬ′
along with the computed score as a result of their post
computations. Multiple alignment places and computed score, i.e., 𝓋𝓂𝓈 combined with the small bit of data
id of 𝒟ℬ′
i.e., 𝓀𝓂𝓈 are saved in the memory in the following manner: 𝑙𝑖𝑠𝑡(𝓀𝓂𝓈, 𝓋𝓂𝓈). The BLAST-
BSPMR Mapfunction is represented as 𝑚𝑎𝑝((𝓀𝒹, 𝓋𝒹), (𝑘𝑞𝓀𝓆𝓈, 𝓋𝓆𝓈)) → 𝑙𝑖𝑠𝑡(𝓀𝓂𝓅, 𝓋𝓂𝓅).
The 𝒸𝓌 Map computing worker nodes obtain intermediate genomic data i.e.,𝑙𝑖𝑠𝑡(𝓀𝓂𝓅, 𝓋𝓂𝓅) and perform
shuffle and sort operations. The reduce step considers the collection of all non-overlapping and non-redundant
Indonesian J Elec Eng & Comp Sci ISSN: 2502-4752 
An efficient and robust parallel scheduler for bioinformatics … (Leena Ammanna)
1083
alignment results, i.e., 𝑙𝑖𝑠𝑡(𝓋𝔤𝒹). Allow 𝕋𝔼ℝ𝕊 is the average time spent by the 𝒸𝓌 Reduce computing worker
nodes to complete the storage procedure (gather the results). The total time required for the BLAST-BSPMR
to align the genomic sequence using a cloud-based platform 𝒬𝓈 against the database, 𝒟ℬ is calculated as,
𝔼 = 𝕋𝔼𝑉𝑀_𝑖𝑛𝑧 + 𝕋𝔼𝕄𝕊 + 𝕋𝔼ℝ𝕊 . (1)
the BLAST-BSPMR cloud platform is depicted in Figure 3.
3. RESULTS AND DISCUSSION
The machine was configured with a 64-bit version of Windows 10 Enterprise, 16 GB of RAM, and an
i-5 quad core processor. We conducted an experimental investigation comparing the proposed BLAST-BSPMR
gene sequencing model to the existing Hadoop-BLAST sequencing model on the following parameters:
speedups, throughput, and sequence alignment completion time using the dot net framework 4.0 and C# 6.0
programming language for the proposed work and java programming language for the existing Hadoop.
It is considered to compare the performance of BLAST-BSPMR with Hadoop-BLAST. With a
single VM computing node, Hadoop-BLAST and BLAST-BSPMR can be deployed. The Microsoft Azure
cloud platform hosts the BLAST-BSPMR. The Hadoop-BLAST application is built on top of the Hadoop
MapReduce framework. The Hadoop-BLAST is deployed using Apache Hadoop & YARN 2.6.0. In the
deployments, identical configurations of VM computing nodes are taken into account. The𝐵𝐿𝐴𝑆𝑇 −
𝐵𝑆𝑃𝑀𝑅 model is hosted on Microsoft's Azure cloud considering A3 VM instances. Each A3 VM instance
consists of 4 virtual computing cores, 7 GB of RAM and 120 GB of local hard drive space. The𝐵𝐿𝐴𝑆𝑇 −
𝐵𝑆𝑃𝑀𝑅model deployed on the Azure cloud platforms are comprised of a master node and a worker node for
the purpose of performing Map and Reduce operations/tasks. Using Azure HDInsight. It enables deployment
and provisioning of Apache Hadoop clusters on the Azure cloud platform. The Apache Hadoop & YARN
version 2.6.0 is considered for performance evaluation. The master node of the azure cluster runs on the
Windows Server 2012 R2 operating system. One worker node of A3 VM instances is considered for the
Hadoop deployment. For evaluation, the non-redundant protein genomic database is used. The initial
investigations employ a continuous non-redundant protein genomic sequence and four nucleotide query
sequences of varying sizes. Thefirst experiment conducted with the reference and varied query genomic
sequences are summarized in Table 1. Figure 4 shows the execution time and Figure 5 shows the speedup
and throughput and Figure 6 shows the average speedup and throughput. The results show that the proposed
BLAST-BSPMR aligner, which is running on Azure, outperforms Hadoop-BLAST. For both BLAST-
BSPMR and Hadoop-BLAST, the execution time increases as the query file length grows. The second
experiment is conducted using varied non-redundant protein genomic sequence and constant query sequences
of nucleotide. The total time (including the Map and Reduce stages) taken to execute the alignment is
monitored using Table 2 for the BLASTx sequence alignment computation on the BLAST-BSPMR and
Hadoop-BLAST. The results obtained are summarized in Table 3. The total time required to complete
sequence alignment (as depicted in Figure 7) is recorded. The results show that the proposed BLAST-
BSPMR aligner, which is running on Azure, outperforms Hadoop-BLAST. Figure 8 shows the speedup and
throughput for the experiment l and Figure 9 depicts the average speedup and throughput obtained for the
experiment 2. As the database file size increases, the execution time also increases for both BLAST-BSPMR
and Hadoop-BLAST.
Table 1. Experiment data used to compare BLAST-performance BSPMR's to that of BLAST-Hadoop with
varied size of query genome
No Database Genome Size (GB) Query Genome Size (kb)
1 nr.01(non-redundant protein) 2.99 Nucleotide sequence 16
2 nr.01(non-redundant protein) 2.99 Nucleotide sequence 32
3 nr.01(non-redundant protein) 2.99 Nucleotide sequence 64
4 nr.01(non-redundant protein) 2.99 Nucleotide sequence 128
Table 2. Experiment data used to compare BLAST-performance BSPMR's to that of BLAST-Hadoop with
varied size of database genom
No Database genome Size (GB) Query genome Size (kb)
1 nr.40 (non-redundant protein) 0.078 Nucleotide sequence 128
2 nr.04 (non-redundant protein) 0.299 Nucleotide sequence 128
3 nr.05 (non-redundant protein) 1.075 Nucleotide sequence 128
4 nr.01 (non-redundant protein) 2.99 Nucleotide sequence 128
 ISSN: 2502-4752
Indonesian J Elec Eng & Comp Sci, Vol. 25, No. 2, February 2022: 1078-1086
1084
Figure 4. Sequence alignment execution time Figure 5. Speedup and throughput
Figure 6. Average speedup and throughput
Table 3. Speedup and throughput for varied query genome
Seg. File Size Speedup Throughput
16 KB 4.1 76%
32 KB 2.6 62%
64 KB 2.14 53%
128 KB 2.07 53%
Average 2.72 51%
0.078 GB 7.067 85.0%
0.299 GB 3.35 70.0%
1.075 GB 2.16 53.8%
2.99 GB 2.07 51.8%
Average 3.66 65.15%
Figure 7. Sequence alignment execution time for
varied non-redundant protein database
Figure 8. Speedup and throughput for varied
non-redundant protein database
Indonesian J Elec Eng & Comp Sci ISSN: 2502-4752 
An efficient and robust parallel scheduler for bioinformatics … (Leena Ammanna)
1085
Figure 9. Average speedup and throughput non-redundant protein database
4. CONCLUSION
Aligning genomic sequences is a straightforward technique for managing and analysing
bioinformatics data. The author describes the proper methods for aligning sequences using the BLAST
algorithm, which is the most widely used tool for aligning local sequences. Currently, the BLAST algorithm
given by NCBI (stand-alone) is unable to handle dynamic biological data in the terabyte range. Cloud
platforms are used to address issues such as data storage and data-intensive computation. Biosequencing
analysis of genetic data is a critical application. In this study, the author evaluates many existing schedulers.
The existing scheduler has issues which are expensive with aligning genomic data sequences. Existing
bioinformatics sequence aligners that make use of or embrace Hadoop MapReduce suffer from the concerns
discussed in this study. In this paper, the proposed BLAST-BSPMR algorithm is employed to align
sequences. For biosequence alignment, the BLASTx algorithm is used, and a parallel MapReduce execution
approach based on Azure cloud is used in the BLAST-BSPMR cloud platform. The Map and Reduce
framework are executed in parallel to take use of the cloud computing platform's virtual machine-based
design. Additionally, the research compares the proposed BLAST-BSPMR technique to previously published
system sequence alignments. Experiments are carried out with a variety of non-redundant proteins and
nucleotide query sequence files of various sizes. Experiments are presented to demonstrate the efficiency of
the BLASTx algorithm. Through an experimental study, a comparison with Hadoop-BLAST for sequence
alignment is presented. When comparing the BLAST-BSPMR results to the Hadoop-BLAST results, the
BLAST-BSPMR results show a significant improvement. The authors plan to test their theory on a variety of
redundant databases in the future such as SwissProt protein, and REFSEQ and also run varied application
that are available in NCBI such as BLASTn, and BLASTp on our proposed BSPMR Scheduler to further
analyse the robustness and efficacy of our scheduler.
REFERENCES
[1] D. C. Marinescu, "Parallel and Distributed Computing: Memories of Time Past and a Glimpse at the Future," 2014 IEEE 13th
International Symposium on Parallel and Distributed Computing, 2014, pp. 14-15, doi: 10.1109/ISPDC.2014.33.
[2] Gartner.Inc, Gartner Forecasts Worldwide Public Cloud End-User Spending to Grow 23% in 2021, 2021. Accessed: Apr. 21,
2021. [Online]. Available: https://www.gartner.com/en/newsroom/press-releases/2021-04-21-gartner-forecasts-worldwide-public-
cloud-end-user-spending-to-grow-23-percent-in-2021
[3] W. Hassan, T. S. Chou, T. Omar, and J. Pickard, “Cloud computing survey on services, enhancements and challenges in the era of
machine learning and data science,” Int J Inf & Commun Technol., vol. 9, no. 2, pp. 117-139, 2020, doi: 10.11591/ijict.v9i2.
[4] W. Hassan, T.-S. Chou, X. Li, P. A.-Kubi, and O. Tamer, “Latest trends, challenges and solutions in security in the era of cloud
computing and software defined networks,” Int J Inf & Commun Technol., vol. 8, no. 3, pp. 162-183, 2019, doi:
10.11591/ijict.v8i3.pp162-183.
[5] P. Mehrotra, J. Djomegri, S. Heistand, and R. Hood, “Performance evaluation of Amazon EC2 for NASA HPC applications,"
Proceedings of the 3rd workshop on Scientific Cloud Computing, 2012, pp. 41-50, doi: 10.1145/2287036.2287045.
[6] A. Gupta et al., "Evaluating and Improving the Performance and Scheduling of HPC Applications in Cloud," in IEEE
Transactions on Cloud Computing, vol. 4, no. 3, pp. 307-321, 1 July-Sept. 2016, doi: 10.1109/TCC.2014.2339858.
[7] Z. Cai, X. Li, and J. N. D. Gupta, "Heuristics for Provisioning Services to Workflows in XaaS Clouds," in IEEE Transactions on
Services Computing, vol. 9, no. 2, pp. 250-263, 1 March-April 2016, doi: 10.1109/TSC.2014.2361320.
[8] M. C. Okur and M. Büyükkeçeci, "Big data challenges in information engineering curriculum," 2014 25th EAEEIE Annual
Conference (EAEEIE), 2014, pp. 1-4, doi: 10.1109/EAEEIE.2014.6879372.
[9] J. Dean and S. Ghemawat, "MapReduce: simplified data processing on large clusters,” Communications of the ACM, vol. 51,
no. 1, pp. 107-113, 2008, doi: 10.1145/1327452.1327492.
[10] T. White, Hadoop: The definitive guide, O’Reilly Media, Inc., 2012.
[11] A. Hadoop, 2020. Accessed: Sep. 21, 2021. [Online]. Available: http://hadoop.apache.org
 ISSN: 2502-4752
Indonesian J Elec Eng & Comp Sci, Vol. 25, No. 2, February 2022: 1078-1086
1086
[12] J. Zhu, J. Li, E. Hardesty, H. Jiang, and K. Li, "GPU-in-Hadoop: Enabling MapReduce across distributed heterogeneous
platforms," 2014 IEEE/ACIS 13th International Conference on Computer and Information Science (ICIS), 2014, pp. 321-326, doi:
10.1109/ICIS.2014.6912154.
[13] D. Dahiphale et al., "An Advanced MapReduce: Cloud MapReduce, Enhancements and Applications," in IEEE Transactions on
Network and Service Management, vol. 11, no. 1, pp. 101-115, March 2014, doi: 10.1109/TNSM.2014.031714.130407.
[14] P. Kondikoppa et al., "MapReduce-Based RESTMD: Enabling Large-Scale Sampling Tasks with Distributed HPC Systems,"
2014 6th International Workshop on Science Gateways, 2014, pp. 30-35, doi: 10.1109/IWSG.2014.12.
[15] S. Tang, B. Lee, and B. He, "DynamicMR: A Dynamic Slot Allocation Optimization Framework for MapReduce Clusters," in
IEEE Transactions on Cloud Computing, vol. 2, no. 3, pp. 333-347, 1 July-Sept. 2014, doi: 10.1109/TCC.2014.2329299.
[16] M. Zaharia, A. Konwinski, A. D. Joseph, R. Katz, and I. Stoica "Improving MapReduce performance in heterogeneous
environments,” Osdi, vol. 8, no. 4, pp. 29-42, 2008, doi: 10.5555/1855741.1855744.
[17] Y. Tao, Q. Zhang, L. Shi, and P. Chen, "Job Scheduling Optimization for Multi-user MapReduce Clusters," 2011 Fourth
International Symposium on Parallel Architectures, Algorithms and Programming, 2011, pp. 213-217, doi:
10.1109/PAAP.2011.33.
[18] C. Kaushal and D. Koundal, “Recent Trends in Big Data using Hadoop,” International Journal of Informatics and
Communication Technology,” International Journal of Informatics and Communication Technology, vol. 8, no. 1, pp. 39-49,
2019, doi: 10.11591/ijict.v8i1.pp39-49.
[19] M. Zaharia, D. Borthakur, and J. S. Sarma, "Delay scheduling: a simple technique for achieving locality and fairness in cluster
scheduling,” Proceedings of the 5th European conference on Computer systems, 2010, pp. 265-278, doi:
10.1145/1755913.1755940.
[20] C. Wang et al., "Mapcheckreduce: an improved mapreduce computing model for imprecise applications,” IEEE International
Congress on Big Data, 2014, doi: 10.1109/BigData.Congress.2014.61.
[21] G. Malewicz, M. H. Austern, A. J. C. Bik, and J. C. Dehnert, "Pregel: a system for large-scale graph processing,” Proceedings of
the 2010 ACM SIGMOD International Conference on Management of data, 2010, pp. 135-146, doi: 10.1145/1807167.1807184.
[22] L. G. Valiant, "A bridging model for parallel computation,” Communications of the ACM, vol. 33, no. 8, pp. 103-111, 1990, doi:
10.1145/79173.79181.
[23] T. Kajdanowicz, W. Indyk, P. Kazienko, and J. Kukul, "Comparison of the Efficiency of MapReduce and Bulk Synchronous
Parallel Approaches to Large Network Processing," 2012 IEEE 12th International Conference on Data Mining Workshops, 2012,
pp. 218-225, doi: 10.1109/ICDMW.2012.135.
[24] T. Nguyen, W. Shi, and D. Ruden, "CloudAligner: A fast and full-featured MapReduce based tool for sequence mapping,” BMC
research notes, vol. 4, no. 1, pp. 1-7, 2011, doi: 10.1186/1756-0500-4-171.
[25] M. C. Schatz, “CloudBurst: highly sensitive read mapping with MapReduce,” Bioinformatics, vol. 25, no. 11, 2009, doi:
10.1093/bioinformatics/btp236.
BIOGRAPHIES OF AUTHORS
Leena Ammanna is currently an Assistant Professor in the department of
Information Science and Engineering, SDM College of Engineering and Technology,
Dharwad, India. She received B.E. and M.Tech. Degrees from Visvesvaraya Technological
University, Karnataka, India. She has 20 years of experience in teaching and research. She
can be contacted at leenaignatiusakri@gmail.com.
Jagadeeshgowda is a Professor and Head of the Department of Computer
Science and Engineering, Shri Krishna Institute of Technology, Bengaluru, India. He has an
experience of 24 years in research, academics and industry. He received his Ph.D. from
University of JJT,India. Image processing,big data analytics, and network security are his
major research interests. He has wide publications in reputed international conferences and
journals. He has supervised four Ph.D. and 50 plus Post Graduate students. He has received
grants from Government of India. He can be contacted at hodcs.skitblr@gmail.com.
Jagadeesh Pujari is a Professor and Head of the Department of Information
Science and Engineering, SDM College of Engineering and Technology, Dharwad, India.
He has an experience of 32 years in research, academics and industry. He received his Ph.D.
from University of Gulbarga, Karnataka, India.His research interests span image processing,
pattern recognition, big data analytics, and machine learning. He has wide publications in
reputed international conferences and journals. He has supervised five Ph.D. and 80 plus
Post Graduate students. He is the investigator for several projects funded by Government of
India and Industries. He can be contacted at jaggudp@gmail.com.
Ad

More Related Content

Similar to An efficient and robust parallel scheduler for bioinformatics applications in a public cloud: A bigdata approach (20)

Multiple dag applications
Multiple dag applicationsMultiple dag applications
Multiple dag applications
csandit
 
MULTIPLE DAG APPLICATIONS SCHEDULING ON A CLUSTER OF PROCESSORS
MULTIPLE DAG APPLICATIONS SCHEDULING ON A CLUSTER OF PROCESSORSMULTIPLE DAG APPLICATIONS SCHEDULING ON A CLUSTER OF PROCESSORS
MULTIPLE DAG APPLICATIONS SCHEDULING ON A CLUSTER OF PROCESSORS
cscpconf
 
Sharing of cluster resources among multiple Workflow Applications
Sharing of cluster resources among multiple Workflow ApplicationsSharing of cluster resources among multiple Workflow Applications
Sharing of cluster resources among multiple Workflow Applications
ijcsit
 
A data aware caching 2415
A data aware caching 2415A data aware caching 2415
A data aware caching 2415
SANTOSH WAYAL
 
Multi-objective load balancing in cloud infrastructure through fuzzy based de...
Multi-objective load balancing in cloud infrastructure through fuzzy based de...Multi-objective load balancing in cloud infrastructure through fuzzy based de...
Multi-objective load balancing in cloud infrastructure through fuzzy based de...
IAESIJAI
 
LARGE-SCALE DATA PROCESSING USING MAPREDUCE IN CLOUD COMPUTING ENVIRONMENT
LARGE-SCALE DATA PROCESSING USING MAPREDUCE IN CLOUD COMPUTING ENVIRONMENTLARGE-SCALE DATA PROCESSING USING MAPREDUCE IN CLOUD COMPUTING ENVIRONMENT
LARGE-SCALE DATA PROCESSING USING MAPREDUCE IN CLOUD COMPUTING ENVIRONMENT
ijwscjournal
 
LARGE-SCALE DATA PROCESSING USING MAPREDUCE IN CLOUD COMPUTING ENVIRONMENT
LARGE-SCALE DATA PROCESSING USING MAPREDUCE IN CLOUD COMPUTING ENVIRONMENTLARGE-SCALE DATA PROCESSING USING MAPREDUCE IN CLOUD COMPUTING ENVIRONMENT
LARGE-SCALE DATA PROCESSING USING MAPREDUCE IN CLOUD COMPUTING ENVIRONMENT
ijwscjournal
 
Performance evaluation of Map-reduce jar pig hive and spark with machine lear...
Performance evaluation of Map-reduce jar pig hive and spark with machine lear...Performance evaluation of Map-reduce jar pig hive and spark with machine lear...
Performance evaluation of Map-reduce jar pig hive and spark with machine lear...
IJECEIAES
 
Qo s aware scientific application scheduling algorithm in cloud environment
Qo s aware scientific application scheduling algorithm in cloud environmentQo s aware scientific application scheduling algorithm in cloud environment
Qo s aware scientific application scheduling algorithm in cloud environment
Alexander Decker
 
Qo s aware scientific application scheduling algorithm in cloud environment
Qo s aware scientific application scheduling algorithm in cloud environmentQo s aware scientific application scheduling algorithm in cloud environment
Qo s aware scientific application scheduling algorithm in cloud environment
Alexander Decker
 
A Survey on Data Mapping Strategy for data stored in the storage cloud 111
A Survey on Data Mapping Strategy for data stored in the storage cloud  111A Survey on Data Mapping Strategy for data stored in the storage cloud  111
A Survey on Data Mapping Strategy for data stored in the storage cloud 111
NavNeet KuMar
 
Scheduling in cloud computing
Scheduling in cloud computingScheduling in cloud computing
Scheduling in cloud computing
ijccsa
 
A NOVEL METHODOLOGY FOR TASK DISTRIBUTION IN HETEROGENEOUS RECONFIGURABLE COM...
A NOVEL METHODOLOGY FOR TASK DISTRIBUTION IN HETEROGENEOUS RECONFIGURABLE COM...A NOVEL METHODOLOGY FOR TASK DISTRIBUTION IN HETEROGENEOUS RECONFIGURABLE COM...
A NOVEL METHODOLOGY FOR TASK DISTRIBUTION IN HETEROGENEOUS RECONFIGURABLE COM...
ijesajournal
 
A NOVEL METHODOLOGY FOR TASK DISTRIBUTION IN HETEROGENEOUS RECONFIGURABLE COM...
A NOVEL METHODOLOGY FOR TASK DISTRIBUTION IN HETEROGENEOUS RECONFIGURABLE COM...A NOVEL METHODOLOGY FOR TASK DISTRIBUTION IN HETEROGENEOUS RECONFIGURABLE COM...
A NOVEL METHODOLOGY FOR TASK DISTRIBUTION IN HETEROGENEOUS RECONFIGURABLE COM...
ijesajournal
 
A Novel Methodology for Task Distribution in Heterogeneous Reconfigurable Com...
A Novel Methodology for Task Distribution in Heterogeneous Reconfigurable Com...A Novel Methodology for Task Distribution in Heterogeneous Reconfigurable Com...
A Novel Methodology for Task Distribution in Heterogeneous Reconfigurable Com...
ijesajournal
 
A Novel Methodology for Task Distribution in Heterogeneous Reconfigurable Com...
A Novel Methodology for Task Distribution in Heterogeneous Reconfigurable Com...A Novel Methodology for Task Distribution in Heterogeneous Reconfigurable Com...
A Novel Methodology for Task Distribution in Heterogeneous Reconfigurable Com...
ijesajournal
 
A NOVEL METHODOLOGY FOR TASK DISTRIBUTION IN HETEROGENEOUS RECONFIGURABLE COM...
A NOVEL METHODOLOGY FOR TASK DISTRIBUTION IN HETEROGENEOUS RECONFIGURABLE COM...A NOVEL METHODOLOGY FOR TASK DISTRIBUTION IN HETEROGENEOUS RECONFIGURABLE COM...
A NOVEL METHODOLOGY FOR TASK DISTRIBUTION IN HETEROGENEOUS RECONFIGURABLE COM...
ijesajournal
 
A novel methodology for task distribution
A novel methodology for task distributionA novel methodology for task distribution
A novel methodology for task distribution
ijesajournal
 
IRJET- A Review on K-Means++ Clustering Algorithm and Cloud Computing wit...
IRJET-  	  A Review on K-Means++ Clustering Algorithm and Cloud Computing wit...IRJET-  	  A Review on K-Means++ Clustering Algorithm and Cloud Computing wit...
IRJET- A Review on K-Means++ Clustering Algorithm and Cloud Computing wit...
IRJET Journal
 
Ieeepro techno solutions ieee dotnet project - budget-driven scheduling alg...
Ieeepro techno solutions   ieee dotnet project - budget-driven scheduling alg...Ieeepro techno solutions   ieee dotnet project - budget-driven scheduling alg...
Ieeepro techno solutions ieee dotnet project - budget-driven scheduling alg...
ASAITHAMBIRAJAA
 
Multiple dag applications
Multiple dag applicationsMultiple dag applications
Multiple dag applications
csandit
 
MULTIPLE DAG APPLICATIONS SCHEDULING ON A CLUSTER OF PROCESSORS
MULTIPLE DAG APPLICATIONS SCHEDULING ON A CLUSTER OF PROCESSORSMULTIPLE DAG APPLICATIONS SCHEDULING ON A CLUSTER OF PROCESSORS
MULTIPLE DAG APPLICATIONS SCHEDULING ON A CLUSTER OF PROCESSORS
cscpconf
 
Sharing of cluster resources among multiple Workflow Applications
Sharing of cluster resources among multiple Workflow ApplicationsSharing of cluster resources among multiple Workflow Applications
Sharing of cluster resources among multiple Workflow Applications
ijcsit
 
A data aware caching 2415
A data aware caching 2415A data aware caching 2415
A data aware caching 2415
SANTOSH WAYAL
 
Multi-objective load balancing in cloud infrastructure through fuzzy based de...
Multi-objective load balancing in cloud infrastructure through fuzzy based de...Multi-objective load balancing in cloud infrastructure through fuzzy based de...
Multi-objective load balancing in cloud infrastructure through fuzzy based de...
IAESIJAI
 
LARGE-SCALE DATA PROCESSING USING MAPREDUCE IN CLOUD COMPUTING ENVIRONMENT
LARGE-SCALE DATA PROCESSING USING MAPREDUCE IN CLOUD COMPUTING ENVIRONMENTLARGE-SCALE DATA PROCESSING USING MAPREDUCE IN CLOUD COMPUTING ENVIRONMENT
LARGE-SCALE DATA PROCESSING USING MAPREDUCE IN CLOUD COMPUTING ENVIRONMENT
ijwscjournal
 
LARGE-SCALE DATA PROCESSING USING MAPREDUCE IN CLOUD COMPUTING ENVIRONMENT
LARGE-SCALE DATA PROCESSING USING MAPREDUCE IN CLOUD COMPUTING ENVIRONMENTLARGE-SCALE DATA PROCESSING USING MAPREDUCE IN CLOUD COMPUTING ENVIRONMENT
LARGE-SCALE DATA PROCESSING USING MAPREDUCE IN CLOUD COMPUTING ENVIRONMENT
ijwscjournal
 
Performance evaluation of Map-reduce jar pig hive and spark with machine lear...
Performance evaluation of Map-reduce jar pig hive and spark with machine lear...Performance evaluation of Map-reduce jar pig hive and spark with machine lear...
Performance evaluation of Map-reduce jar pig hive and spark with machine lear...
IJECEIAES
 
Qo s aware scientific application scheduling algorithm in cloud environment
Qo s aware scientific application scheduling algorithm in cloud environmentQo s aware scientific application scheduling algorithm in cloud environment
Qo s aware scientific application scheduling algorithm in cloud environment
Alexander Decker
 
Qo s aware scientific application scheduling algorithm in cloud environment
Qo s aware scientific application scheduling algorithm in cloud environmentQo s aware scientific application scheduling algorithm in cloud environment
Qo s aware scientific application scheduling algorithm in cloud environment
Alexander Decker
 
A Survey on Data Mapping Strategy for data stored in the storage cloud 111
A Survey on Data Mapping Strategy for data stored in the storage cloud  111A Survey on Data Mapping Strategy for data stored in the storage cloud  111
A Survey on Data Mapping Strategy for data stored in the storage cloud 111
NavNeet KuMar
 
Scheduling in cloud computing
Scheduling in cloud computingScheduling in cloud computing
Scheduling in cloud computing
ijccsa
 
A NOVEL METHODOLOGY FOR TASK DISTRIBUTION IN HETEROGENEOUS RECONFIGURABLE COM...
A NOVEL METHODOLOGY FOR TASK DISTRIBUTION IN HETEROGENEOUS RECONFIGURABLE COM...A NOVEL METHODOLOGY FOR TASK DISTRIBUTION IN HETEROGENEOUS RECONFIGURABLE COM...
A NOVEL METHODOLOGY FOR TASK DISTRIBUTION IN HETEROGENEOUS RECONFIGURABLE COM...
ijesajournal
 
A NOVEL METHODOLOGY FOR TASK DISTRIBUTION IN HETEROGENEOUS RECONFIGURABLE COM...
A NOVEL METHODOLOGY FOR TASK DISTRIBUTION IN HETEROGENEOUS RECONFIGURABLE COM...A NOVEL METHODOLOGY FOR TASK DISTRIBUTION IN HETEROGENEOUS RECONFIGURABLE COM...
A NOVEL METHODOLOGY FOR TASK DISTRIBUTION IN HETEROGENEOUS RECONFIGURABLE COM...
ijesajournal
 
A Novel Methodology for Task Distribution in Heterogeneous Reconfigurable Com...
A Novel Methodology for Task Distribution in Heterogeneous Reconfigurable Com...A Novel Methodology for Task Distribution in Heterogeneous Reconfigurable Com...
A Novel Methodology for Task Distribution in Heterogeneous Reconfigurable Com...
ijesajournal
 
A Novel Methodology for Task Distribution in Heterogeneous Reconfigurable Com...
A Novel Methodology for Task Distribution in Heterogeneous Reconfigurable Com...A Novel Methodology for Task Distribution in Heterogeneous Reconfigurable Com...
A Novel Methodology for Task Distribution in Heterogeneous Reconfigurable Com...
ijesajournal
 
A NOVEL METHODOLOGY FOR TASK DISTRIBUTION IN HETEROGENEOUS RECONFIGURABLE COM...
A NOVEL METHODOLOGY FOR TASK DISTRIBUTION IN HETEROGENEOUS RECONFIGURABLE COM...A NOVEL METHODOLOGY FOR TASK DISTRIBUTION IN HETEROGENEOUS RECONFIGURABLE COM...
A NOVEL METHODOLOGY FOR TASK DISTRIBUTION IN HETEROGENEOUS RECONFIGURABLE COM...
ijesajournal
 
A novel methodology for task distribution
A novel methodology for task distributionA novel methodology for task distribution
A novel methodology for task distribution
ijesajournal
 
IRJET- A Review on K-Means++ Clustering Algorithm and Cloud Computing wit...
IRJET-  	  A Review on K-Means++ Clustering Algorithm and Cloud Computing wit...IRJET-  	  A Review on K-Means++ Clustering Algorithm and Cloud Computing wit...
IRJET- A Review on K-Means++ Clustering Algorithm and Cloud Computing wit...
IRJET Journal
 
Ieeepro techno solutions ieee dotnet project - budget-driven scheduling alg...
Ieeepro techno solutions   ieee dotnet project - budget-driven scheduling alg...Ieeepro techno solutions   ieee dotnet project - budget-driven scheduling alg...
Ieeepro techno solutions ieee dotnet project - budget-driven scheduling alg...
ASAITHAMBIRAJAA
 

More from nooriasukmaningtyas (20)

Analysis of active islanding detection techniques for gridconnected inverters...
Analysis of active islanding detection techniques for gridconnected inverters...Analysis of active islanding detection techniques for gridconnected inverters...
Analysis of active islanding detection techniques for gridconnected inverters...
nooriasukmaningtyas
 
Influence of non-sinusoidal power supply on the performance of a single-phase...
Influence of non-sinusoidal power supply on the performance of a single-phase...Influence of non-sinusoidal power supply on the performance of a single-phase...
Influence of non-sinusoidal power supply on the performance of a single-phase...
nooriasukmaningtyas
 
Cellular automata model for emergent properties of pressure flow in single ne...
Cellular automata model for emergent properties of pressure flow in single ne...Cellular automata model for emergent properties of pressure flow in single ne...
Cellular automata model for emergent properties of pressure flow in single ne...
nooriasukmaningtyas
 
Load shifting impact on generating adequacy assessment during peak period
Load shifting impact on generating adequacy assessment during peak periodLoad shifting impact on generating adequacy assessment during peak period
Load shifting impact on generating adequacy assessment during peak period
nooriasukmaningtyas
 
Performance comparison and impact of weather conditions on different photovol...
Performance comparison and impact of weather conditions on different photovol...Performance comparison and impact of weather conditions on different photovol...
Performance comparison and impact of weather conditions on different photovol...
nooriasukmaningtyas
 
Emergency congestion management of power systems by static synchronous series...
Emergency congestion management of power systems by static synchronous series...Emergency congestion management of power systems by static synchronous series...
Emergency congestion management of power systems by static synchronous series...
nooriasukmaningtyas
 
Analysis and comparison of a proposed mutation operator and its effects on th...
Analysis and comparison of a proposed mutation operator and its effects on th...Analysis and comparison of a proposed mutation operator and its effects on th...
Analysis and comparison of a proposed mutation operator and its effects on th...
nooriasukmaningtyas
 
Social cyber-criminal, towards automatic real time recognition of malicious p...
Social cyber-criminal, towards automatic real time recognition of malicious p...Social cyber-criminal, towards automatic real time recognition of malicious p...
Social cyber-criminal, towards automatic real time recognition of malicious p...
nooriasukmaningtyas
 
Music genres classification by deep learning
Music genres classification by deep learningMusic genres classification by deep learning
Music genres classification by deep learning
nooriasukmaningtyas
 
Predicting students' learning styles using regression techniques
Predicting students' learning styles using regression techniquesPredicting students' learning styles using regression techniques
Predicting students' learning styles using regression techniques
nooriasukmaningtyas
 
Soft computing techniques for early diabetes prediction
Soft computing techniques for early diabetes predictionSoft computing techniques for early diabetes prediction
Soft computing techniques for early diabetes prediction
nooriasukmaningtyas
 
Different analytical frameworks and bigdata model for Internet of Things
Different analytical frameworks and bigdata model for Internet of ThingsDifferent analytical frameworks and bigdata model for Internet of Things
Different analytical frameworks and bigdata model for Internet of Things
nooriasukmaningtyas
 
Network intrusion detection system: machine learning approach
Network intrusion detection system: machine learning approachNetwork intrusion detection system: machine learning approach
Network intrusion detection system: machine learning approach
nooriasukmaningtyas
 
An unsupervised generative adversarial network based-host intrusion detection...
An unsupervised generative adversarial network based-host intrusion detection...An unsupervised generative adversarial network based-host intrusion detection...
An unsupervised generative adversarial network based-host intrusion detection...
nooriasukmaningtyas
 
Crime prediction using a hybrid sentiment analysis approach based on the bidi...
Crime prediction using a hybrid sentiment analysis approach based on the bidi...Crime prediction using a hybrid sentiment analysis approach based on the bidi...
Crime prediction using a hybrid sentiment analysis approach based on the bidi...
nooriasukmaningtyas
 
Hybrid dynamic chunk ensemble model for multi-class data streams
Hybrid dynamic chunk ensemble model for multi-class data streamsHybrid dynamic chunk ensemble model for multi-class data streams
Hybrid dynamic chunk ensemble model for multi-class data streams
nooriasukmaningtyas
 
Chaotic elliptic map for speech encryption
Chaotic elliptic map for speech encryptionChaotic elliptic map for speech encryption
Chaotic elliptic map for speech encryption
nooriasukmaningtyas
 
Efficient processing of continuous spatial-textual queries over geo-textual d...
Efficient processing of continuous spatial-textual queries over geo-textual d...Efficient processing of continuous spatial-textual queries over geo-textual d...
Efficient processing of continuous spatial-textual queries over geo-textual d...
nooriasukmaningtyas
 
Modular reduction with step-by-step using of several bits of the reducible nu...
Modular reduction with step-by-step using of several bits of the reducible nu...Modular reduction with step-by-step using of several bits of the reducible nu...
Modular reduction with step-by-step using of several bits of the reducible nu...
nooriasukmaningtyas
 
Design and implementation of an S-band transmitter for nanosatellites with ne...
Design and implementation of an S-band transmitter for nanosatellites with ne...Design and implementation of an S-band transmitter for nanosatellites with ne...
Design and implementation of an S-band transmitter for nanosatellites with ne...
nooriasukmaningtyas
 
Analysis of active islanding detection techniques for gridconnected inverters...
Analysis of active islanding detection techniques for gridconnected inverters...Analysis of active islanding detection techniques for gridconnected inverters...
Analysis of active islanding detection techniques for gridconnected inverters...
nooriasukmaningtyas
 
Influence of non-sinusoidal power supply on the performance of a single-phase...
Influence of non-sinusoidal power supply on the performance of a single-phase...Influence of non-sinusoidal power supply on the performance of a single-phase...
Influence of non-sinusoidal power supply on the performance of a single-phase...
nooriasukmaningtyas
 
Cellular automata model for emergent properties of pressure flow in single ne...
Cellular automata model for emergent properties of pressure flow in single ne...Cellular automata model for emergent properties of pressure flow in single ne...
Cellular automata model for emergent properties of pressure flow in single ne...
nooriasukmaningtyas
 
Load shifting impact on generating adequacy assessment during peak period
Load shifting impact on generating adequacy assessment during peak periodLoad shifting impact on generating adequacy assessment during peak period
Load shifting impact on generating adequacy assessment during peak period
nooriasukmaningtyas
 
Performance comparison and impact of weather conditions on different photovol...
Performance comparison and impact of weather conditions on different photovol...Performance comparison and impact of weather conditions on different photovol...
Performance comparison and impact of weather conditions on different photovol...
nooriasukmaningtyas
 
Emergency congestion management of power systems by static synchronous series...
Emergency congestion management of power systems by static synchronous series...Emergency congestion management of power systems by static synchronous series...
Emergency congestion management of power systems by static synchronous series...
nooriasukmaningtyas
 
Analysis and comparison of a proposed mutation operator and its effects on th...
Analysis and comparison of a proposed mutation operator and its effects on th...Analysis and comparison of a proposed mutation operator and its effects on th...
Analysis and comparison of a proposed mutation operator and its effects on th...
nooriasukmaningtyas
 
Social cyber-criminal, towards automatic real time recognition of malicious p...
Social cyber-criminal, towards automatic real time recognition of malicious p...Social cyber-criminal, towards automatic real time recognition of malicious p...
Social cyber-criminal, towards automatic real time recognition of malicious p...
nooriasukmaningtyas
 
Music genres classification by deep learning
Music genres classification by deep learningMusic genres classification by deep learning
Music genres classification by deep learning
nooriasukmaningtyas
 
Predicting students' learning styles using regression techniques
Predicting students' learning styles using regression techniquesPredicting students' learning styles using regression techniques
Predicting students' learning styles using regression techniques
nooriasukmaningtyas
 
Soft computing techniques for early diabetes prediction
Soft computing techniques for early diabetes predictionSoft computing techniques for early diabetes prediction
Soft computing techniques for early diabetes prediction
nooriasukmaningtyas
 
Different analytical frameworks and bigdata model for Internet of Things
Different analytical frameworks and bigdata model for Internet of ThingsDifferent analytical frameworks and bigdata model for Internet of Things
Different analytical frameworks and bigdata model for Internet of Things
nooriasukmaningtyas
 
Network intrusion detection system: machine learning approach
Network intrusion detection system: machine learning approachNetwork intrusion detection system: machine learning approach
Network intrusion detection system: machine learning approach
nooriasukmaningtyas
 
An unsupervised generative adversarial network based-host intrusion detection...
An unsupervised generative adversarial network based-host intrusion detection...An unsupervised generative adversarial network based-host intrusion detection...
An unsupervised generative adversarial network based-host intrusion detection...
nooriasukmaningtyas
 
Crime prediction using a hybrid sentiment analysis approach based on the bidi...
Crime prediction using a hybrid sentiment analysis approach based on the bidi...Crime prediction using a hybrid sentiment analysis approach based on the bidi...
Crime prediction using a hybrid sentiment analysis approach based on the bidi...
nooriasukmaningtyas
 
Hybrid dynamic chunk ensemble model for multi-class data streams
Hybrid dynamic chunk ensemble model for multi-class data streamsHybrid dynamic chunk ensemble model for multi-class data streams
Hybrid dynamic chunk ensemble model for multi-class data streams
nooriasukmaningtyas
 
Chaotic elliptic map for speech encryption
Chaotic elliptic map for speech encryptionChaotic elliptic map for speech encryption
Chaotic elliptic map for speech encryption
nooriasukmaningtyas
 
Efficient processing of continuous spatial-textual queries over geo-textual d...
Efficient processing of continuous spatial-textual queries over geo-textual d...Efficient processing of continuous spatial-textual queries over geo-textual d...
Efficient processing of continuous spatial-textual queries over geo-textual d...
nooriasukmaningtyas
 
Modular reduction with step-by-step using of several bits of the reducible nu...
Modular reduction with step-by-step using of several bits of the reducible nu...Modular reduction with step-by-step using of several bits of the reducible nu...
Modular reduction with step-by-step using of several bits of the reducible nu...
nooriasukmaningtyas
 
Design and implementation of an S-band transmitter for nanosatellites with ne...
Design and implementation of an S-band transmitter for nanosatellites with ne...Design and implementation of an S-band transmitter for nanosatellites with ne...
Design and implementation of an S-band transmitter for nanosatellites with ne...
nooriasukmaningtyas
 
Ad

Recently uploaded (20)

ELectronics Boards & Product Testing_Shiju.pdf
ELectronics Boards & Product Testing_Shiju.pdfELectronics Boards & Product Testing_Shiju.pdf
ELectronics Boards & Product Testing_Shiju.pdf
Shiju Jacob
 
Smart_Storage_Systems_Production_Engineering.pptx
Smart_Storage_Systems_Production_Engineering.pptxSmart_Storage_Systems_Production_Engineering.pptx
Smart_Storage_Systems_Production_Engineering.pptx
rushikeshnavghare94
 
"Feed Water Heaters in Thermal Power Plants: Types, Working, and Efficiency G...
"Feed Water Heaters in Thermal Power Plants: Types, Working, and Efficiency G..."Feed Water Heaters in Thermal Power Plants: Types, Working, and Efficiency G...
"Feed Water Heaters in Thermal Power Plants: Types, Working, and Efficiency G...
Infopitaara
 
211421893-M-Tech-CIVIL-Structural-Engineering-pdf.pdf
211421893-M-Tech-CIVIL-Structural-Engineering-pdf.pdf211421893-M-Tech-CIVIL-Structural-Engineering-pdf.pdf
211421893-M-Tech-CIVIL-Structural-Engineering-pdf.pdf
inmishra17121973
 
Introduction to Zoomlion Earthmoving.pptx
Introduction to Zoomlion Earthmoving.pptxIntroduction to Zoomlion Earthmoving.pptx
Introduction to Zoomlion Earthmoving.pptx
AS1920
 
new ppt artificial intelligence historyyy
new ppt artificial intelligence historyyynew ppt artificial intelligence historyyy
new ppt artificial intelligence historyyy
PianoPianist
 
"Boiler Feed Pump (BFP): Working, Applications, Advantages, and Limitations E...
"Boiler Feed Pump (BFP): Working, Applications, Advantages, and Limitations E..."Boiler Feed Pump (BFP): Working, Applications, Advantages, and Limitations E...
"Boiler Feed Pump (BFP): Working, Applications, Advantages, and Limitations E...
Infopitaara
 
Main cotrol jdbjbdcnxbjbjzjjjcjicbjxbcjcxbjcxb
Main cotrol jdbjbdcnxbjbjzjjjcjicbjxbcjcxbjcxbMain cotrol jdbjbdcnxbjbjzjjjcjicbjxbcjcxbjcxb
Main cotrol jdbjbdcnxbjbjzjjjcjicbjxbcjcxbjcxb
SunilSingh610661
 
some basics electrical and electronics knowledge
some basics electrical and electronics knowledgesome basics electrical and electronics knowledge
some basics electrical and electronics knowledge
nguyentrungdo88
 
Metal alkyne complexes.pptx in chemistry
Metal alkyne complexes.pptx in chemistryMetal alkyne complexes.pptx in chemistry
Metal alkyne complexes.pptx in chemistry
mee23nu
 
Lidar for Autonomous Driving, LiDAR Mapping for Driverless Cars.pptx
Lidar for Autonomous Driving, LiDAR Mapping for Driverless Cars.pptxLidar for Autonomous Driving, LiDAR Mapping for Driverless Cars.pptx
Lidar for Autonomous Driving, LiDAR Mapping for Driverless Cars.pptx
RishavKumar530754
 
Oil-gas_Unconventional oil and gass_reseviours.pdf
Oil-gas_Unconventional oil and gass_reseviours.pdfOil-gas_Unconventional oil and gass_reseviours.pdf
Oil-gas_Unconventional oil and gass_reseviours.pdf
M7md3li2
 
New Microsoft PowerPoint Presentation.pdf
New Microsoft PowerPoint Presentation.pdfNew Microsoft PowerPoint Presentation.pdf
New Microsoft PowerPoint Presentation.pdf
mohamedezzat18803
 
introduction to machine learining for beginers
introduction to machine learining for beginersintroduction to machine learining for beginers
introduction to machine learining for beginers
JoydebSheet
 
The Gaussian Process Modeling Module in UQLab
The Gaussian Process Modeling Module in UQLabThe Gaussian Process Modeling Module in UQLab
The Gaussian Process Modeling Module in UQLab
Journal of Soft Computing in Civil Engineering
 
fluke dealers in bangalore..............
fluke dealers in bangalore..............fluke dealers in bangalore..............
fluke dealers in bangalore..............
Haresh Vaswani
 
theory-slides-for react for beginners.pptx
theory-slides-for react for beginners.pptxtheory-slides-for react for beginners.pptx
theory-slides-for react for beginners.pptx
sanchezvanessa7896
 
Machine learning project on employee attrition detection using (2).pptx
Machine learning project on employee attrition detection using (2).pptxMachine learning project on employee attrition detection using (2).pptx
Machine learning project on employee attrition detection using (2).pptx
rajeswari89780
 
IntroSlides-April-BuildWithAI-VertexAI.pdf
IntroSlides-April-BuildWithAI-VertexAI.pdfIntroSlides-April-BuildWithAI-VertexAI.pdf
IntroSlides-April-BuildWithAI-VertexAI.pdf
Luiz Carneiro
 
Data Structures_Introduction to algorithms.pptx
Data Structures_Introduction to algorithms.pptxData Structures_Introduction to algorithms.pptx
Data Structures_Introduction to algorithms.pptx
RushaliDeshmukh2
 
ELectronics Boards & Product Testing_Shiju.pdf
ELectronics Boards & Product Testing_Shiju.pdfELectronics Boards & Product Testing_Shiju.pdf
ELectronics Boards & Product Testing_Shiju.pdf
Shiju Jacob
 
Smart_Storage_Systems_Production_Engineering.pptx
Smart_Storage_Systems_Production_Engineering.pptxSmart_Storage_Systems_Production_Engineering.pptx
Smart_Storage_Systems_Production_Engineering.pptx
rushikeshnavghare94
 
"Feed Water Heaters in Thermal Power Plants: Types, Working, and Efficiency G...
"Feed Water Heaters in Thermal Power Plants: Types, Working, and Efficiency G..."Feed Water Heaters in Thermal Power Plants: Types, Working, and Efficiency G...
"Feed Water Heaters in Thermal Power Plants: Types, Working, and Efficiency G...
Infopitaara
 
211421893-M-Tech-CIVIL-Structural-Engineering-pdf.pdf
211421893-M-Tech-CIVIL-Structural-Engineering-pdf.pdf211421893-M-Tech-CIVIL-Structural-Engineering-pdf.pdf
211421893-M-Tech-CIVIL-Structural-Engineering-pdf.pdf
inmishra17121973
 
Introduction to Zoomlion Earthmoving.pptx
Introduction to Zoomlion Earthmoving.pptxIntroduction to Zoomlion Earthmoving.pptx
Introduction to Zoomlion Earthmoving.pptx
AS1920
 
new ppt artificial intelligence historyyy
new ppt artificial intelligence historyyynew ppt artificial intelligence historyyy
new ppt artificial intelligence historyyy
PianoPianist
 
"Boiler Feed Pump (BFP): Working, Applications, Advantages, and Limitations E...
"Boiler Feed Pump (BFP): Working, Applications, Advantages, and Limitations E..."Boiler Feed Pump (BFP): Working, Applications, Advantages, and Limitations E...
"Boiler Feed Pump (BFP): Working, Applications, Advantages, and Limitations E...
Infopitaara
 
Main cotrol jdbjbdcnxbjbjzjjjcjicbjxbcjcxbjcxb
Main cotrol jdbjbdcnxbjbjzjjjcjicbjxbcjcxbjcxbMain cotrol jdbjbdcnxbjbjzjjjcjicbjxbcjcxbjcxb
Main cotrol jdbjbdcnxbjbjzjjjcjicbjxbcjcxbjcxb
SunilSingh610661
 
some basics electrical and electronics knowledge
some basics electrical and electronics knowledgesome basics electrical and electronics knowledge
some basics electrical and electronics knowledge
nguyentrungdo88
 
Metal alkyne complexes.pptx in chemistry
Metal alkyne complexes.pptx in chemistryMetal alkyne complexes.pptx in chemistry
Metal alkyne complexes.pptx in chemistry
mee23nu
 
Lidar for Autonomous Driving, LiDAR Mapping for Driverless Cars.pptx
Lidar for Autonomous Driving, LiDAR Mapping for Driverless Cars.pptxLidar for Autonomous Driving, LiDAR Mapping for Driverless Cars.pptx
Lidar for Autonomous Driving, LiDAR Mapping for Driverless Cars.pptx
RishavKumar530754
 
Oil-gas_Unconventional oil and gass_reseviours.pdf
Oil-gas_Unconventional oil and gass_reseviours.pdfOil-gas_Unconventional oil and gass_reseviours.pdf
Oil-gas_Unconventional oil and gass_reseviours.pdf
M7md3li2
 
New Microsoft PowerPoint Presentation.pdf
New Microsoft PowerPoint Presentation.pdfNew Microsoft PowerPoint Presentation.pdf
New Microsoft PowerPoint Presentation.pdf
mohamedezzat18803
 
introduction to machine learining for beginers
introduction to machine learining for beginersintroduction to machine learining for beginers
introduction to machine learining for beginers
JoydebSheet
 
fluke dealers in bangalore..............
fluke dealers in bangalore..............fluke dealers in bangalore..............
fluke dealers in bangalore..............
Haresh Vaswani
 
theory-slides-for react for beginners.pptx
theory-slides-for react for beginners.pptxtheory-slides-for react for beginners.pptx
theory-slides-for react for beginners.pptx
sanchezvanessa7896
 
Machine learning project on employee attrition detection using (2).pptx
Machine learning project on employee attrition detection using (2).pptxMachine learning project on employee attrition detection using (2).pptx
Machine learning project on employee attrition detection using (2).pptx
rajeswari89780
 
IntroSlides-April-BuildWithAI-VertexAI.pdf
IntroSlides-April-BuildWithAI-VertexAI.pdfIntroSlides-April-BuildWithAI-VertexAI.pdf
IntroSlides-April-BuildWithAI-VertexAI.pdf
Luiz Carneiro
 
Data Structures_Introduction to algorithms.pptx
Data Structures_Introduction to algorithms.pptxData Structures_Introduction to algorithms.pptx
Data Structures_Introduction to algorithms.pptx
RushaliDeshmukh2
 
Ad

An efficient and robust parallel scheduler for bioinformatics applications in a public cloud: A bigdata approach

  • 1. Indonesian Journal of Electrical Engineering and Computer Science Vol. 25, No. 2, February 2022, pp. 1078~1086 ISSN: 2502-4752, DOI: 10.11591/ijeecs.v25.i2.pp1078-1086  1078 Journal homepage: http://ijeecs.iaescore.com An efficient and robust parallel scheduler for bioinformatics applications in a public cloud: A bigdata approach Leena Ammanna1 , Jagadeeshgowda2 , Jagadeesh Pujari1 1 Department of Information Science and Engineering, SDM College of Engineering and Technology, Dharwad, India 2 Department of Computer Science and Engineering, Shri Krishna Institute of Technology, Bengaluru, India Article Info ABSTRACT Article history: Received Jul 7, 2021 Revised Dec 21, 2021 Accepted Dec 29, 2021 In bioinformatics, genomic sequence alignment is a simple method for handling and analysing data, and it is one of the most important applications in determining the structure and function of protein sequences and nucleic acids. The basic local alignment search tool (BLAST) algorithm, which is one of the most frequently used local sequence alignment algorithms, is covered in detail here. Currently, the NCBI's BLAST algorithm (stand- alone) is unable to handle biological data in the terabytes. To address this problem, a variety of schedulers have been proposed. Existing sequencing approaches are based on the Hadoop MapReduce (MR) framework, which enables a diverse set of applications and employs a serial execution strategy that takes a long time and consumes a lot of computing resources. The author, improves the BLAST algorithm based on the BLAST-BSPMR algorithm to achieve the BLAST algorithm. To address the issue with Hadoop's MapReduce framework, a customised MapReduce framework is developed on the Azure cloud platform. The experiment findings indicate that the suggested bulk synchronous parallel MapReduce-basic local alignment search tool (BSPMR-BLAST) algorithm matches bioinformatics genomic sequences more quickly than the existing Hadoop-BLAST method, and that the proposed customised scheduler is extremely stable and scalable. Keywords: Bioinformatics genomic sequence Hadoop Microsoft azure Scheduler parallelization This is an open access article under the CC BY-SA license. Corresponding Author: Jagadeeshgowda Department of Computer Science and Engineering, Shri Krishna Institute of Technology Bengaluru, India Email: [email protected] 1. INTRODUCTION According to research [1], [2], cloud computing has established itself as the future norm for data- intensive computing. According to [3], [4] cloud computing platforms allow on-demand access to shared, scalable, fault-tolerant, and reconfigurable computing resources with little administrative work and a low cost. In comparison to freestanding private computer clusters [5], [6] demonstrate how cloud computing platforms are a commonly desired and recognised method for running large data applications or high- performance computations (HPC). The cloud's resource management, virtual computing platforms, and elasticity all contribute to the ease with which data-intensive applications can be migrated to the cloud paradigm. Cai et al. [7], Okur and Büyükkeçeci [8] note that the frameworks required to enable the most efficient use of cloud resources at the lowest possible cost for data-intensive applications remain unresolved. Google’s MapReduce framework discussed in [9] is a commonly used approach for performing distributed computations on the cloud. White [10] and Hadoop [11] present that implementations of Hadoop are based on the MapReduce architecture and are capable of supporting large-scale data applications. The MapReduce model employs a two-stage execution mechanism. The initial stage entails splitting or dividing the data to be
  • 2. Indonesian J Elec Eng & Comp Sci ISSN: 2502-4752  An efficient and robust parallel scheduler for bioinformatics … (Leena Ammanna) 1079 processed into little pieces. Each small bit of data is associated with a mapper. The mapper outputs ⟨𝐾𝑒𝑦|𝑉𝑎𝑙𝑢𝑒⟩ pairs that are arranged according to the 𝐾𝑒𝑦values. The Reduce workers are provided with the sorted values using the ⟨𝐾𝑒𝑦|𝑆𝑜𝑟𝑡𝑒𝑑𝐿𝑖𝑠𝑡(𝑉𝑎𝑙𝑢𝑒)⟩ method. The Hadoop distributed file system is used to store the results of the reduce workers hadoop distributed file system (HDFS). In most public cloud setups, the Map and Reduce workers are virtual machines (VMs). The researchers recognised the disadvantages of Hadoop MapReduce and examined several improvement strategies. Zhu et al. [12] discusses the use of Hadoop with CUDA to increase processing capacity and execution speed. Zhu et al. [12] proposes a research paradigm that facilitates appropriate use of graphics processing units (GPU). Dahiphale et al. [13] presents a cloud-based MapReduce model. Pipelining is used to improve execution performance and to enable elastic pricing. For scientific applications [14] proposed a framework based on parallel tempering called replica exchange statistical temperature molecular dynamics. Tang et al. [15] describe a unique dynamic Hadoop slot allocation technique for resolving the Hadoop resource provisioning problem that is not optimal. The result reported in Tang et al. [15] demonstrates that their technique significantly improves Hadoop's performance when handling many MapReduce jobs. The most often utilised technique is to increase execution effectiveness through scheduling practises, as described in [16]-[19]. Wang et al. [20] presented a method for computing "Imprecise Applications" using MR frameworks. Reduce is run after the Map step in predictable MapReduce applications. In the case of inaccurate applications, the Reduce stage might be initiated using the Map stage's partial findings. Applications such as word frequency statistics and hot-word identification are considered imprecise. Executing inaccurate apps on well- known machine learning frameworks like Hadoop introduces latency. Running erroneous apps might result in additional expenditures on public cloud systems, where users are charged for all computations and storage services consumed. To address this shortcoming incorporated the check phase into the MapReduce framework in order to reduce costs and execution latencies [20]. In MR frameworks, iterative applications and certain graph-based applications performed poorly. Google introduced the Pregel framework for cloud computations in [21] for similar applications. Pregel is built on valiant's bulk synchronous parallel (BSP) computation model [22]. Kajdanowicz et al. [23] demonstrate that the Pregel framework outperforms MapReduce when calculating graph-based applications. Apart from the serial execution mechanism utilised in MR, several concerns remain unresolved, including node failure management capabilities, scheduling strategies, and multiway join processes. The authors provide a framework for parallel computation MapReduce (PMR) in public cloud environments in this article. The MapReduce architecture of the BLAST-BSPMR enables parallel computation to accelerate the execution of the BSP model. The BLAST-BSPMR implements Map and Reduce workers by utilising Microsoft Azure VM computing environments. Multicore processors enable parallel computation in these virtualized computing environments. The BLAST-BSPMR proposal leverages this parallel execution feature to drastically lower the Map and Reduce worker nodes' computation times. In comparison to other MapReduce frameworks like Hadoop, the BLAST-Reduce BSPMR's phase is initiated when two or more worker nodes have completed their jobs. The BLAST-BSPMR function shown here is used to calculate the results of bioinformatics BLASTx applications. Several strategies and scheduler are proposed by researchers to improve the performance of the MapReduce cloud computing framework proposed by Google in Dean et al. [9]. The work presented in Zhu, et al. [12] endures the closest similarity to our work presented here. Using CUDA codes the Map and Reduce worker tasks execute parallely on the GPU’s is achieved. The integration of Hadoop and GPU is achieved using the Hadoop Streaming, Pipes, JCuda and JNI approaches. The experimental investigation given in Jie Zhu et al. [12] demonstrates the effectiveness of the suggested approach on a private heterogeneous cloud environment using the word count application. The execution efficiency of the JCuda approach over Hadoop Streaming, Pipes and JNI is also proved. The major drawback of this approach is that such computational models are not suited for public cloud environments as GPU based VM environments aregenerally not offered. Public cloud environments like Amazon EC2 that offer such GPU based virtualized computing environments provide it at very high costs. Dahiphale et al. [13] described the drawbacks of the conventional MapReduce frameworks as followsthe MapReduce framework adopts a sequential processing of the Map and Reduce stages. The scalability of the MapReduce is limited. The MapReduce framework provides no support for flexible pricing options. The MapReduce model provides no support for computing streaming data. To overcome these drawbacks a pipelined model is presented to parallelize the execution of the Map and Reduce phase. The MapReduce model proposed in [13] is realized on the Amazon public cloud. The spot instance offering of the Amazon cloud allows flexible pricing. The experimental study given here demonstrates the efficiency of the pipelining-based MapReduce model when compared to the conventional MapReduce model using the word count application. The major drawback of the model proposed in [13] is that the locality optimization is not
  • 3.  ISSN: 2502-4752 Indonesian J Elec Eng & Comp Sci, Vol. 25, No. 2, February 2022: 1078-1086 1080 considered and hosting of additional data dependent applications like Smith Waterman and other bioinformatics application cannot be executed in a pipelined fashion. The computation of certain data intensive applications like graph applications and iterative applications on MapReduce frameworks exhibit high computation time and cost. To support such applications, Google presented a proprietary cloud computing framework named Pregel that embraces the 𝐵𝑆𝑃 computing model [21]. In Pregel, the graph computations are achieved using a set of super-steps. A super step is used to execute the user defined application or function in a parallel fashion using the data item from the database. Each data item from the database behaves as an agent. The Pregel system adopts vertex-centric execution strategy. The computation of each data item has a graph like representation in 𝐵𝑆𝑃. The vertexes in the Pregel deactivate post the computation operation and are reactivated only if additional data items are presented to them. Once all the vertices are deactivated the computation is said to be complete. The local storage of the data items in the nodes executing the computation poses a problem. In the case the data item is large then a spilling- to-disk technique needs to be in [23]. The MapReduce framework is been adopted by the Hadoop framework in Nguyen et al. [24] for computing on the cloud platform. The MapReduce paradigm employs a two-phase technique. The first phase divides the input data into little bits of data to be processed. Each small bit of data is associated with a mapper. The mapper outputs a<Key, Value> pair that is sorted according to the Key values. Thereducer takes these sorted values as the <Key│SortedList (Value)>. These sorted values are stored in the Hadoop Distributed File System. Schatz [25] and CloudBurst make use of the Hadoop MapReduce framework as the computing platform. These alignment tools perform effectively for small base pair alignments requiring single-gap or un- gapped alignment. However, when huge base pairs are considered, these aligners perform poorly. Because all present cloud framework sequence aligners take the Hadoop framework into account, Hadoop suffers when iterative applications are hosted on the cloud framework. When considering multiway joins, Hadoop performs poorly. Since Hadoop framework executes sequentially, again there is degradation in the performance. The 𝐵𝑆𝑃 based Pregel framework exhibits lower computation time for selective applications. In this paper the 𝐵𝐿𝐴𝑆𝑇 − 𝐵𝑆𝑃𝑀𝑅 framework incorporates the MapReduce architecture and an execution strategy is conducted in parallel fashion as observed in the 𝐵𝑆𝑃. 2. METHOD A cloud platform, BLAST-BSPMR is proposed that allows to apply bioinformatics application like gene sequencing. The BLAST-BSPMR uses MapReduce framework which works on a cloud computing environment. Azure VM is the Azure infrastructure as service (IaaS) which is used to deploy persistent VMs. Thus, the Map and Reduce worker nodes in BLAST-BSPMR are deployed on Azure VMs. Alignment of genomic sequences is performed using the BLASTxmethod. Here in BLASTx, a DNA query is compared to a protein database. The BLAST-BSPMR algorithm performs sequence alignment in two stages, namely the Map and Reduce phases. SinceHadoop framework is sequential, only when the Map phase is completed the Reduce phase gets executed. To prevent sequential execution, BLAST-BSPMR considers parallelizing the Map and Reduce phases. The Map and Reduce functions are meant to be executed in parallel and to make optimal use of the cores available in the worker VMs. 2.1. Basic local alignment search tool (BLAST) algorithm Basic local alignment search tool (BLAST) is used to compare a query sequence with the database sequence to find similar sequences in the database. The BLAST consists of various sequence searching programmes (including BLASTn, tBLASTn, BLASTx, BLASTp, tBLASTp, and tBLASTx). Each application varies in its functionalities, but the core of the algorithm for all of them is basically the same. The BLAST algorithm consists of three steps. In the initial step, called as building the word list, the query sequence is split intosmall words of fixed size (w-mers), often w being 11. Therefore, if the length of the query sequence is n, then list of n-w+1 words are being constructed. This step is illustrated in Figure 1. The second stage, referred to as the scan for hits, is where the BLAST algorithm looks for matches to the words (referred to as hits) in the database sequence. The BLAST algorithm employs the 'two-hit' strategy. If two non-overlapping word pairs are found to be within a distance D of one another, an extension is initiated. The next stage, dubbed extend the hits, involves the BLAST algorithm expanding the results in order to find longer related segment pairs. When the "two-hit" requirement is met, a gap-free extension in two directions is performed to locate an alignment known as a high-scoring segment pair (HSP). Figure 2 gives an example of gap-free extension. This is the seed-and-extend heuristic strategy for identifying high-scoring gene sequence alignments between the genomic query sequence and the database's genomic sequences.
  • 4. Indonesian J Elec Eng & Comp Sci ISSN: 2502-4752  An efficient and robust parallel scheduler for bioinformatics … (Leena Ammanna) 1081 Figure 1. Build word list Figure 2. Build word list 2.2. BLASTon the proposed parallelized MapReduce framework The genomic sequence database was initially split into trivial chunks of static size and scheduled over the virtual Azure computing nodes in the current work. Numerous computing nodes can process the chunks in parallel. The three core BLAST processes (building word lists, scanning, and extension) were then reorganised and run concurrently. It takes longer in two stages, according to the BLAST algorithm: searching for matching terms in the genome database and extending the seeds. Thus, the proposed BLAST-BSPMR framework is employed to parallelize the BLAST algorithm, thereby increasing the calculation's overall efficiency. The BLAST-BSPMR technique is implemented in three stages: data storage in Azure blob storage, genomic sequence preparation, and genomic sequence parallelization. The first stage involves transferring the genomic database and query sequence to Azure blob storage. The second level preprocesses each computational worker node's genetic data. The third stage performs further accurate matches, extension, and statistics on the preprocessed sequence seeds prior to giving them as input to the BLAST-BSPMR algorithm. The sections that follow describe the suggested method's several stages. 2.2.1. Pre-processing of gene sequence data As stated in the basic procedure of the NCBI BLAST algorithm, the first criterion the BLAST algorithm is to generate a word list for the BLAST-BSPMR method. The purpose of identifying words with a high correlation coefficient during the pre-processing of genomic sequence data is to simplify calculations or computations concerning accurate/correct matches and seed expansion. This technique accelerates the third phase of sequence alignment. Due to the fact that the research query data consists of nucleic acid sequences, the lengths of the genomic query and database sequences are x and y, respectively, and the number of words is x-10 and y-10. The following stages must be followed in the pre-processing of genomic sequence data: Stages of mapping, sorting, shuffle, and reduction. The author generates a statistic of nucleobases (ATGC) with the highest occurrence number using a Map scheme on each word in the Map stage, such as <T,4>, where T indicates the most often occurring nucleobase and 4 denotes the number. The output of Map is sorted by the frequency of each nucleobase during the sort stage. In the shuffle phase, the sort stage's results are jumbled and merged. If the frequency of nucleobase T in the word𝑞𝑠𝑖(i = 0, 11, … . , 10y + 1, … . ) in the genomic query sequence is num1 and the count of nucleobase T in the word𝑑𝑏𝑖(i = 0, 11, … . , 10y + 1, … . ) in the genomic database sequence is num2, 𝑞𝑠𝑖will combine with all𝑑𝑏𝑖, and rank them according to the size of num1+num2. List of all𝑑𝑏𝑖combined with𝑞𝑠𝑖as items in the BLAST-BSPMR algorithm's word list in the final stage of Reduce, based on the output of the shuffle stage. The following step is to save the results to Azure Blob Storage for later analysis. 2.2.2. Alignment of gene sequences in parallel The word list for the scanner is produced from preprocessed genomic sequencing data. The three basic steps in the parallel alignment of genomic sequences using a word list and a scanner are exact word matches, seed expansion, and statistical significance during seed expansion. To facilitate parallelization of genomic sequence alignment, the generated word list and scanner should be distributed among a large number of virtual processing nodes. The flow chart in Figure 3 depicts the parallelization of the BLAST- BSPMR method with MapReduce.
  • 5.  ISSN: 2502-4752 Indonesian J Elec Eng & Comp Sci, Vol. 25, No. 2, February 2022: 1078-1086 1082 Figure 3. The Parallelized BLAST-BSPMR model 2.2.3. The steps required in aligning the BLAST-BSPMR method in parallel a) Each Azure virtual computer node will receive the list of words and scanner, and the gene sequence willbe aligned using a BLAST-BSPMR MapReduce Job. b) During the Map phase, each Map task reads the scanner and a list of words from local memory to build accurate matches for items with a high correlation; items that satisfy the matching conditions are saved as words in the following extension. c) Following scanning, the restored bits from second step 2) will go through three processes in eachcomputing node: they will initiate a simple match extension without space, they will use DP to find amatch extension with vacancies, and they will finally obtain alignment results that satisfy the givenconditions. d) The results of the comparison will be arranged according to the score acquired during the shuffle phase. e) The Reduce phase transfers the genomic sequence alignment findings to Azure Blob Storage in their original format. 2.3. Proposed BLAST alignment on the parallelized MapReduce model BLAST-BSPMR Let 𝒟ℬ denote a bioinformatics sequence of the genome database and 𝒬𝓈 the genomic query sequence. The 𝐵𝐿𝐴𝑆𝑇 − 𝐵𝑆𝑃𝑀𝑅 is positioned on a public cloud platform namely azure consist of a master node, Map and Reduce worker computing nodes. The master computing node of 𝐵𝐿𝐴𝑆𝑇 − 𝐵𝑆𝑃𝑀𝑅 initializes 𝒸𝓌 Map and Reduce worker computing nodes using the virtual computing nodes. Every virtual computing node is presumed to 𝒞CPU cores available for task/job computation. Let 𝕋𝔼𝑉𝑀_𝑖𝑛𝑧represent the time spent initialising the virtual platform. The sequence 𝒟ℬ is a bioinformatics database that is divided into𝒟ℬ′ tiny chunks of sequence data with overlapping portions. A database's short bit of data sequence and the 𝒬𝓈 are then transmitted as input key, value pairs to the Map computing nodes. The key value pairings that take into account the reference small bit of data 𝒟ℬ′ are denoted as (𝑘𝑑, 𝑣𝑑), where 𝑘𝑑 is the key and 𝑣𝑑 which contains the genomic data with overlapping offsets. The key-value pair corresponding to the genome query sequence 𝒬𝓈 is indicated by (𝑘𝑞𝓈, 𝑣𝑞𝓈), where 𝑘𝑞𝓈 is the key and 𝑣𝑞𝓈 is the genome query sequence. Each of the 𝒸𝑤 computing Map workers divides the query sequence 𝒬𝓈 into 𝒬𝓈′ little bits of data and stores them in the available local memory storage. The sequence alignment is performed using the BLASTx algorithm in parallel using the𝒞cores taking into account 𝒟ℬ′ and each and every 𝒬𝓈′ . The BLASTx algorithm is parallelized to reduce the time required to complete genomic sequence alignments. Let 𝕋𝔼𝕄𝕊 denote the average time taken by the 𝒸𝓌 Map compute nodes to complete a task. The mappers give alignment positions between 𝒬𝓈 and database small bit of data of 𝒟ℬ′ along with the computed score as a result of their post computations. Multiple alignment places and computed score, i.e., 𝓋𝓂𝓈 combined with the small bit of data id of 𝒟ℬ′ i.e., 𝓀𝓂𝓈 are saved in the memory in the following manner: 𝑙𝑖𝑠𝑡(𝓀𝓂𝓈, 𝓋𝓂𝓈). The BLAST- BSPMR Mapfunction is represented as 𝑚𝑎𝑝((𝓀𝒹, 𝓋𝒹), (𝑘𝑞𝓀𝓆𝓈, 𝓋𝓆𝓈)) → 𝑙𝑖𝑠𝑡(𝓀𝓂𝓅, 𝓋𝓂𝓅). The 𝒸𝓌 Map computing worker nodes obtain intermediate genomic data i.e.,𝑙𝑖𝑠𝑡(𝓀𝓂𝓅, 𝓋𝓂𝓅) and perform shuffle and sort operations. The reduce step considers the collection of all non-overlapping and non-redundant
  • 6. Indonesian J Elec Eng & Comp Sci ISSN: 2502-4752  An efficient and robust parallel scheduler for bioinformatics … (Leena Ammanna) 1083 alignment results, i.e., 𝑙𝑖𝑠𝑡(𝓋𝔤𝒹). Allow 𝕋𝔼ℝ𝕊 is the average time spent by the 𝒸𝓌 Reduce computing worker nodes to complete the storage procedure (gather the results). The total time required for the BLAST-BSPMR to align the genomic sequence using a cloud-based platform 𝒬𝓈 against the database, 𝒟ℬ is calculated as, 𝔼 = 𝕋𝔼𝑉𝑀_𝑖𝑛𝑧 + 𝕋𝔼𝕄𝕊 + 𝕋𝔼ℝ𝕊 . (1) the BLAST-BSPMR cloud platform is depicted in Figure 3. 3. RESULTS AND DISCUSSION The machine was configured with a 64-bit version of Windows 10 Enterprise, 16 GB of RAM, and an i-5 quad core processor. We conducted an experimental investigation comparing the proposed BLAST-BSPMR gene sequencing model to the existing Hadoop-BLAST sequencing model on the following parameters: speedups, throughput, and sequence alignment completion time using the dot net framework 4.0 and C# 6.0 programming language for the proposed work and java programming language for the existing Hadoop. It is considered to compare the performance of BLAST-BSPMR with Hadoop-BLAST. With a single VM computing node, Hadoop-BLAST and BLAST-BSPMR can be deployed. The Microsoft Azure cloud platform hosts the BLAST-BSPMR. The Hadoop-BLAST application is built on top of the Hadoop MapReduce framework. The Hadoop-BLAST is deployed using Apache Hadoop & YARN 2.6.0. In the deployments, identical configurations of VM computing nodes are taken into account. The𝐵𝐿𝐴𝑆𝑇 − 𝐵𝑆𝑃𝑀𝑅 model is hosted on Microsoft's Azure cloud considering A3 VM instances. Each A3 VM instance consists of 4 virtual computing cores, 7 GB of RAM and 120 GB of local hard drive space. The𝐵𝐿𝐴𝑆𝑇 − 𝐵𝑆𝑃𝑀𝑅model deployed on the Azure cloud platforms are comprised of a master node and a worker node for the purpose of performing Map and Reduce operations/tasks. Using Azure HDInsight. It enables deployment and provisioning of Apache Hadoop clusters on the Azure cloud platform. The Apache Hadoop & YARN version 2.6.0 is considered for performance evaluation. The master node of the azure cluster runs on the Windows Server 2012 R2 operating system. One worker node of A3 VM instances is considered for the Hadoop deployment. For evaluation, the non-redundant protein genomic database is used. The initial investigations employ a continuous non-redundant protein genomic sequence and four nucleotide query sequences of varying sizes. Thefirst experiment conducted with the reference and varied query genomic sequences are summarized in Table 1. Figure 4 shows the execution time and Figure 5 shows the speedup and throughput and Figure 6 shows the average speedup and throughput. The results show that the proposed BLAST-BSPMR aligner, which is running on Azure, outperforms Hadoop-BLAST. For both BLAST- BSPMR and Hadoop-BLAST, the execution time increases as the query file length grows. The second experiment is conducted using varied non-redundant protein genomic sequence and constant query sequences of nucleotide. The total time (including the Map and Reduce stages) taken to execute the alignment is monitored using Table 2 for the BLASTx sequence alignment computation on the BLAST-BSPMR and Hadoop-BLAST. The results obtained are summarized in Table 3. The total time required to complete sequence alignment (as depicted in Figure 7) is recorded. The results show that the proposed BLAST- BSPMR aligner, which is running on Azure, outperforms Hadoop-BLAST. Figure 8 shows the speedup and throughput for the experiment l and Figure 9 depicts the average speedup and throughput obtained for the experiment 2. As the database file size increases, the execution time also increases for both BLAST-BSPMR and Hadoop-BLAST. Table 1. Experiment data used to compare BLAST-performance BSPMR's to that of BLAST-Hadoop with varied size of query genome No Database Genome Size (GB) Query Genome Size (kb) 1 nr.01(non-redundant protein) 2.99 Nucleotide sequence 16 2 nr.01(non-redundant protein) 2.99 Nucleotide sequence 32 3 nr.01(non-redundant protein) 2.99 Nucleotide sequence 64 4 nr.01(non-redundant protein) 2.99 Nucleotide sequence 128 Table 2. Experiment data used to compare BLAST-performance BSPMR's to that of BLAST-Hadoop with varied size of database genom No Database genome Size (GB) Query genome Size (kb) 1 nr.40 (non-redundant protein) 0.078 Nucleotide sequence 128 2 nr.04 (non-redundant protein) 0.299 Nucleotide sequence 128 3 nr.05 (non-redundant protein) 1.075 Nucleotide sequence 128 4 nr.01 (non-redundant protein) 2.99 Nucleotide sequence 128
  • 7.  ISSN: 2502-4752 Indonesian J Elec Eng & Comp Sci, Vol. 25, No. 2, February 2022: 1078-1086 1084 Figure 4. Sequence alignment execution time Figure 5. Speedup and throughput Figure 6. Average speedup and throughput Table 3. Speedup and throughput for varied query genome Seg. File Size Speedup Throughput 16 KB 4.1 76% 32 KB 2.6 62% 64 KB 2.14 53% 128 KB 2.07 53% Average 2.72 51% 0.078 GB 7.067 85.0% 0.299 GB 3.35 70.0% 1.075 GB 2.16 53.8% 2.99 GB 2.07 51.8% Average 3.66 65.15% Figure 7. Sequence alignment execution time for varied non-redundant protein database Figure 8. Speedup and throughput for varied non-redundant protein database
  • 8. Indonesian J Elec Eng & Comp Sci ISSN: 2502-4752  An efficient and robust parallel scheduler for bioinformatics … (Leena Ammanna) 1085 Figure 9. Average speedup and throughput non-redundant protein database 4. CONCLUSION Aligning genomic sequences is a straightforward technique for managing and analysing bioinformatics data. The author describes the proper methods for aligning sequences using the BLAST algorithm, which is the most widely used tool for aligning local sequences. Currently, the BLAST algorithm given by NCBI (stand-alone) is unable to handle dynamic biological data in the terabyte range. Cloud platforms are used to address issues such as data storage and data-intensive computation. Biosequencing analysis of genetic data is a critical application. In this study, the author evaluates many existing schedulers. The existing scheduler has issues which are expensive with aligning genomic data sequences. Existing bioinformatics sequence aligners that make use of or embrace Hadoop MapReduce suffer from the concerns discussed in this study. In this paper, the proposed BLAST-BSPMR algorithm is employed to align sequences. For biosequence alignment, the BLASTx algorithm is used, and a parallel MapReduce execution approach based on Azure cloud is used in the BLAST-BSPMR cloud platform. The Map and Reduce framework are executed in parallel to take use of the cloud computing platform's virtual machine-based design. Additionally, the research compares the proposed BLAST-BSPMR technique to previously published system sequence alignments. Experiments are carried out with a variety of non-redundant proteins and nucleotide query sequence files of various sizes. Experiments are presented to demonstrate the efficiency of the BLASTx algorithm. Through an experimental study, a comparison with Hadoop-BLAST for sequence alignment is presented. When comparing the BLAST-BSPMR results to the Hadoop-BLAST results, the BLAST-BSPMR results show a significant improvement. The authors plan to test their theory on a variety of redundant databases in the future such as SwissProt protein, and REFSEQ and also run varied application that are available in NCBI such as BLASTn, and BLASTp on our proposed BSPMR Scheduler to further analyse the robustness and efficacy of our scheduler. REFERENCES [1] D. C. Marinescu, "Parallel and Distributed Computing: Memories of Time Past and a Glimpse at the Future," 2014 IEEE 13th International Symposium on Parallel and Distributed Computing, 2014, pp. 14-15, doi: 10.1109/ISPDC.2014.33. [2] Gartner.Inc, Gartner Forecasts Worldwide Public Cloud End-User Spending to Grow 23% in 2021, 2021. Accessed: Apr. 21, 2021. [Online]. Available: https://www.gartner.com/en/newsroom/press-releases/2021-04-21-gartner-forecasts-worldwide-public- cloud-end-user-spending-to-grow-23-percent-in-2021 [3] W. Hassan, T. S. Chou, T. Omar, and J. Pickard, “Cloud computing survey on services, enhancements and challenges in the era of machine learning and data science,” Int J Inf & Commun Technol., vol. 9, no. 2, pp. 117-139, 2020, doi: 10.11591/ijict.v9i2. [4] W. Hassan, T.-S. Chou, X. Li, P. A.-Kubi, and O. Tamer, “Latest trends, challenges and solutions in security in the era of cloud computing and software defined networks,” Int J Inf & Commun Technol., vol. 8, no. 3, pp. 162-183, 2019, doi: 10.11591/ijict.v8i3.pp162-183. [5] P. Mehrotra, J. Djomegri, S. Heistand, and R. Hood, “Performance evaluation of Amazon EC2 for NASA HPC applications," Proceedings of the 3rd workshop on Scientific Cloud Computing, 2012, pp. 41-50, doi: 10.1145/2287036.2287045. [6] A. Gupta et al., "Evaluating and Improving the Performance and Scheduling of HPC Applications in Cloud," in IEEE Transactions on Cloud Computing, vol. 4, no. 3, pp. 307-321, 1 July-Sept. 2016, doi: 10.1109/TCC.2014.2339858. [7] Z. Cai, X. Li, and J. N. D. Gupta, "Heuristics for Provisioning Services to Workflows in XaaS Clouds," in IEEE Transactions on Services Computing, vol. 9, no. 2, pp. 250-263, 1 March-April 2016, doi: 10.1109/TSC.2014.2361320. [8] M. C. Okur and M. Büyükkeçeci, "Big data challenges in information engineering curriculum," 2014 25th EAEEIE Annual Conference (EAEEIE), 2014, pp. 1-4, doi: 10.1109/EAEEIE.2014.6879372. [9] J. Dean and S. Ghemawat, "MapReduce: simplified data processing on large clusters,” Communications of the ACM, vol. 51, no. 1, pp. 107-113, 2008, doi: 10.1145/1327452.1327492. [10] T. White, Hadoop: The definitive guide, O’Reilly Media, Inc., 2012. [11] A. Hadoop, 2020. Accessed: Sep. 21, 2021. [Online]. Available: http://hadoop.apache.org
  • 9.  ISSN: 2502-4752 Indonesian J Elec Eng & Comp Sci, Vol. 25, No. 2, February 2022: 1078-1086 1086 [12] J. Zhu, J. Li, E. Hardesty, H. Jiang, and K. Li, "GPU-in-Hadoop: Enabling MapReduce across distributed heterogeneous platforms," 2014 IEEE/ACIS 13th International Conference on Computer and Information Science (ICIS), 2014, pp. 321-326, doi: 10.1109/ICIS.2014.6912154. [13] D. Dahiphale et al., "An Advanced MapReduce: Cloud MapReduce, Enhancements and Applications," in IEEE Transactions on Network and Service Management, vol. 11, no. 1, pp. 101-115, March 2014, doi: 10.1109/TNSM.2014.031714.130407. [14] P. Kondikoppa et al., "MapReduce-Based RESTMD: Enabling Large-Scale Sampling Tasks with Distributed HPC Systems," 2014 6th International Workshop on Science Gateways, 2014, pp. 30-35, doi: 10.1109/IWSG.2014.12. [15] S. Tang, B. Lee, and B. He, "DynamicMR: A Dynamic Slot Allocation Optimization Framework for MapReduce Clusters," in IEEE Transactions on Cloud Computing, vol. 2, no. 3, pp. 333-347, 1 July-Sept. 2014, doi: 10.1109/TCC.2014.2329299. [16] M. Zaharia, A. Konwinski, A. D. Joseph, R. Katz, and I. Stoica "Improving MapReduce performance in heterogeneous environments,” Osdi, vol. 8, no. 4, pp. 29-42, 2008, doi: 10.5555/1855741.1855744. [17] Y. Tao, Q. Zhang, L. Shi, and P. Chen, "Job Scheduling Optimization for Multi-user MapReduce Clusters," 2011 Fourth International Symposium on Parallel Architectures, Algorithms and Programming, 2011, pp. 213-217, doi: 10.1109/PAAP.2011.33. [18] C. Kaushal and D. Koundal, “Recent Trends in Big Data using Hadoop,” International Journal of Informatics and Communication Technology,” International Journal of Informatics and Communication Technology, vol. 8, no. 1, pp. 39-49, 2019, doi: 10.11591/ijict.v8i1.pp39-49. [19] M. Zaharia, D. Borthakur, and J. S. Sarma, "Delay scheduling: a simple technique for achieving locality and fairness in cluster scheduling,” Proceedings of the 5th European conference on Computer systems, 2010, pp. 265-278, doi: 10.1145/1755913.1755940. [20] C. Wang et al., "Mapcheckreduce: an improved mapreduce computing model for imprecise applications,” IEEE International Congress on Big Data, 2014, doi: 10.1109/BigData.Congress.2014.61. [21] G. Malewicz, M. H. Austern, A. J. C. Bik, and J. C. Dehnert, "Pregel: a system for large-scale graph processing,” Proceedings of the 2010 ACM SIGMOD International Conference on Management of data, 2010, pp. 135-146, doi: 10.1145/1807167.1807184. [22] L. G. Valiant, "A bridging model for parallel computation,” Communications of the ACM, vol. 33, no. 8, pp. 103-111, 1990, doi: 10.1145/79173.79181. [23] T. Kajdanowicz, W. Indyk, P. Kazienko, and J. Kukul, "Comparison of the Efficiency of MapReduce and Bulk Synchronous Parallel Approaches to Large Network Processing," 2012 IEEE 12th International Conference on Data Mining Workshops, 2012, pp. 218-225, doi: 10.1109/ICDMW.2012.135. [24] T. Nguyen, W. Shi, and D. Ruden, "CloudAligner: A fast and full-featured MapReduce based tool for sequence mapping,” BMC research notes, vol. 4, no. 1, pp. 1-7, 2011, doi: 10.1186/1756-0500-4-171. [25] M. C. Schatz, “CloudBurst: highly sensitive read mapping with MapReduce,” Bioinformatics, vol. 25, no. 11, 2009, doi: 10.1093/bioinformatics/btp236. BIOGRAPHIES OF AUTHORS Leena Ammanna is currently an Assistant Professor in the department of Information Science and Engineering, SDM College of Engineering and Technology, Dharwad, India. She received B.E. and M.Tech. Degrees from Visvesvaraya Technological University, Karnataka, India. She has 20 years of experience in teaching and research. She can be contacted at [email protected]. Jagadeeshgowda is a Professor and Head of the Department of Computer Science and Engineering, Shri Krishna Institute of Technology, Bengaluru, India. He has an experience of 24 years in research, academics and industry. He received his Ph.D. from University of JJT,India. Image processing,big data analytics, and network security are his major research interests. He has wide publications in reputed international conferences and journals. He has supervised four Ph.D. and 50 plus Post Graduate students. He has received grants from Government of India. He can be contacted at [email protected]. Jagadeesh Pujari is a Professor and Head of the Department of Information Science and Engineering, SDM College of Engineering and Technology, Dharwad, India. He has an experience of 32 years in research, academics and industry. He received his Ph.D. from University of Gulbarga, Karnataka, India.His research interests span image processing, pattern recognition, big data analytics, and machine learning. He has wide publications in reputed international conferences and journals. He has supervised five Ph.D. and 80 plus Post Graduate students. He is the investigator for several projects funded by Government of India and Industries. He can be contacted at [email protected].