Bigdata-cloud computing A K Mishra

The document discusses big data analytics, emphasizing its significance in agricultural sciences and the challenges faced in handling large datasets. It outlines the four layers of big data, the role of cloud computing in data analysis, and various tools and services available for bioinformatics. Additionally, it highlights the advantages and disadvantages of cloud computing, particularly in processing large datasets and the emergence of applications like CloVR for automated sequence analysis.


Big data analytics

Dr. A. K. Mishra
Principal Scientist
ICAR-Indian Agricultural Research Institute, New Delhi
WHAT IS BIG DATA?

• Every day we create 2.5 quintillion bytes (2.5 exabytes; 1 EB = 10^18 bytes) of data; 90% of the data in the world today has been created in the last two years alone.
• The data comes in various forms such as documents, emails, images, graphs, videos, personal information, transaction data and much more, obtained from various new technologies.
• Big data analytics is a modern technique for collecting, organizing, maintaining and analyzing large datasets to discover new patterns and to uncover information hidden inside the data, using remote servers and the internet.
• Real-time applications of big data analytics are now emerging.
Traditional Data Vs Big Data
IN AGRICULTURAL SCIENCES

Agri-engineers are joining the big-data club.

With the advent of high-throughput technologies, engineers are starting to grapple with massive data sets, encountering challenges with
• handling
• processing
• sharing information

Challenges in handling agricultural data
• large volume
• high throughput
• relating and linking
• complexity
• heterogeneity
The four layers of Big Data
• Data sources layer – This is where the data arrives at your organization. It includes everything from your sales records, customer database and feedback to social media channels.
• Data storage layer – This is where the big data is stored. Sophisticated but accessible systems and tools have been developed for data storage, such as the Apache Hadoop distributed file system (HDFS) or the Google File System (GFS).
• Data processing/analysis layer – To find out something useful, we need to process and analyze the data. A common method is to use a MapReduce tool.
• Data output layer – This is how the insights gleaned through the analysis are passed on to the people who can take action to benefit from them. The output can take the form of reports, charts, figures and key recommendations.
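The four layers above can be sketched end-to-end as a toy pipeline. This is a minimal illustrative sketch in Python; all function names and records are assumptions for the example, not part of any real system:

```python
# Minimal sketch of the four big-data layers as a pipeline.
# All names and data here are illustrative assumptions.

def source_layer():
    # Data sources layer: raw records arriving at the organization.
    return [
        {"channel": "sales", "amount": 120},
        {"channel": "social", "amount": 0},
        {"channel": "sales", "amount": 80},
    ]

def storage_layer(records):
    # Storage layer: in a real system this would be HDFS or GFS;
    # here we simply keep the records in memory.
    return list(records)

def processing_layer(store):
    # Processing/analysis layer: aggregate amounts per channel
    # (at scale a MapReduce job would do this step).
    totals = {}
    for rec in store:
        totals[rec["channel"]] = totals.get(rec["channel"], 0) + rec["amount"]
    return totals

def output_layer(totals):
    # Output layer: turn the insight into a human-readable report.
    return [f"{channel}: {amount}" for channel, amount in sorted(totals.items())]

report = output_layer(processing_layer(storage_layer(source_layer())))
print(report)  # → ['sales: 200', 'social: 0']
```

Each function stands in for one layer, so swapping a layer (e.g. replacing the in-memory store with a distributed file system) leaves the rest of the pipeline unchanged.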
Analytics Models (ordered by increasing difficulty and value)
• Descriptive Analytics – What happened?
• Diagnostic Analytics – Why did it happen?
• Predictive Analytics – What will happen?
• Prescriptive Analytics – How can we make it happen?
How big data analytics works

Once the data is ready, it can be analyzed with the software commonly used for advanced analytics processes. That includes tools for:
• data mining, which sifts through data sets in search of patterns and relationships;
• predictive analytics, which builds models to forecast data behavior and other future developments;
• machine learning, which taps algorithms to analyze large data sets; and
• deep learning, a more advanced offshoot of machine learning.
Challenges in Biological Data Analysis

• Multiple comparisons issue – a higher number of false positives than true positives
• High-dimensional biological data – difficult to discriminate between two classes of data
• Small "n", large "p" problem – the number of samples (n) << the number of parameters to predict (p) in biological data
• Computational limitations – limits of hardware, RAM, number of processors, etc.
• Noisy high-throughput data – sources of error and difficulty in controlling all experimental parameters
• Integration of multiple heterogeneous biological data – various forms of data, with the pros and cons of redundancy
Solution to Big Data analysis – Cloud Computing
• It is a method for storing and accessing data and programs over the Internet instead of your computer's hard drive.
• A type of computing that relies on sharing computing resources rather than having local servers or personal devices.
• The word cloud (also phrased as "the cloud") is used as a metaphor for "the Internet", hence "internet-based computing".
• Different services such as servers, storage and applications are delivered to an organization's computers through the Internet.
• Cloud computing is comparable to grid computing: all computers in a network are harnessed to solve problems too intensive for any stand-alone machine.
• It provides a platform or service through the internet and needs minimal hardware and software installed locally.
• The network may be a LAN or WAN, on which applications or infrastructure are deployed remotely and users can access them to meet their needs.

Cloud Computing Architecture
Evolution in Cloud Computing
• Grid Computing – solving large problems with parallel computing
• Utility Computing (1990) – offering computing resources as a metered service
• Software as a Service – network-based subscriptions to applications
• Cloud Computing (2008) – anytime, anywhere access to IT resources delivered dynamically as a service
ASHOKA: Advanced Supercomputing Hub for Omics Knowledge in Agriculture
India's first supercomputing facility for agriculture, at ICAR-IASRI, New Delhi

http://topsupercomputers-india.iisc.ernet.in/jsps/june2022/index.html
Open Source Software & Tools
• Fifty-two open source software packages and tools are configured on this HPC environment to carry out various biological data analyses.
• These software packages and tools were identified based on an online survey conducted among researchers from National Agricultural Research and Education System (NARES) institutions.

Categories of tools implemented in HPC.

Models of Cloud
Service models define the type of service the cloud offers. Cloud-based services in bioinformatics can be classified into Data as a Service (DaaS), Software as a Service (SaaS), Platform as a Service (PaaS) and Infrastructure as a Service (IaaS).

Cloud computing comes in three basic flavors: software as a service (SaaS), which you consume; platform as a service (PaaS), which you build on; and infrastructure as a service (IaaS), to which you migrate.

Characteristics of a Cloud Environment
• Dynamic – one of the keys to cloud computing is on-demand provisioning
• Massively scalable – the service must react immediately to your needs
• Multi-tenant – cloud computing, by its nature, delivers shared services
• Rapid elasticity – you can go from 5 servers to 50, or from 50 servers to 5
• Self-service – as a user, you can use the service as you require
• Per-usage pricing model – you should only ever pay for the amount of service you consume
• IP-based architecture – cloud architectures are built on Internet Protocol networks
Features of Cloud Computing – 10 Major Characteristics of Cloud Computing
The "old" vs the "new" genome informatics ecosystem based on cloud computing

Users have two options: continue downloading data to local clusters as before, or use cloud computing – move the cluster to the data – and continue to work with the data via web pages in their accustomed way.
Cloud resources in Bioinformatics
Software as a Service (SaaS)
• Bioinformatics requires a large variety of software tools for different types of data analysis.
• A Software as a Service (SaaS) cloud delivers software services online and facilitates remote access to available bioinformatics software tools through the Internet.
• As a consequence, SaaS eliminates the need for local installation and eases software maintenance and updates, providing up-to-date cloud-based services for bioinformatics data analysis over the Web.
• Efforts have been made to develop cloud-scale tools, including:
  – sequence analysis (mapping, assembly and alignment)
  – gene expression analysis
  – homology detection (orthologs and paralogs)
  – peak callers for ChIP-seq data
  – genome annotation (structural and functional)
  – identification of epistatic interactions of Single Nucleotide Polymorphisms (SNPs)
  – various other cloud-based applications for NGS (Next-Generation Sequencing) data analysis
Cloud Computing Service Providers
• Cloud computing platforms have been emerging in the commercial sector, including the Amazon Elastic Compute Cloud (EC2), Rackspace Cloud and Flexiant, and in the public sector to support research, such as Magellan and DIAG.
• Cloud computing is an increasingly valuable tool for processing large datasets, and it is already used by the US federal government, pharmaceutical and Internet companies, as well as scientific labs and bioinformatics services.
Amazon Cloud Services

• Amazon offers a variety of bioinformatics-oriented virtual machine images:
  – images prepopulated by Galaxy
  – Bioconductor – a programming environment built on the R statistics package
  – GBrowse – genome browser
  – BioPerl
  – JCVI Cloud BioLinux – a collection of bioinformatics tools including the Celera Assembler
• Amazon also provides several large genomic datasets in its cloud:
  – a complete copy of GenBank (200 Gb)
  – 30x-coverage sequencing reads of a trio of individuals from the 1000 Genomes Project (700 Gb)
  – genome databases from Ensembl, including the annotated human genome and those of 50 other species
Progress in Cloud computing for Bioinformatics
• Machine-accessible interfaces for cloud computing in the life sciences have been developed based on HTTP-based Web service technologies, e.g.:
  – Simple Object Access Protocol (SOAP)
  – REpresentational State Transfer (REST) services
  – BioMoby
• They formalize how computers exchange messages, such as assignments, input data, computation results and the output of database searches.
• To address their drawbacks, the open-standard Extensible Messaging and Presence Protocol (XMPP) was devised, which is capable of asynchronous communication (i.e. results are sent back to the user automatically).
SOAP – Simple Object Access Protocol
• HTTP and XML provide an at-hand solution that allows programs running under different operating systems in a network to communicate with each other; this combination is called SOAP.

• SOAP specifies exactly how to encode an HTTP header and an XML file so that a program on one computer can call a program on another computer and pass along information.

• SOAP also specifies how the called program can return a response.

• Despite its frequent pairing with HTTP, SOAP supports other transport protocols as well.
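To make the message format concrete, here is a sketch of a SOAP 1.1 envelope built with Python's standard XML library. The envelope namespace is the real SOAP 1.1 one, but the remote procedure (`GetSequence`) and its parameter are hypothetical, invented only to show the structure:

```python
# Sketch of a SOAP message: an XML envelope whose body names the
# remote procedure and carries its arguments. "GetSequence" and
# "accession" are invented for illustration.
import xml.etree.ElementTree as ET

SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 namespace
ET.register_namespace("soap", SOAP_NS)

envelope = ET.Element(f"{{{SOAP_NS}}}Envelope")
body = ET.SubElement(envelope, f"{{{SOAP_NS}}}Body")
call = ET.SubElement(body, "GetSequence")          # hypothetical remote call
ET.SubElement(call, "accession").text = "NM_000059"  # hypothetical argument

message = ET.tostring(envelope, encoding="unicode")
print(message)
```

In a real deployment this XML would travel as the payload of an HTTP POST, and the server would return a similarly structured envelope containing the response.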
REST – REpresentational State Transfer
• REST is a style of software architecture.
• In a REST-based application/architecture, state and functionality are divided into distributed resources.
• Every resource is uniquely addressable using a uniform and minimal set of commands (typically the HTTP commands GET, POST, PUT or DELETE over the Internet).
• The protocol is client/server, stateless, layered, and supports caching.
• This is essentially the architecture of the Internet, which explains the popularity and ease of use of REST.
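The uniform interface over uniquely addressable resources can be sketched in a few lines. This toy dispatcher is an assumption-laden illustration (no real HTTP, and the `/genes/...` paths are invented), showing only how the same small verb set applies to every resource:

```python
# Sketch of REST's uniform interface: a handful of verbs applied
# uniformly to uniquely addressed resources. The store is an
# in-memory dict; paths and payloads are illustrative.
store = {}

def handle(method, path, body=None):
    if method == "PUT":          # create or replace the resource
        store[path] = body
        return 200, body
    if method == "GET":          # retrieve the resource
        return (200, store[path]) if path in store else (404, None)
    if method == "DELETE":       # remove the resource
        return 200, store.pop(path, None)
    return 405, None             # method not allowed

handle("PUT", "/genes/BRCA2", {"symbol": "BRCA2"})
print(handle("GET", "/genes/BRCA2"))     # → (200, {'symbol': 'BRCA2'})
handle("DELETE", "/genes/BRCA2")
print(handle("GET", "/genes/BRCA2"))     # → (404, None)
```

Because the interface is uniform and the server keeps no per-client session state, any client that knows the resource's address can interact with it; this statelessness is what makes REST services easy to cache and scale.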
BioMOBY: an open source biological web services proposal
• BioMOBY is an open source research project which defines an architecture for the discovery and distribution of biological data through web services.
• Data and services are decentralised, but the availability of these resources, and the instructions for interacting with them, are registered in a central location called MOBY Central.
• BioMOBY adds to the web services paradigm, as exemplified by Universal Description, Discovery and Integration (UDDI), by having an object-driven registry query system with object and service ontologies.
• This allows users to navigate extensive and diverse data sets, where each possible next step is presented based on the data object currently in hand.
• Moreover, a path from the current data object to a desired final data object can be discovered automatically using the registry.
• Native BioMOBY objects are lightweight XML, and make up both the query and the response of a SOAP transaction.
The databases: then
• Traditional RDBMSs (relational database management systems) have been the de facto standard for database management throughout the age of the internet. The architecture behind an RDBMS is such that data is organized in a highly structured manner, following the relational model. It is not a scalable solution to meet the needs of 'big' data.
• NoSQL (commonly read as "Not Only SQL") represents a completely different family of databases that allows for high-performance, agile processing of information at massive scale. It has been the solution for handling some of the biggest data warehouses on the planet – for the likes of Google, Amazon and the CIA.
And now… What is Hadoop?
• Hadoop is not a type of database, but rather a software ecosystem that allows for massively parallel computing. It is an enabler of certain types of NoSQL distributed databases (such as HBase), which can allow data to be spread across thousands of servers with little reduction in performance.
• Market forecasts have predicted that the global Hadoop market will grow to about US$700 billion by 2022.
Contd.
• A staple of the Hadoop ecosystem is MapReduce, a computational model that takes intensive data processes and spreads the computation across a potentially endless number of servers (generally referred to as a Hadoop cluster).
• It has been a game-changer in supporting the enormous processing needs of big data: a large data procedure that might take 20 hours of processing time on a centralized relational database system may take only 3 minutes when distributed across a large Hadoop cluster of commodity servers, all processing in parallel.
MapReduce
• MapReduce is a software framework introduced by Google to support distributed computing on large data sets on computer clusters.
• Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key.
• MapReduce runs on a large cluster of commodity machines and is highly scalable. Implementations exist in multiple programming languages, such as Java, C# and C++.
Steps of MapReduce
• Map: A function called "Map" allows different points of the distributed cluster to distribute their work. The master node takes the input, chops it up into smaller sub-problems, and distributes those to worker nodes. Each worker node processes its smaller problem and passes the answer back to its master node.
• Reduce: A function called "Reduce" is designed to reduce the final form of the clusters' results into one output. The master node takes the answers to all the sub-problems and combines them to produce the output – the answer to the problem it was originally trying to solve.
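The steps above can be sketched with the classic word-count example, written here in plain single-process Python purely to show the map → group-by-key → reduce flow (a real Hadoop job would run each phase on many machines):

```python
# MapReduce word count, sketched in plain Python to illustrate the
# map -> shuffle/group -> reduce steps. Input chunks stand in for
# the input splits the master would hand to worker nodes.
from collections import defaultdict

def map_phase(chunk):
    # "Map": each worker emits intermediate (key, value) pairs.
    return [(word, 1) for word in chunk.split()]

def shuffle(pairs):
    # Group all intermediate values by key before reducing.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    # "Reduce": merge all values associated with the same key.
    return {key: sum(values) for key, values in grouped.items()}

chunks = ["big data big", "data big"]  # toy input splits
intermediate = [pair for chunk in chunks for pair in map_phase(chunk)]
counts = reduce_phase(shuffle(intermediate))
print(counts)  # → {'big': 3, 'data': 2}
```

Because each map call touches only its own chunk and each reduce call touches only one key, both phases parallelize freely across worker nodes, which is exactly what makes the model scale.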
MapReduce Programming Model – ideal for data-intensive parallel applications
• Fault tolerance
• Moving computation to the data
• Scalability
DECENTRALIZED MAPREDUCE ARCHITECTURE ON CLOUD SERVICES

Cloud Queues are used for scheduling, Tables store meta-data and monitoring data, and Blobs hold input/output/intermediate data.
Applications of MapReduce
• MapReduce is generally used in distributed grep, distributed sort, web link-graph reversal, web access log statistics, document clustering, machine learning and statistical machine translation.
• MapReduce algorithms using the cloud-ready framework Hadoop are available for bioinformatics:
  – sequence alignment
  – short read mapping
  – SNP identification
  – RNA expression analysis
The pros and cons of Cloud Computing
• It poses problems for developers and users of cloud software:
  – it requires large data transfers over precious low bandwidth
  – it raises new privacy and security issues
  – it is an inefficient solution for some types of bioinformatics problems
• However:
  – it is an increasingly valuable tool for processing large datasets
  – it is already used by the US federal government, pharmaceutical and Internet companies, as well as scientific labs and bioinformatics services
TECHNOLOGIES – CloVR
• Cloud Virtual Resource (CloVR) is a new desktop application for push-button automated sequence analysis that can utilize cloud computing resources.
• It is implemented as a single portable virtual machine (VM) that provides several automated analysis pipelines for microbial genomics, including 16S, whole-genome and metagenome sequence analysis. A virtual machine is a piece of software running on the host computer that emulates the properties of a computer.
• The CloVR VM runs on a personal computer, utilizes local computer resources and requires minimal installation, addressing key challenges in deploying bioinformatics workflows.
• In addition, it supports the use of remote cloud computing resources to improve performance for large-scale sequence processing.
ISSUES IN BIOINFORMATICS BIG DATA

Big Data generation and acquisition give rise to profound challenges for the storage, transfer and security of the information.

1. Storage: companies need big data storage space to hold their data without limits. Computational time also needs to decrease as the data grows, for faster processing and efficient results.

2. Transfer: another challenge is moving data from one location to another, mainly done either with external hard disks or by mail. Transferring and accessing this data is time consuming, cuts into processing time, and may reduce work efficiency. Big data has to be processed and computed simultaneously so that faster outputs can be shared and used from any location the user wants.

3. Security: the security and privacy of the data are also a concern. In bioinformatics, data must be kept secure whether it sits in a storage database or is being transferred via external hard disks or email. The data has to be free from threats, and data integrity has to be maintained.
Conclusion
• Cloud computing is the next big thing in Big Data analytics.
• With its application-sharing and cost-effective properties, it is useful for all and should be made accessible.
• It is an attractive technology at this critical juncture of genomic data storage and analysis.
• Cloud computing is, in essence, a blind man's stick for bioinformatics research: a promising technology that provides storage and access to data.
• The scalability of the cloud reduces traffic, and cloud cryptography is a way to ensure security.
• To harness the cloud in the most beneficial way, one needs to rely on it completely and use it in an uninterrupted way. To achieve this, commands must first be optimized in a proper channel in order to avoid termination and recreation of cloud instances.
• Cloud computing came as a ray of hope for researchers and database organizers. This approach condenses the resources, data and tools into the cloud; users can access the data virtually and work in the cloud itself without downloading and maintaining a local copy on a personal system.
• Researchers don't need to buy expensive resources to carry on their daily research.
Thank You!
