Big Data and BDA

🞂 Big data is the term for a collection of data sets so large and complex that it becomes difficult to process them using on-hand database management tools or traditional data processing applications.
🞂 The challenges include capture, curation, storage, search, sharing, transfer, analysis, and visualization.
🞂 The trend toward larger data sets is due to the additional information derivable from analysis of a single large set of related data, as compared to separate smaller sets with the same total amount of data, allowing correlations to be found to "spot business trends, determine quality of research, prevent diseases, link legal citations, combat crime, and determine real-time roadway traffic conditions."

 Massive sets of unstructured/semi-structured data
from Web traffic, social media, sensors, etc.
 Information from multiple internal and external
sources:
• Transactions
• Social media
• Enterprise content
• Sensors
• Mobile devices
 In the last minute there were…
• 204 million emails sent
• 61,000 hours of music listened to on Pandora
• 20 million photo views
• 100,000 tweets
• 6 million views and 277,000 Facebook logins
• 2+ million Google searches
• 3 million uploads on Flickr
🞂 Lots of data is being collected and warehoused:
◦ Web data, e-commerce
◦ Purchases at department/grocery stores
◦ Bank/credit card transactions
◦ Social networks
🞂 Data Volume
◦ 44x increase from 2009 to 2020
◦ From 0.8 zettabytes to 35 ZB
🞂 Data volume is increasing exponentially: an exponential increase in collected/generated data
• 12+ TB of tweet data every day
• 25+ TB of log data every day
• ? TB of data every day
• 30 billion RFID tags today (1.3B in 2005)
• 4.6 billion camera phones worldwide
• 100s of millions of GPS-enabled devices sold annually
• 76 million smart meters in 2009… 300M by 2020
• 5+ billion people on the Web by end 2020
🞂 Relational Data (Tables/Transaction/Legacy Data)
🞂 Text Data (Web)
🞂 Semi-structured Data (XML)
🞂 Graph Data
◦ Social Network, Semantic Web (RDF), …
🞂 Streaming Data
◦ You can only scan the data once
🞂 A single application can be generating/collecting many types of data
🞂 Big Public Data (online, weather, finance, etc.)

To extract knowledge, all these types of data need to be linked together.
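Since streaming data can only be scanned once, any summary must be computed in a single pass over the stream. A minimal sketch of this idea (the function name and the sample sensor readings are illustrative, not from any particular library):

```python
# One-pass (streaming) statistics: because the stream can only be scanned
# once, we keep a running count, sum, and min/max instead of storing the data.
def stream_summary(stream):
    count, total = 0, 0.0
    lo = hi = None
    for x in stream:  # each element is seen exactly once
        count += 1
        total += x
        lo = x if lo is None else min(lo, x)
        hi = x if hi is None else max(hi, x)
    return {"count": count, "mean": total / count, "min": lo, "max": hi}

if __name__ == "__main__":
    readings = iter([3.0, 1.0, 4.0, 1.0, 5.0])  # a stand-in for a sensor feed
    print(stream_summary(readings))
```

Once the iterator is exhausted, the raw values are gone; only the summary survives, which is exactly the constraint streaming systems work under.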
A Single View to the Customer, combining data from:
• Social Media
• Banking & Finance
• Our Known History
• Gaming
• Entertainment
• Purchases
🞂 Data is being generated fast and needs to be processed fast
🞂 Online Data Analytics
🞂 Late decisions  missing opportunities
🞂 Examples
◦ E-Promotions: Based on your current location, your purchase history, and what you like  send promotions right now for the store next to you
◦ Healthcare monitoring: sensors monitoring your activities and body  any abnormal measurements require immediate reaction
Mobile devices (tracking all objects all the time)
Social media and networks (all of us are generating data)
Scientific instruments (collecting all sorts of data)
Sensor technology and networks (measuring all kinds of data)
🞂 Progress and innovation are no longer hindered by the ability to collect data
🞂 But by the ability to manage, analyze, summarize, visualize, and discover knowledge from the collected data in a timely manner and in a scalable fashion
🞂 The Model of Generating/Consuming Data has Changed

Old Model: Few companies are generating data; all others are consuming data.

New Model: All of us are generating data, and all of us are consuming data.
Big Data analytics:
- Optimizations and predictive analytics
- Complex statistical analysis
- All types of data, and many sources
- Very large datasets
- More real-time

Traditional BI:
- Ad-hoc querying and reporting
- Data mining techniques
- Structured data, typical sources
- Small to mid-size datasets
🞂 Let us take the analogy of a restaurant to understand the problems associated with Big Data and how Hadoop solved them.

🞂 Bob is a businessman who has opened a small restaurant. Initially, he used to receive two orders per hour, and he had one chef with one food shelf, which was sufficient to handle all the orders.
🞂 Now let us compare the restaurant example with the traditional scenario, where data was generated at a steady rate and traditional systems like RDBMS were capable enough to handle it, just like Bob's chef. Here, you can relate the data storage to the restaurant's food shelf and the traditional processing unit to the chef, as shown in the figure above.
◦ Hadoop is an open source software product (or, more accurately, a software library framework) that is collaboratively produced and freely distributed by the Apache Foundation. Effectively, it is a developer's toolkit designed to simplify the building of Big Data solutions.
◦ Hadoop is used by companies with very large volumes of data to process. Among them are web giants such as Facebook, Twitter, LinkedIn, eBay, and Amazon.
◦ Hadoop is a distributed data processing and management system.
◦ It contains many components, including HDFS, YARN, and MapReduce.

There are three components of Hadoop:
🞂 Hadoop HDFS - The Hadoop Distributed File System (HDFS) is the storage unit of Hadoop.
🞂 Hadoop MapReduce - Hadoop MapReduce is the processing unit of Hadoop.
🞂 Hadoop YARN - Hadoop YARN is the resource management unit of Hadoop.
🞂 Data is stored in a distributed manner in HDFS.
🞂 There are two components of HDFS – the name node and the data node. While there is only one name node, there can be multiple data nodes.
🞂 Master and slave nodes form the HDFS cluster. The name node is called the master, and the data nodes are called the slaves.
🞂 The name node is responsible for the workings of the data nodes. It also stores the metadata.
🞂 The data nodes read, write, process, and replicate the data. They also send signals, known as heartbeats, to the name node. These heartbeats show the status of the data node.
🞂 Consider that 30 TB of data is loaded into the name node. The name node distributes it across the data nodes, and this data is replicated among the data nodes. We can see in the image above that the blue, grey, and red data are replicated among the three data nodes.
🞂 Replication of the data is performed three times by default. It is done this way so that if a commodity machine fails, you can replace it with a new machine that has the same data.
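The name node's bookkeeping for replica placement can be sketched in a few lines. This is an illustrative simulation, not the actual HDFS placement policy; the function and node names are made up:

```python
# Hypothetical sketch of HDFS-style block replication (replication factor 3):
# the "name node" records which "data nodes" hold each block; the block data
# itself would live on the data nodes.
from itertools import cycle

def place_blocks(blocks, data_nodes, replication=3):
    placement = {}                      # block -> list of data nodes (metadata)
    rotation = cycle(range(len(data_nodes)))
    for block in blocks:
        start = next(rotation)
        # pick `replication` distinct nodes, wrapping around the cluster
        placement[block] = [data_nodes[(start + i) % len(data_nodes)]
                            for i in range(replication)]
    return placement

if __name__ == "__main__":
    nodes = ["dn1", "dn2", "dn3", "dn4"]
    print(place_blocks(["blue", "grey", "red"], nodes))
```

Because each block lands on three distinct nodes, losing any single machine still leaves two live copies of every block, which is the point of the default replication factor.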
🞂 MapReduce is the processing unit of Hadoop. In the MapReduce approach, the processing is done at the slave nodes, and the final result is sent to the master node.
🞂 The code used to process the data is sent to the data, rather than the data being moved to the code. This code is usually very small in comparison to the data itself: you only need to send a few kilobytes of code to perform a heavy-duty process on the computers that hold the data.
• The input dataset is first split into chunks of data. In this example, the input has three lines of text - "bus car train," "ship ship train," "bus ship car." The dataset is split into three chunks, one per line, and processed in parallel.
• In the map phase, each word is emitted as a key with a value of 1, so every occurrence of bus, car, ship, and train becomes a (word, 1) pair.
• These key-value pairs are then shuffled and sorted together based on their keys. In the reduce phase, the aggregation takes place, and the final output is obtained.
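The three phases above can be sketched in plain Python. This is a single-process illustration of the logic only; real Hadoop MapReduce distributes each phase across data nodes:

```python
# A minimal pure-Python sketch of the map -> shuffle/sort -> reduce phases
# for the word-count example above.
from collections import defaultdict

def map_phase(chunks):
    # emit a (word, 1) pair for every word in every chunk
    return [(word, 1) for chunk in chunks for word in chunk.split()]

def shuffle_phase(pairs):
    # group values by key, as the shuffle/sort step does
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # aggregate each key's values into the final count
    return {key: sum(values) for key, values in groups.items()}

if __name__ == "__main__":
    chunks = ["bus car train", "ship ship train", "bus ship car"]
    print(reduce_phase(shuffle_phase(map_phase(chunks))))
    # bus: 2, car: 2, train: 2, ship: 3
```

Note that the reducer only ever sees values grouped under one key, which is why word count parallelizes so cleanly.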
🞂 It stands for Yet Another Resource Negotiator.
🞂 It is the resource management unit of Hadoop and is available as a component of Hadoop version 2.
🞂 Hadoop YARN acts like an OS for Hadoop. It is a resource management layer that sits on top of HDFS.
🞂 It is responsible for managing cluster resources to make sure you don't overload one machine.
🞂 It performs job scheduling to make sure that jobs are scheduled in the right place.
• Suppose a client machine wants to run a query or fetch some code for data analysis. This job request goes to the resource manager (Hadoop YARN), which is responsible for resource allocation and management.
• In the node section, each of the nodes has its own node manager. These node managers manage the nodes and monitor the resource usage in the node. The containers hold a collection of physical resources, which could be RAM, CPU, or hard drives. Whenever a job request comes in, the application master requests a container from the node manager. Once the node manager grants the resource, it reports back to the resource manager.
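The allocation flow above can be sketched as a toy simulation. The function, node names, and RAM figures are hypothetical, and real YARN schedulers are far more sophisticated; the sketch only shows why a node with insufficient free resources is skipped rather than overloaded:

```python
# Illustrative (not the real YARN API) simulation of container allocation:
# a resource manager asks node managers for a node with enough free RAM.
def allocate_container(nodes, ram_needed):
    """nodes: dict of node name -> free RAM in GB. Returns chosen node or None."""
    for name, free_ram in nodes.items():
        if free_ram >= ram_needed:     # node manager reports enough resources
            nodes[name] -= ram_needed  # container granted; capacity reserved
            return name
    return None                        # no node can fit the job: it must wait

if __name__ == "__main__":
    cluster = {"node1": 4, "node2": 16}
    print(allocate_container(cluster, 8))   # node2 has room
    print(allocate_container(cluster, 8))   # node2 again (8 GB left)
    print(allocate_container(cluster, 8))   # nothing fits now -> None
```

Returning None instead of over-committing a machine is the simulation's stand-in for YARN's job of making sure you don't overload one node.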
🞂 Big Data requires tools and methods that can be applied to analyze and extract patterns from large-scale data.
🞂 Big Data Analytics refers to the process of collecting, organizing, and analyzing large data sets to discover patterns and other useful information.
🞂 Big data analytics is a set of technologies and techniques that require new forms of integration to disclose large hidden values from large datasets that are different from the usual ones, more complex, and of an enormous scale.
🞂 It mainly focuses on solving new problems, or old problems in better and more effective ways.
🞂 Big data is more real-time in nature than traditional DW applications
🞂 Traditional DW architectures (e.g. Exadata, Teradata) are not well-suited for big data apps
🞂 Shared-nothing, massively parallel processing, scale-out architectures are well-suited for big data apps
Traditional Analytics (BI) vs. Big Data Analytics

Focus on:
• Traditional: descriptive analytics, diagnostic analytics
• Big Data: predictive analytics, data science

Data sets:
• Traditional: limited data sets, cleansed data, simple models
• Big Data: large-scale data sets, more types of data, raw data, complex data models

Supports:
• Traditional: causation - what happened, and why
• Big Data: correlation - new insight
1. Descriptive Analytics:
It consists of asking the question: What is
happening? It is a preliminary stage of data
processing that creates a summary of historical data. Data
mining methods organize data and help uncover
patterns that offer insight. Descriptive analytics
summarizes past data to give a clear picture of
what has already happened.
2. Diagnostic Analytics:
It consists of asking the question: Why did it
happen? Diagnostic analytics looks for the root
cause of a problem. It is used to determine why
something happened. This type attempts to find
and understand the causes of events and behaviors.
3. Predictive Analytics:
It consists of asking the question: What is likely
to happen? It uses past data in order to predict
the future. It is all about forecasting. Predictive
analytics uses many techniques like data mining and
artificial intelligence to analyze current data and make
scenarios of what might happen.

4. Prescriptive Analytics:
It consists of asking the question: What should be
done? It is dedicated to finding the right action to be
taken. Descriptive analytics provides historical data,
and predictive analytics helps forecast what might
happen. Prescriptive analytics uses these inputs to
find the best course of action.
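The contrast between the first types can be made concrete with a toy example. The numbers and the naive linear-trend forecast below are purely illustrative, not a real analytics method:

```python
# Toy contrast: descriptive analytics summarizes what happened;
# predictive analytics extrapolates what is likely to happen next.
def descriptive(sales):
    # What happened? Summarize the historical data.
    return {"total": sum(sales), "average": sum(sales) / len(sales)}

def predictive(sales):
    # What is likely to happen? Naive linear-trend forecast of the next value.
    step = (sales[-1] - sales[0]) / (len(sales) - 1)  # average change per period
    return sales[-1] + step

if __name__ == "__main__":
    monthly_sales = [100, 110, 120, 130]       # hypothetical monthly sales
    print(descriptive(monthly_sales))          # total 460, average 115.0
    print(predictive(monthly_sales))           # forecasts 140.0
```

Prescriptive analytics would then sit on top of such a forecast, comparing candidate actions against it to recommend the best one.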
DESCRIPTIVE ANALYTICS
• Descriptive analytics, such as reporting/OLAP,
dashboards, and data visualization, have been widely used
for some time.
• They are the core of traditional BI.

What has occurred?


Descriptive analytics, such as data
visualization, is important in helping
users interpret the output from
predictive and prescriptive analytics.
PREDICTIVE ANALYTICS
• Algorithms for predictive analytics, such as regression analysis,
machine learning, and neural networks, have also been around
for some time.

What will occur?

• Marketing is the target for many predictive analytics applications.


PRESCRIPTIVE ANALYTICS
• Prescriptive analytics is often referred to as advanced
analytics.
• Often for the allocation of scarce resources
• Optimization

What should occur?


Prescriptive analytics can benefit healthcare strategic planning. By leveraging operational and usage
data combined with external factors such as economic data, population demographic trends, and
population health trends, planners can more accurately plan future capital investments, such as new
facilities and equipment utilization, and understand the trade-offs between adding beds, expanding
an existing facility, or building a new one.
CHALLENGES FACED IN BIG DATA ANALYSIS
 Need For Synchronization Across
Disparate Data Sources
 Acute Shortage of Professionals
Who Understand Big Data Analysis
 Getting Meaningful Insights Through The Use
of Big Data Analytics
 Getting Voluminous Data Into The Big
Data Platform
 Uncertainty Of Data Management Landscape.
 Data Storage And Quality
 Security And Privacy Of Data
DATA ANALYTICS TOOLS FOR BIG DATA ANALYSIS

Apache Spark is one of the most powerful open source big data analytics tools.
It offers over 80 high-level operators that make it easy to build parallel
apps, and it is used at a wide range of organizations to process large datasets.
Features:
• It can run applications in a Hadoop cluster up to 100 times faster
in memory, and ten times faster on disk.
• It offers lightning-fast processing.
• Support for sophisticated analytics.
• Ability to integrate with Hadoop and existing Hadoop data.
• It provides built-in APIs in Java, Scala, and Python.
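Spark programs chain high-level operators such as flatMap, map, and reduceByKey. The sketch below mimics that operator style in plain Python so it runs without a cluster; the MiniRDD class is a made-up stand-in for illustration, not the pyspark API:

```python
# Illustrative stand-in for Spark's RDD operator style (NOT the pyspark API):
# a tiny class chaining flatMap / map / reduceByKey the way Spark programs do.
class MiniRDD:
    def __init__(self, data):
        self.data = list(data)

    def flatMap(self, f):
        # apply f to each element and flatten the results into one dataset
        return MiniRDD(x for item in self.data for x in f(item))

    def map(self, f):
        # apply f to each element one-to-one
        return MiniRDD(f(item) for item in self.data)

    def reduceByKey(self, f):
        # merge the values of each key with the combining function f
        acc = {}
        for key, value in self.data:
            acc[key] = f(acc[key], value) if key in acc else value
        return MiniRDD(acc.items())

    def collect(self):
        return list(self.data)

if __name__ == "__main__":
    lines = MiniRDD(["bus car train", "bus ship car"])
    counts = (lines.flatMap(str.split)
                   .map(lambda w: (w, 1))
                   .reduceByKey(lambda a, b: a + b))
    print(dict(counts.collect()))  # bus 2, car 2, train 1, ship 1
```

In real Spark the same chain of operators would be distributed across a cluster and evaluated lazily, which is where the speed-ups claimed above come from.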
Plotly is one of the big data analysis tools that lets
users create charts and dashboards to share online.
Features:
• Easily turn any data into eye-catching
and informative graphics.
• It provides audited industries with fine-
grained information on data provenance.
• Plotly offers unlimited public file hosting
through its free community plan.
Azure HDInsight is a Spark and Hadoop service in the cloud. It provides
big data cloud offerings in two categories, Standard and Premium. It
provides an enterprise-scale cluster for the organization to run their big
data workloads.
Features:
• Reliable analytics with an industry-leading SLA.
• It offers enterprise-grade security and monitoring.
• Protect data assets and extend on-premises security and governance
controls to the cloud.
• High-productivity platform for developers and scientists.
• Integration with leading productivity applications.
• Deploy Hadoop in the cloud without purchasing new hardware or
paying other up-front costs.
Skytree is one of the best big data analytics tools; it empowers data
scientists to build more accurate models faster. It offers accurate
predictive machine learning models that are easy to use.
Features:
• Highly scalable algorithms.
• Artificial intelligence for data scientists.
• It allows data scientists to visualize and understand the logic behind
ML decisions.
• Accessible via the easy-to-adopt GUI or programmatically in Java.
• Model interpretability.
• It is designed to solve robust predictive problems with data
preparation capabilities.
Talend is a big data analytics software that simplifies and
automates big data integration. Its graphical wizard generates
native code. It also allows big data integration, master data
management and checks data quality.
Features:
• Simplify ETL & ELT for big data.
• Talend Big Data Platform simplifies using MapReduce and Spark
by generating native code.
• Smarter data quality with machine learning and
natural language processing.
• Agile DevOps to speed up big data projects.
• Streamline all the DevOps processes.
🞂​ Big data refers to the set of numerical data
produced by the use of new technologies for
personal or professional purposes.
🞂​ Big Data analytics is the process of
examining these data in order to uncover
hidden patterns, market trends, customer
preferences and other useful information in
order to make the right decisions.
🞂 ​ Big Data Analytics is a fast growing technology. It
has been adopted by the most unexpected
industries and became an industry on its own.
🞂 ​ But analysis of these data in the framework of the
Big Data is a process that seems sometimes quite
intrusive.
🞂 Analytics is a data science discipline.
🞂 BI takes care of the decision-making part, while data analytics is the process of asking questions.
🞂 Analytics tools are used when a company needs to do forecasting and wants to know what will happen in the future, while BI tools help to transform those forecasts into common language.
🞂 More often, Big Data is considered the successor to Business Intelligence.
Thank you
