Big Data and BDA
Massive sets of unstructured/semi-structured data
from Web traffic, social media, sensors, etc.
Information from multiple internal and external
sources:
• Transactions
• Social media
• Enterprise content
• Sensors
• Mobile devices
In the last minute there were…
• 204 million emails sent
• 61,000 hours of music listened to on Pandora
• 20 million photo views
• 100,000 tweets
• 6 million views and 277,000 Facebook logins
• 2+ million Google searches
• 3 million uploads on Flickr
🞂 Lots of data is being collected and warehoused:
◦ Web data, e-commerce
◦ Purchases at department/grocery stores
◦ Bank/credit card transactions
◦ Social networks
🞂 Data Volume
◦ 44x increase from 2009 to 2020
◦ From 0.8 zettabytes to 35 ZB
🞂 Data volume is increasing exponentially
[Figure: everyone is generating data at scale -
• 12+ TBs of tweet data every day
• 4.6 billion camera phones worldwide
• 30 billion RFID tags today (1.3B in 2005)
• 100s of millions of GPS-enabled devices sold annually
• 25+ TBs of log data every day
• ? TBs of data every day
• 5+ billion people on the Web by end 2020
• 76 million smart meters in 2009… 300M by 2020]
🞂 Relational Data (Tables/Transaction/Legacy
Data)
🞂 Text Data (Web)
🞂 Semi-structured Data (XML)
🞂 Graph Data
◦ Social Network, Semantic Web (RDF), …
🞂 Streaming Data
◦ You can only scan the data once
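To make the one-pass constraint concrete, here is a minimal Python sketch (an illustration, not from the slides): aggregates over a stream must be maintained incrementally, because the stream cannot be rewound and scanned a second time.

```python
# One-pass processing: the stream is consumed as it arrives, so
# aggregates must be updated incrementally rather than recomputed.
def stream_stats(stream):
    count, total = 0, 0.0
    for value in stream:   # each value is seen exactly once
        count += 1
        total += value
    mean = total / count if count else 0.0
    return count, mean

# A generator stands in for a live feed (e.g., sensor readings);
# once consumed, it cannot be replayed.
readings = (x * 0.5 for x in range(1, 1001))
n, mean = stream_stats(readings)
print(n, mean)  # 1000 250.25
```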
[Figure: our known history is captured everywhere - social media, banking, finance, gaming, entertainment, purchases]
🞂 Data is being generated fast and needs to be processed fast
🞂 Online Data Analytics
🞂 Late decisions mean missing opportunities
🞂 Examples
◦ E-Promotions: Based on your current location, your purchase history, and what you like, send promotions right now for the store next to you
Mobile devices
(tracking all objects all the time)
Old Model: Few companies are generating data, all others are consuming data
New Model: All of us are generating data, and all of us are consuming data
- Optimizations and predictive analytics
- Complex statistical analysis
- All types of data, and many sources
- Very large datasets
- More real-time in nature
🞂 Let us use the analogy of a restaurant to understand the problems associated with Big Data and how Hadoop solved them.
There are three components of Hadoop:
🞂 Hadoop HDFS - the Hadoop Distributed File System (HDFS) is the storage unit of Hadoop.
🞂 Hadoop MapReduce - Hadoop MapReduce is the processing unit of Hadoop.
🞂 Hadoop YARN - Hadoop YARN is the resource management unit of Hadoop.
🞂 Data is stored in a distributed manner
in HDFS.
🞂 There are two components of HDFS – name
node and data node. While there is only one
name node, there can be multiple data nodes.
🞂 Master and slave nodes form the
HDFS cluster. The name node is called the
master, and the data nodes are called the
slaves.
🞂 The name node is responsible for
the workings of the data nodes. It also
stores the metadata.
🞂 The data nodes read, write, process, and replicate the data. They also send signals, known as heartbeats, to the name node. These heartbeats show the status of the data node.
🞂 Consider that 30TB of data is loaded into the name node. The name node distributes it across the data nodes, and this data is replicated among the data nodes. As the slide's diagram shows, the blue, grey, and red data blocks are replicated among the three data nodes.
🞂 Replication of the data is performed three times
by default. It is done this way, so if a commodity
machine fails, you can replace it with a new
machine that has the same data.
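As a toy illustration of this placement idea (the names and the round-robin policy below are invented for the sketch; this is not the real HDFS placement code), the following Python snippet mimics the name node's metadata, mapping each block to three distinct data nodes:

```python
import itertools

# Toy model of HDFS block placement: the name node keeps metadata
# mapping each block to the data nodes holding its replicas, and no
# two replicas of a block land on the same data node.
REPLICATION = 3                      # HDFS default replication factor
DATA_NODES = ["dn1", "dn2", "dn3", "dn4"]

def place_blocks(blocks, nodes, replication=REPLICATION):
    rotation = itertools.cycle(range(len(nodes)))
    metadata = {}
    for block in blocks:
        start = next(rotation)
        # choose `replication` distinct nodes, wrapping around the cluster
        metadata[block] = [nodes[(start + i) % len(nodes)]
                           for i in range(replication)]
    return metadata

# A 30 TB file split into 128 MB blocks would yield roughly 245,000
# blocks; placing just three shows the shape of the metadata.
print(place_blocks(["blk_1", "blk_2", "blk_3"], DATA_NODES))
# {'blk_1': ['dn1', 'dn2', 'dn3'], 'blk_2': ['dn2', 'dn3', 'dn4'],
#  'blk_3': ['dn3', 'dn4', 'dn1']}
```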
🞂 It is the processing unit of Hadoop. In the MapReduce approach, the processing is done at the slave nodes, and the final result is sent to the master node.
🞂 Instead of moving the data, a small piece of code is sent to process the entire dataset. This code is usually very small in comparison to the data itself: you only need to send a few kilobytes' worth of code to perform a heavy-duty process on computers.
• The input dataset is first split into chunks of data. In this example, the input has three lines of text - “bus car train,” “ship ship train,” “bus ship car.” The dataset is split into three chunks, one per line, and the chunks are processed in parallel.
• In the map phase, the data is assigned a key and a value of 1. In this
case, we have one bus, one car, one ship, and one train.
• These key-value pairs are then shuffled and sorted together based
on their keys. At the reduce phase, the aggregation takes place, and
the final output is obtained.
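The same flow can be sketched in a few lines of plain Python (an in-memory illustration, not actual Hadoop code), using the exact three splits from the example:

```python
from collections import defaultdict

# The three input splits from the example above.
splits = ["bus car train", "ship ship train", "bus ship car"]

# Map phase: emit a (word, 1) pair for every word in a split.
def map_phase(split):
    return [(word, 1) for word in split.split()]

mapped = [pair for split in splits for pair in map_phase(split)]

# Shuffle/sort phase: group the emitted pairs by key.
groups = defaultdict(list)
for word, one in mapped:
    groups[word].append(one)

# Reduce phase: aggregate each group into the final count.
counts = {word: sum(ones) for word, ones in sorted(groups.items())}
print(counts)  # {'bus': 2, 'car': 2, 'ship': 3, 'train': 2}
```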
🞂 It stands for Yet Another Resource Negotiator.
🞂 It is the resource management unit of Hadoop
and is available as a component of Hadoop
version 2.
🞂 Hadoop YARN acts like an OS to Hadoop. It is not a file system itself; it sits on top of HDFS and manages how jobs use the cluster.
🞂 It is responsible for managing cluster resources
to make sure you don't overload one machine.
🞂 It performs job scheduling to make sure that the
jobs are scheduled in the right place.
• Suppose a client machine wants to do a query or fetch some code for data analysis. This job request goes to the resource manager (Hadoop YARN), which is responsible for resource allocation and management.
• In the node section, each of the nodes has its node
managers. These node managers manage the nodes and
monitor the resource usage in the node. The containers
contain a collection of physical resources, which could be
RAM, CPU, or hard drives. Whenever a job request comes in,
the app master requests the container from the node
manager. Once the node manager gets the resource, it goes
back to the Resource Manager.
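A toy Python model of this request flow (class and method names are invented for illustration; this is not the Hadoop YARN API) shows why per-node bookkeeping prevents any one machine from being overloaded:

```python
# Toy model of the YARN flow: a resource manager hands out containers
# by asking node managers, each of which tracks its own free resources.
class NodeManager:
    def __init__(self, name, ram_gb):
        self.name, self.free_ram = name, ram_gb

    def launch_container(self, ram_gb):
        # Grant a container only if this node has the resources.
        if self.free_ram >= ram_gb:
            self.free_ram -= ram_gb
            return f"container on {self.name} ({ram_gb} GB)"
        return None

class ResourceManager:
    def __init__(self, node_managers):
        self.node_managers = node_managers

    def submit_job(self, ram_gb):
        # Try each node in turn, skipping nodes without free resources.
        for nm in self.node_managers:
            container = nm.launch_container(ram_gb)
            if container:
                return container
        return "job queued: no node has enough free resources"

# Two 8 GB nodes and three 6 GB job requests: the first two land on
# different nodes, and the third is queued.
rm = ResourceManager([NodeManager("node1", 8), NodeManager("node2", 8)])
for _ in range(3):
    print(rm.submit_job(ram_gb=6))
```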
🞂 Big Data requires tools and methods that can be
applied to analyze and extract patterns from
large-scale data.
🞂 Big Data Analytics refers to the process of collecting, organizing, and analyzing large data sets to discover patterns and other useful information.
🞂 Big data analytics is a set of technologies and techniques that require new forms of integration to disclose large hidden values from datasets that are different from the usual ones: more complex, and of an enormous scale.
🞂 It mainly focuses on solving new problems, or old problems in better and more effective ways.
🞂 Big data is more real-time
in nature than traditional
DW applications
🞂 Traditional DW architectures (e.g., Exadata, Teradata) are not well-suited for big data apps
🞂 Shared-nothing, massively parallel processing, scale-out architectures are well-suited for big data apps
Traditional Analytics (BI) vs. Big Data Analytics:
🞂 Focus: descriptive and diagnostic analytics (BI) vs. predictive analytics and data science (Big Data)
🞂 Data sets: limited, cleansed data sets and simple models (BI) vs. large-scale data sets, more types of data, raw data, and complex data models (Big Data)
🞂 Supports: causation - what happened, and why (BI) vs. correlation - new insight (Big Data)
1. Descriptive Analytics:
It consists of asking the question: What is happening? It is a preliminary stage of data processing that creates a summary of historical data. Data mining methods organize the data and help uncover patterns that offer insight. Descriptive analytics summarizes past data to give a clear picture of what has already happened.
2. Diagnostic Analytics:
It consists of asking the question: Why did it
happen? Diagnostic analytics looks for the root
cause of a problem. It is used to determine why
something happened. This type attempts to find
and understand the causes of events and behaviors.
3. Predictive Analytics:
It consists of asking the question: What is likely
to happen? It uses past data in order to predict
the future. It is all about forecasting. Predictive
analytics uses many techniques like data mining and
artificial intelligence to analyze current data and make
scenarios of what might happen.
4. Prescriptive Analytics:
It consists of asking the question: What should be
done? It is dedicated to finding the right action to be
taken. Descriptive analytics provides historical data, and predictive analytics helps forecast what might happen. Prescriptive analytics uses these inputs to find the best solution.
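As a minimal illustration of the contrast between the first and third types, this Python sketch (with made-up sales numbers) computes a descriptive summary of past data, then fits a simple linear trend to forecast the next value:

```python
import numpy as np

# Hypothetical monthly sales figures for six months (toy data).
months = np.arange(1, 7)
sales = np.array([100, 110, 125, 130, 145, 155.0])

# Descriptive analytics: what happened? Summarize the historical data.
print("mean:", sales.mean(), "min:", sales.min(), "max:", sales.max())

# Predictive analytics: what is likely to happen? Fit a linear trend
# to the past data and extrapolate one month ahead.
slope, intercept = np.polyfit(months, sales, deg=1)
print("forecast for month 7:", slope * 7 + intercept)
```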
DESCRIPTIVE ANALYTICS
• Descriptive analytics, such as reporting/OLAP,
dashboards, and data visualization, have been widely used
for some time.
• They are the core of traditional BI.
Apache Spark is one of the most powerful open source big data analytics tools. It offers over 80 high-level operators that make it easy to build parallel apps, and it is used at a wide range of organizations to process large datasets.
Features:
• It can run an application in a Hadoop cluster up to 100 times faster in memory, and ten times faster on disk.
• It offers lightning-fast processing.
• Support for sophisticated analytics.
• Ability to integrate with Hadoop and existing Hadoop data.
• It provides built-in APIs in Java, Scala, or Python.
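As a small sketch of those high-level operators, here is the classic word count in PySpark; it assumes a local Spark installation, and "input.txt" is a placeholder path for your own data:

```python
from pyspark.sql import SparkSession

# Start (or reuse) a local Spark session.
spark = SparkSession.builder.appName("WordCount").getOrCreate()
sc = spark.sparkContext

# Read a text file, split lines into words, and count each word.
counts = (sc.textFile("input.txt")       # placeholder input path
            .flatMap(lambda line: line.split())
            .map(lambda word: (word, 1))
            .reduceByKey(lambda a, b: a + b))

for word, n in counts.collect():
    print(word, n)

spark.stop()
```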
Plotly is one of the big data analysis tools that lets
users create charts and dashboards to share online.
Features:
• Easily turn any data into eye-catching
and informative graphics.
• It provides audited industries with fine-
grained information on data provenance.
• Plotly offers unlimited public file hosting
through its free community plan.
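A minimal Plotly Express sketch (with made-up numbers) shows how a few lines turn data into an interactive, shareable chart:

```python
import plotly.express as px

# Toy data for illustration; a dict, list, or DataFrame all work.
data = {"tool": ["Spark", "Plotly", "HDInsight", "Talend"],
        "mentions": [120, 80, 60, 45]}

# One call builds an interactive bar chart; fig.show() opens it in the
# browser, and fig.write_html("chart.html") saves a shareable page.
fig = px.bar(data, x="tool", y="mentions", title="Toy tool mentions")
fig.show()
```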
Azure HDInsight is a Spark and Hadoop service in the cloud. It provides
big data cloud offerings in two categories, Standard and Premium. It
provides an enterprise-scale cluster for the organization to run their big
data workloads.
Features:
• Reliable analytics with an industry-leading SLA.
• It offers enterprise-grade security and monitoring.
• Protect data assets and extend on-premises security and governance
controls to the cloud.
• High-productivity platform for developers and data scientists.
• Integration with leading productivity applications.
• Deploy Hadoop in the cloud without purchasing new hardware or
paying other up-front costs.
Skytree is one of the best big data analytics tools; it empowers data scientists to build more accurate models faster. It offers accurate predictive machine learning models that are easy to use.
Features:
• Highly scalable algorithms.
• Artificial intelligence for data scientists.
• It allows data scientists to visualize and understand the logic behind ML decisions.
• Model interpretability.
• It is designed to solve robust predictive problems with data preparation capabilities.
• Programmatic and GUI access: use Skytree via the easy-to-adopt GUI or programmatically in Java.
Talend is a big data analytics software that simplifies and
automates big data integration. Its graphical wizard generates
native code. It also allows big data integration, master data
management and checks data quality.
Features:
• Simplify ETL & ELT for big data.
• Talend Big Data Platform simplifies using MapReduce and Spark
by generating native code.
• Smarter data quality with machine learning and
natural language processing.
• Agile DevOps to speed up big data projects.
• Streamline all the DevOps processes.
🞂 Big data refers to the sets of digital data produced by the use of new technologies for personal or professional purposes.
🞂 Big Data analytics is the process of
examining these data in order to uncover
hidden patterns, market trends, customer
preferences and other useful information in
order to make the right decisions.
🞂 Big Data Analytics is a fast-growing technology. It has been adopted by the most unexpected industries and has become an industry of its own.
🞂 But the analysis of this data within the Big Data framework can sometimes seem quite intrusive.
🞂 Analytics is a data science.
🞂 BI takes care of the decision-making
part while Data Analytics is the process of
asking questions.
🞂 Analytics tools are used when a company needs to do forecasting and wants to know what will happen in the future, while BI tools help transform those forecasts into common language.
🞂 More often, Big Data is considered as
the successor to Business Intelligence.
Thank you