Unit 2 Da
Research, Indore
www.acropolis.in
Data Analytics
By: Mr. Ronak Jain
Table of Contents
UNIT-II:
INTRODUCTION TO BIG DATA: Big Data and its Importance, Four V’s of Big Data, Drivers for Big Data, Introduction
to Big Data Analytics, Big Data Analytics applications.
BIG DATA TECHNOLOGIES: Hadoop’s Parallel World, Data discovery, Open source technology for Big Data Analytics,
cloud and Big Data, Predictive Analytics, Mobile Business Intelligence and Big Data, Crowd Sourcing Analytics, Inter- and
Trans-Firewall Analytics, Information Management.
“Extremely large data sets that may be analyzed computationally to reveal patterns,
trends and associations, especially relating to human behavior and interaction, are
known as Big Data.”
Examples Of Big Data
Following are some examples of Big Data:
The New York Stock Exchange generates about one terabyte of new trade data per day.
Social Media
Statistics show that 500+ terabytes of new data are ingested into the databases of the social
media site Facebook every day. This data is mainly generated through photo and
video uploads, message exchanges, comments, etc.
TWITTER
A single jet engine can generate 10+ terabytes of data in 30 minutes of flight time. With
many thousands of flights per day, the data generated reaches many petabytes.
Tabular Representation of various Memory Sizes
Yottabyte = 1,024 zettabytes = 1,208,925,819,614,629,174,706,176 bytes
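The yottabyte figure above follows from repeatedly multiplying by 1,024. The short sketch below reproduces the table under the assumption that every unit is a power of 1,024 bytes (the binary convention used in the row above); the names `UNITS` and `unit_sizes` are illustrative only.

```python
# Binary (power-of-1024) storage units, smallest to largest.
UNITS = ["kilobyte", "megabyte", "gigabyte", "terabyte",
         "petabyte", "exabyte", "zettabyte", "yottabyte"]


def unit_sizes():
    """Return a dict mapping each unit name to its size in bytes."""
    return {name: 1024 ** (i + 1) for i, name in enumerate(UNITS)}


sizes = unit_sizes()
for name, size in sizes.items():
    # Print each unit with thousands separators for readability.
    print(f"{name:<10} {size:,} bytes")
```

The last line printed is the yottabyte row, 1,208,925,819,614,629,174,706,176 bytes.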
Characteristics Of Big Data
• The following are known as “Big Data Characteristics”.
1. Volume
2. Velocity
3. Variety
4. Veracity
1. Volume:
Volume means “how much data is generated”. Nowadays,
organizations, people and systems generate or receive
very large amounts of data, from terabytes (TB) to petabytes (PB) to
exabytes (EB) and more.
2. Velocity:
Velocity means “how fast data is produced”. Nowadays, organizations,
people and systems generate huge amounts of data at a very
fast rate.
3. Variety:
Variety means “different forms of data”. Nowadays, organizations,
people and systems generate huge amounts of data at a very fast
rate in different formats. We will discuss the different formats of
data in detail soon.
4. Veracity:
Veracity means “the quality, correctness or accuracy of captured data”.
Of the 4 Vs, it is the most important V for any Big Data solution, because
without correct data there is no use in storing large amounts of data
at a fast rate and in different formats. The data should give correct business
value.
Types of Digital Data
1. Structured
2. Unstructured
3. Semi-structured
Structured
Any data that can be stored, accessed and processed in a fixed format
is termed 'structured' data.
Over time, computer scientists have had considerable success in
developing techniques for working with this kind of data (where the format is well
known in advance) and for deriving value from it.
However, issues now arise when the size of such data grows
to a huge extent, with typical sizes in the range of multiple zettabytes.
Do you know? 10^21 bytes equal 1 zettabyte; equivalently, one billion terabytes form a zettabyte.
Looking at these figures, one can easily understand why the name Big Data is
used and imagine the challenges involved in its storage and processing.
Do you know? Data stored in a relational database management system is
one example of 'structured' data.
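As a minimal sketch of structured data in a relational database, the example below uses Python's built-in `sqlite3` module. The `employee` table, its columns and the sample rows are hypothetical and chosen only to show a fixed format known in advance.

```python
import sqlite3

# In-memory database; table, columns and rows are hypothetical examples.
conn = sqlite3.connect(":memory:")

# The schema fixes the format of every row in advance.
conn.execute(
    "CREATE TABLE employee ("
    "id INTEGER PRIMARY KEY, name TEXT NOT NULL, salary REAL NOT NULL)"
)
conn.executemany(
    "INSERT INTO employee (id, name, salary) VALUES (?, ?, ?)",
    [(1, "Asha", 52000.0), (2, "Ravi", 61000.0)],
)

# Because the format is fixed, the data can be queried by column.
rows = conn.execute(
    "SELECT name, salary FROM employee WHERE salary > ? ORDER BY id",
    (55000,),
).fetchall()
print(rows)
conn.close()
```

Every row must match the declared columns, which is what makes the data easy to store, access and process.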
[Figure: types of analytics arranged by difficulty. Recoverable labels: Descriptive Analytics ("What happened?") and Diagnostic Analytics ("... happen?").]
Big Data technologies can be divided into two groups: batch processing,
which is analytics on data at rest, and stream processing, which is
analytics on data in motion.
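The difference between the two groups can be illustrated with a small sketch, assuming the analytic is a simple average. `batch_mean` sees the whole data set at once (data at rest), while the hypothetical `StreamingMean` class updates its result as each value arrives (data in motion). Neither is taken from a specific Big Data framework.

```python
def batch_mean(values):
    """Batch: all data is at rest, so compute over the full list at once."""
    return sum(values) / len(values)


class StreamingMean:
    """Stream: keep a running result and update it per incoming value."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0

    def update(self, value):
        # Incremental mean: no need to store earlier values.
        self.count += 1
        self.mean += (value - self.mean) / self.count
        return self.mean


readings = [10.0, 20.0, 30.0, 40.0]

# Batch processing: one answer after all data has been collected.
print(batch_mean(readings))

# Stream processing: an up-to-date answer after every new reading.
stream = StreamingMean()
running = [stream.update(x) for x in readings]
print(running)
```

Both approaches reach the same final value, but the streaming version has an answer available at every step and never holds the full data set in memory.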
Big Data Analytics
Big Data analytics is the process of collecting, organizing and analyzing
large sets of data (called Big Data) to discover patterns and other useful
information.
Big Data analytics can help organizations to better understand the
information contained within the data and will also help identify the data
that is most important to the business and future business decisions.
Analysts working with Big Data typically want the knowledge that comes
from analyzing the data.
High-Performance Analytics Required:
To analyze such a large volume of data, Big Data analytics is typically
performed using specialized software tools and applications for
predictive analytics, data mining, text mining, forecasting and
data optimization.
Collectively these processes are separate but highly integrated functions of
high-performance analytics.
Using Big Data tools and software enables an organization to process extremely
large volumes of data that a business has collected to determine which data is
relevant and can be analyzed to drive better business decisions in the future.
The Challenges:
For most organizations, Big Data analysis is a challenge. Consider the sheer
volume of data and the different formats of the
data (both structured and unstructured data) that is collected across the entire
organization and the many different ways different types of data can be
combined, contrasted and analyzed to find patterns and other useful business
information.
The first challenge is in breaking down data silos to access all data an
organization stores in different places and often in different systems.
A second challenge is in creating platforms that can pull in unstructured data as
easily as structured data.
This massive volume of data is typically so large that it's difficult to process
using traditional database and software methods.
How Big Data Analytics is Used Today:
As the technology that helps an organization to break down data silos and analyze
data improves, business can be transformed in all sorts of ways.
Today's advances in analyzing big data allow researchers to decode human DNA in
minutes, predict where terrorists plan to attack, determine which gene is most likely
to be responsible for certain diseases and, of course, which ads you are most likely
to respond to on Facebook.
Another example comes from one of the biggest mobile carriers in the world.
France's Orange launched its Data for Development project by releasing subscriber
data for customers in the Ivory Coast.
The 2.5 billion records, which were made anonymous, included details on calls
and text messages exchanged between 5 million users.
Researchers accessed the data and sent Orange proposals for how the data could serve
as the foundation for development projects to improve public health and safety.
Proposed projects included one that showed how to improve public safety by tracking
cell phone data to map where people went after emergencies; another showed how to
use cellular data for disease containment.
Application of Big Data
Let’s discuss the applications of Big Data in detail.
See how Big Data is revolutionizing the travel & tourism sector.
Through Big Data and analytics, travel companies are now able to offer more
customized traveling experience. They are now able to understand their customer’s
requirements in a much-enhanced way.
From providing them with the best offers to making suggestions in real time,
Big Data is certainly a perfect guide for any traveler. Big Data is gradually taking
the window seat in the travel industry.
8. Big Data in Telecom
The telecom industry is the soul of every digital revolution that takes place around the world.
The ever-increasing popularity of smartphones has flooded the telecom industry with
massive amounts of data.
This data is like a goldmine; telecom companies just need to know how to dig it properly.
Through Big Data and analytics, companies are able to provide customers with smooth
connectivity, removing the network barriers that customers have to deal with.
With the help of Big Data and analytics, companies can now track the areas with the lowest and
the highest network traffic and take the necessary steps to ensure hassle-free network
connectivity.
As in other industries, Big Data has helped the telecom industry understand its customers
well.
Telecom companies now provide customers with offers that are as customized as possible.
Big Data has been behind the data revolution we are currently experiencing.
9. Big Data in Automobile
“A business, like an automobile, has to be driven, in order to get results.” - B.C. Forbes
Big Data has now taken complete control of the automobile industry and is driving it
smoothly, towards results that were previously unimaginable.
The automobile industry is on a roll and Big Data is its wheels; one might even say Big Data has
given it wings. Big Data has helped the automobile industry achieve things that were
beyond our imagination.
From analyzing trends to understanding supply chain management, and from taking
care of its customers to turning our wildest dream of connected cars into a reality, Big Data is
well and truly driving the automobile industry forward.
Applications and key data sources for big data and business analytics
Use cases for Big data analytics
Big Data Analytics, The Class
Goal: Generalizations
A model or summarization of the data.
Classical Data Analytics
[Figure: a single machine with CPU, Memory (64 GB) and Disk.]
IO Bounded
Reading a word from disk versus main memory: 10^5 times slower!
Reading many contiguously stored words is
faster per word, but fast modern disks
still only reach ~1 GB/s for sequential reads.
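To get a feel for what the ~1 GB/s sequential limit means, the sketch below estimates how long one full pass over a large file takes. The 1 TB file size is an assumed example, and `sequential_read_seconds` is a hypothetical helper; only the ~1 GB/s figure comes from the text above.

```python
def sequential_read_seconds(size_bytes, throughput_bytes_per_s=1e9):
    """Time to read a file once at a given sequential throughput."""
    return size_bytes / throughput_bytes_per_s


TB = 10 ** 12  # decimal terabyte, in bytes

# Assumed example: one full pass over a 1 TB file at ~1 GB/s.
seconds = sequential_read_seconds(1 * TB)
print(f"{seconds:.0f} s (~{seconds / 60:.1f} min)")
```

Even under ideal sequential access, a single machine needs on the order of a quarter of an hour per terabyte, before any computation is done on the data.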
Classical focus: efficient use of disk,
e.g. Apache Lucene / Solr.
Classical limitation: still IO bounded when
needing to process all of a large file.
How to solve?
Distributed Architecture
[Figure: Rack 1, Rack 2, ... each with its own switch (~1 Gbps), connected to one another through a higher-level switch (~10 Gbps).]
1. Nodes fail
About 1 in 1,000 nodes fails per day.
2. The network is a bottleneck
Throughput is typically 1-10 Gb/s.
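The failure rate above matters because clusters are large. The following sketch works through the arithmetic for an assumed cluster of 1,000 nodes, under the additional assumption (not stated in the slides) that nodes fail independently. Both function names are hypothetical.

```python
def expected_failures_per_day(nodes, daily_failure_rate=1 / 1000):
    """Average number of node failures per day in the cluster."""
    return nodes * daily_failure_rate


def prob_at_least_one_failure(nodes, daily_failure_rate=1 / 1000):
    """Chance that at least one node fails today.

    Assumes independent failures, which is a simplification.
    """
    return 1 - (1 - daily_failure_rate) ** nodes


cluster_size = 1000  # assumed example size
expected = expected_failures_per_day(cluster_size)
p_any = prob_at_least_one_failure(cluster_size)
print(f"expected failures/day: {expected:.1f}")
print(f"P(at least one failure today): {p_any:.2f}")
```

With 1,000 nodes, a failure is expected roughly once a day, so a distributed system has to treat node failure as routine rather than exceptional.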
Distributed Filesystem
(e.g. Apache Hadoop DFS, GoogleFS, EMRFS)
C, D: Two different files
“Hadoop” was named after a toy elephant belonging to Doug Cutting’s son. Cutting was
one of Hadoop’s creators.
https://opensource.com/life/14/8/intro-apache-hadoop-big-data
Distributed Filesystem
Files C and D are split into chunks:
C: C0, C1, C2, C3, C4, C5
D: D0, D1, D2, D3, D4, D5
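The chunk labels above can be mimicked with a small sketch. It assumes fixed-size chunks and a simple round-robin placement across three hypothetical nodes. Real systems such as HDFS use their own placement policies and also replicate each chunk, which this sketch omits.

```python
def split_into_chunks(name, data, chunk_size):
    """Split bytes into fixed-size chunks named C0, C1, ... for file C."""
    return {
        f"{name}{i}": data[start:start + chunk_size]
        for i, start in enumerate(range(0, len(data), chunk_size))
    }


def place_round_robin(chunk_names, nodes):
    """Assign chunks to nodes in turn (a simplified placement policy)."""
    placement = {node: [] for node in nodes}
    for i, chunk in enumerate(chunk_names):
        placement[nodes[i % len(nodes)]].append(chunk)
    return placement


# Hypothetical 60-byte file "C" split into six 10-byte chunks.
file_c = bytes(range(60))
chunks_c = split_into_chunks("C", file_c, chunk_size=10)

# Hypothetical node names; spread the chunks over three machines.
placement = place_round_robin(list(chunks_c), ["node1", "node2", "node3"])
print(placement)
```

Joining the chunks back in order recovers the original file, which is how a client sees one logical file even though its pieces live on different machines.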
Distributed Filesystem
Rack switch