
Unit 2 Da

Big data analytics is the process of analyzing large datasets to discover patterns and useful information. It allows organizations to better understand patterns in data to make business decisions. Analyzing big data requires specialized software tools due to the large volume and complexity of data. Challenges include integrating data from different sources and systems in different formats. Today, big data analytics is used to decode DNA, predict terrorist attacks, and determine disease genes.

Acropolis Institute of Technology & Research, Indore
www.acropolis.in
Data Analytics
By: Mr. Ronak Jain
Table of Contents
UNIT-II:
INTRODUCTION TO BIG DATA: Big Data and its Importance, Four V’s of Big Data, Drivers for Big Data, Introduction to Big Data Analytics, Big Data Analytics applications.
BIG DATA TECHNOLOGIES: Hadoop’s Parallel World, Data discovery, Open source technology for Big Data Analytics, Cloud and Big Data, Predictive Analytics, Mobile Business Intelligence and Big Data, Crowd Sourcing Analytics, Inter- and Trans-Firewall Analytics, Information Management.

December 13, 2023 3


 Introduction to Big Data
 What is Data?
The quantities, characters, or symbols on which operations are performed by a computer,
which may be stored and transmitted in the form of electrical signals and recorded on
magnetic, optical, or mechanical recording media.
 What is Big Data?
Big Data is data of very large size. The term describes a
collection of data that is huge in volume and still growing exponentially with time. In
short, such data is so large and complex that traditional data
management tools cannot store or process it efficiently.

 “Extremely large data sets that may be analyzed computationally to reveal patterns,
trends and associations, especially relating to human behavior and interaction, are
known as Big Data.”
 Examples Of Big Data
Following are some examples of Big Data:
 The New York Stock Exchange generates about one terabyte of new trade data per day.
 Social Media
Statistics show that 500+ terabytes of new data are ingested into the databases of the social
media site Facebook every day. This data is mainly generated by photo and
video uploads, message exchanges, comments, etc.

 A single jet engine can generate 10+ terabytes of data in 30 minutes of flight time. With
many thousands of flights per day, data generation reaches many petabytes.
Tabular Representation of Various Memory Sizes

Name        Equal To            Size (in bytes)
Bit         1 bit               1/8
Nibble      4 bits              1/2 (rare)
Byte        8 bits              1
Kilobyte    1,024 bytes         1,024
Megabyte    1,024 kilobytes     1,048,576
Gigabyte    1,024 megabytes     1,073,741,824
Terabyte    1,024 gigabytes     1,099,511,627,776
Petabyte    1,024 terabytes     1,125,899,906,842,624
Exabyte     1,024 petabytes     1,152,921,504,606,846,976
Zettabyte   1,024 exabytes      1,180,591,620,717,411,303,424
Yottabyte   1,024 zettabytes    1,208,925,819,614,629,174,706,176
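The byte counts in the table follow one rule: from the kilobyte upward, each unit is 1,024 times the previous one. A minimal Python sketch that regenerates them is shown here; the names `UNITS` and `size_in_bytes` are illustrative and do not come from the slides.

```python
# Binary (1024-based) unit sizes, matching the table's convention.
UNITS = ["Byte", "Kilobyte", "Megabyte", "Gigabyte", "Terabyte",
         "Petabyte", "Exabyte", "Zettabyte", "Yottabyte"]

def size_in_bytes(unit: str) -> int:
    """Return the number of bytes in one `unit`, using 1 Kilobyte = 1024 bytes."""
    return 1024 ** UNITS.index(unit)

# Print each unit next to its size, with thousands separators.
for unit in UNITS:
    print(f"{unit:<10} {size_in_bytes(unit):>35,}")
```

Python integers have arbitrary precision, so even the yottabyte row is computed exactly.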
 Characteristics Of Big Data
• The following are known as “Big Data Characteristics”.
1. Volume
2. Velocity
3. Variety
4. Veracity
1. Volume:
Volume means “how much data is generated”. Nowadays,
organizations, people and systems generate or receive
vast amounts of data, from terabytes (TB) to petabytes (PB) to
exabytes (EB) and more.
2. Velocity:
Velocity means “how fast data is produced”. Nowadays, organizations,
people and systems generate huge amounts of data at a very
fast rate.

3. Variety:
Variety means “different forms of data”. Nowadays, organizations,
people and systems generate huge amounts of data at a very fast
rate in different formats. The different formats of
data are discussed under Types of Digital Data.
4. Veracity:
Veracity means “the quality, correctness or accuracy of captured data”.
Of the four Vs, it is the most important for any Big Data solution, because
without correct data there is no use in storing large amounts of data
at a fast rate and in different formats. The data should deliver correct business
value.
 Types of Digital Data
1. Structured
2. Unstructured
3. Semi-structured

 Structured
 Any data that can be stored, accessed and processed in a fixed format
is termed 'structured' data.
 Over time, computer science has achieved great success in
developing techniques for working with this kind of data (where the format is well
known in advance) and for deriving value from it.
 However, issues now arise when the size of such data grows
to a huge extent, with typical sizes in the range of multiple zettabytes.

 Do you know? 10^21 bytes equal 1 zettabyte; that is, one billion terabytes form a zettabyte.
Looking at these figures, one can easily understand why the name Big Data is
given and imagine the challenges involved in its storage and processing.
 Do you know? Data stored in a relational database management system is
one example of 'structured' data.

• Examples Of Structured Data


An 'Employee' table in a database is an example of Structured Data

Employee_ID   Employee_Name     Gender   Department   Salary_In_lacs
2365          Rajesh Kulkarni   Male     Finance      650000
3398          Pratibha Joshi    Female   Admin        650000
7465          Shushil Roy       Male     Admin        500000
7500          Shubhojit Das     Male     Finance      500000
7699          Priya Sane        Female   Finance      550000
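Because the format of structured data is fixed in advance, it can be queried declaratively. The sketch below loads the Employee rows into an in-memory SQLite database and averages salary by department. The table name `employee`, the column name `salary`, and the choice of SQLite are assumptions made for illustration; the slides only say the data lives in a relational database.

```python
import sqlite3

# In-memory database; the schema mirrors the Employee table shown in the slides.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE employee (
        employee_id   INTEGER PRIMARY KEY,
        employee_name TEXT,
        gender        TEXT,
        department    TEXT,
        salary        INTEGER
    )
""")
rows = [
    (2365, "Rajesh Kulkarni", "Male", "Finance", 650000),
    (3398, "Pratibha Joshi", "Female", "Admin", 650000),
    (7465, "Shushil Roy", "Male", "Admin", 500000),
    (7500, "Shubhojit Das", "Male", "Finance", 500000),
    (7699, "Priya Sane", "Female", "Finance", 550000),
]
conn.executemany("INSERT INTO employee VALUES (?, ?, ?, ?, ?)", rows)

# The fixed columns let one query group and aggregate without custom parsing.
avg_by_dept = dict(conn.execute(
    "SELECT department, AVG(salary) FROM employee GROUP BY department"
).fetchall())
print(avg_by_dept)
conn.close()
```

The same question asked of unstructured data (free text, images, video) would first require extracting the fields, which is the extra processing burden the next slides describe.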
 Unstructured
 Any data whose form or structure is unknown is classified as unstructured data.
 In addition to its huge size, unstructured data poses multiple challenges in terms
of processing it to derive value.
 A typical example of unstructured data is a heterogeneous data source containing a
combination of simple text files, images, videos etc.
 Nowadays organizations have a wealth of data available to them but, unfortunately, they
don't know how to derive value from it since this data is in its raw form or
unstructured format.
• Examples Of Un-structured Data
The output returned by 'Google Search'
Data Analytics Models
[Figure: four analytics models plotted by VALUE against DIFFICULTY, in increasing order of both:
Descriptive Analytics: What happened?
Diagnostic Analytics: Why did it happen?
Predictive Analytics: What will happen?
Prescriptive Analytics: How can we make it happen?]
Big Data technologies can be divided into two groups: batch processing,
which is analytics on data at rest, and stream processing, which is
analytics on data in motion.
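The difference between the two groups can be sketched with a single statistic, the mean. This is a toy single-machine illustration, not the API of any Big Data framework. The function names and the `readings` list are hypothetical.

```python
from typing import Iterable, Iterator, List

def batch_mean(values: List[float]) -> float:
    """Batch: the whole dataset is at rest, so one pass gives one answer."""
    return sum(values) / len(values)

def streaming_mean(stream: Iterable[float]) -> Iterator[float]:
    """Stream: update the answer as each record arrives, keeping O(1) state."""
    count, total = 0, 0.0
    for value in stream:
        count += 1
        total += value
        yield total / count

readings = [10.0, 20.0, 30.0, 40.0]      # hypothetical sensor readings
print(batch_mean(readings))              # a single result after all data is stored
print(list(streaming_mean(readings)))    # a running result after every record
```

The batch version needs all the data before it can answer. The streaming version never stores the data and always has a current answer.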
 Big Data Analytics
 Big Data Analytics:
 Big Data analytics is the process of collecting, organizing and analyzing
large sets of data (called Big Data) to discover patterns and other useful
information.
 Big Data analytics can help organizations to better understand the
information contained within the data and will also help identify the data
that is most important to the business and future business decisions.
Analysts working with Big Data typically want the knowledge that comes
from analyzing the data.
 High-Performance Analytics Required:
 To analyze such a large volume of data, Big Data analytics is typically
performed using specialized software tools and applications for
predictive analytics, data mining, text mining, forecasting and
data optimization.
 Collectively these processes are separate but highly integrated functions of
high-performance analytics.
 Using Big Data tools and software enables an organization to process extremely
large volumes of data that a business has collected to determine which data is
relevant and can be analyzed to drive better business decisions in the future.
 The Challenges:
 For most organizations, Big Data analysis is a challenge. Consider the sheer
volume of data and the different formats of the
data (both structured and unstructured) that is collected across the entire
organization, and the many ways different types of data can be
combined, contrasted and analyzed to find patterns and other useful business
information.
 The first challenge is in breaking down data silos to access all data an
organization stores in different places and often in different systems.
 A second challenge is in creating platforms that can pull in unstructured data as
easily as structured data.
 This massive volume of data is typically so large that it's difficult to process
using traditional database and software methods.
 How Big Data Analytics is Used Today:
 As the technology that helps an organization to break down data silos and analyze
data improves, business can be transformed in all sorts of ways.
 Today's advances in analyzing big data allow researchers to decode human DNA in
minutes, predict where terrorists plan to attack, determine which gene is most likely
to be responsible for certain diseases and, of course, which ads you are most likely
to respond to on Facebook.
 Another example comes from one of the biggest mobile carriers in the world.
 France's Orange launched its Data for Development project by releasing subscriber
data for customers in the Ivory Coast.
 The 2.5 billion records, which were made anonymous, included details on calls
and text messages exchanged between 5 million users.
 Researchers accessed the data and sent Orange proposals for how the data could serve
as the foundation for development projects to improve public health and safety.
 Proposed projects included one that showed how to improve public safety by tracking
cell phone data to map where people went after emergencies; another showed how to
use cellular data for disease containment.
 Application of Big Data
 Let’s discuss the applications of Big Data in detail.

1. Big Data in Retail


 The retail industry faces some of the fiercest competition of all. Retailers
constantly hunt for ways to gain a competitive edge over others.
The saying that the customer is king holds for the retail industry in particular.
 For retailers to thrive in this competitive world, they need to understand their
customers better. If they are aware of their customers’ needs and how to
fulfill those needs in the best possible way, they have what they need to compete.
 Through advanced analysis of their customers’ data, retailers are now able to
understand them from every angle possible. They gather this data from
various sources such as social media, loyalty programs, etc.
 Even a minute detail about any customer has now become significant for them. They are
now closer to their customers than they have ever been. This empowers them to provide
customers with more personalized services and predict their demands in advance.
 This helps them in building a loyal customer base. Some of the biggest names in the retail
world, like Walmart, Sears Holdings, Costco, Walgreens, and many more, now have Big
Data as an integral part of their organizations.
 A study by the National Retail Federation estimated that sales in November and December
are responsible for as much as 30% of retail annual sales.
2. Big Data in Healthcare
 Big Data and healthcare are an ideal match. It complements the healthcare industry better
than anything ever will. The amount of data the healthcare industry has to deal with is
unimaginable.
 Gone are the days when healthcare practitioners were incapable of harnessing this data.
From searching for cancer cures to detecting Ebola and much more, Big Data has been
put to work, and researchers have seen some life-saving outcomes through it.
 Big Data and analytics have given them the license to build more personalized
medications. Data analysts are harnessing this data to develop more and more effective
treatments. Identifying unusual patterns of certain medicines to discover ways for
developing more economical solutions is a common practice these days.
 Smart wearables have gradually gained popularity and are the latest trend among
people of all age groups. They generate massive amounts of real-time data in
the form of alerts, which helps in saving lives.
3. Big Data in Education
 When you ask people about the use of the data that an educational institute gathers, the
majority will give the same answer: that the institute or the student might
need it for future reference.
 Perhaps you had the same perception about this data. But the fact is, this data
holds enormous importance. Big Data is the key to shaping the future of the people
and has the power to transform the education system for the better.
 Some of the top universities are using Big Data as a tool to renovate their academic
curriculum. Additionally, universities can even track the dropout rates of the students
and are taking the required measures to reduce this rate as much as possible.
4. Big Data in E-commerce
 One of the greatest revolutions this generation has seen is that of E-commerce. It is now
part and parcel of our routine life. Whenever we need to buy something, the first thought
that comes to mind is E-commerce. And, not surprisingly, Big Data has been the face
of it.
 That some of the biggest E-commerce companies of the world, like Amazon, Flipkart, Alibaba, and
many more, are now bound to Big Data and analytics is itself evidence of the level of
popularity Big Data has gained in recent times.
 Big Data is now as important as any other asset in these organizations. Amazon, the biggest
E-commerce firm in the world and one of the pioneers of Big Data and analytics, has Big Data as
the backbone of its system. Flipkart, the biggest E-commerce firm in India, has one of the most
robust data platforms in the country.
 The recommendation engine is one of the most striking applications the Big Data world
has witnessed. It furnishes companies with a 360-degree view of their customers.
 Companies then make suggestions to customers accordingly. Customers now experience more personalized
services than they have ever had. Big Data has completely redefined people’s online shopping
experiences.
6. Big Data in Finance
 The functioning of any financial organization depends heavily on its data, and safeguarding that
data is one of the toughest challenges any financial firm faces. Data has been the second most
important commodity for them after money.
 Even before Big Data gained popularity, the finance industry was already a leader in
technology adoption. In addition, financial firms were among the earliest adopters of Big Data
and Analytics.
 Digital banking and payments are two of the most trending buzzwords around, and Big
Data has been at the heart of both. Big Data now drives key areas of financial firms such as
fraud detection, risk analysis, algorithmic trading, and customer satisfaction.
 This has brought much-needed fluency to their systems. They are now empowered to focus
more on providing better services to their customers rather than focusing on security issues.
Big Data has given the financial system answers to its hardest challenges.
7. Big Data in Travel Industry

 While Big Data is spreading like wildfire and various industries have been putting
it to use, the travel industry was a bit late to realize its worth. Better late than never, though.
Having a stress-free traveling experience is still like a daydream for many.
 And now Big Data’s arrival is like a ray of hope that will mark the departure of all the
hindrances to a smooth traveling experience.
 Through Big Data and analytics, travel companies are now able to offer a more
customized traveling experience. They are now able to understand their customers’
requirements much better.
 From providing them with the best offers to making suggestions in real time,
Big Data is certainly a perfect guide for any traveler. Big Data is gradually taking
the window seat in the travel industry.
8. Big Data in Telecom
 The telecom industry is the soul of every digital revolution that takes place around the world.
The ever-increasing popularity of smartphones has flooded the telecom industry with
massive amounts of data.
 This data is like a goldmine; telecom companies just need to know how to dig it properly.
Through Big Data and analytics, companies are able to provide customers with smooth
connectivity, reducing the network barriers that customers have to deal with.
 With the help of Big Data and analytics, companies can now track the areas with the lowest as
well as the highest network traffic and take the necessary steps to ensure hassle-free network
connectivity.
 As in other industries, Big Data has helped the telecom industry to understand its customers
well.
 Telecom companies now provide customers with offers as customized as possible.
 Big Data has been behind the data revolution we are currently experiencing.
9. Big Data in Automobile
 “A business, like an automobile, has to be driven in order to get results.” B.C. Forbes
 Big Data has now taken the wheel of the automobile industry and is driving it
smoothly towards results that were previously out of reach.
 The automobile industry is on a roll, and Big Data is its wheels, or one might say Big Data has
given it wings. Big Data has helped the automobile industry achieve things that were
beyond our imagination.

 From analyzing trends to understanding supply chain management, from taking
care of customers to making our wildest dream of connected cars a reality, Big Data is
well and truly driving the automobile industry forward.
Applications and key data sources for big data and business analytics
Use cases for Big data analytics
Big Data Analytics, The Class
Goal: Generalizations
A model or summarization of the data.

Data Workflow Frameworks:
Hadoop File System
Spark
MapReduce
Deep Learning Frameworks
Streaming

Analytics and Algorithms:
Similarity Search
Hypothesis Testing
Transformers/Self-Supervision
Recommendation Systems
Link
Big Data Analytics, The Class
[Figure: Big Data Analytics, labeled with Workflow Systems, Algorithms, Statistical Methods, and Distributed Tools]
Classical Data Analytics
[Figure: a single machine with CPU, Memory (64 GB), and Disk]
IO Bounded
Reading a word from disk versus main memory: 10^5 times slower!
Reading many contiguously stored words is
faster per word, but fast modern disks
still only reach ~1GB/s for sequential reads.

IO Bound: biggest performance bottleneck is reading / writing to disk.
Starts around 500 GBs: >10 minutes just to read.
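The "just to read" figure comes from dividing data size by sequential throughput. The sketch below works through that arithmetic. The helper name `read_time_minutes` and the 0.8 GB/s sustained rate are assumptions for illustration; the slides give only the ~1 GB/s ceiling.

```python
def read_time_minutes(size_gb: float, throughput_gb_per_s: float) -> float:
    """Minutes needed to read `size_gb` sequentially at a sustained rate."""
    return size_gb / throughput_gb_per_s / 60

# ~1 GB/s is the best-case sequential rate quoted in the slides.
print(round(read_time_minutes(500, 1.0), 1))   # 8.3
# An assumed sustained rate of 0.8 GB/s already exceeds 10 minutes.
print(round(read_time_minutes(500, 0.8), 1))   # 10.4
```

Since ~1 GB/s is a ceiling rather than a typical rate, a full scan of 500 GB on one disk lands at roughly ten minutes or more, before any computation is done on the data.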
Classical Big Data
[Figure: a single machine with CPU, Memory, and Disk]
Classical focus: efficient use of disk,
e.g. Apache Lucene / Solr.
Classical limitation: still bounded when
needing to process all of a large file.
How to solve?
Distributed Architecture
[Figure: a top-level switch (~10Gbps) connects Rack 1, Rack 2, ...; each rack has its own switch (~1Gbps) connecting its nodes, and each node has its own CPU, Memory, and Disk]
In reality, modern setups often have multiple CPUs and disks per
server, but we will model as if there is one machine
per CPU-disk pair.
Distributed Architecture (Cluster)

Challenges for IO Cluster Computing

1. Nodes fail
1 in 1000 nodes fail a day
Response: Duplicate Data
2. Network is a bottleneck
Typically 1-10 Gb/s throughput
Response: Bring computation to nodes, rather than
data to nodes.
3. Traditional distributed programming is
often ad-hoc and complicated
Response: Stipulate a
programming system that can easily be
distributed

HDFS with MapReduce accomplishes all!
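The "1 in 1000 nodes fail a day" figure shows why failure becomes routine as a cluster grows. The sketch below assumes each node fails independently with daily probability 0.001; independence is a simplifying assumption, and the function names are illustrative.

```python
def expected_failures_per_day(nodes: int, p_fail: float = 0.001) -> float:
    """Expected number of node failures per day in a cluster of `nodes`."""
    return nodes * p_fail

def prob_any_failure(nodes: int, p_fail: float = 0.001) -> float:
    """Probability that at least one node fails on a given day."""
    return 1 - (1 - p_fail) ** nodes

# Cluster sizes chosen for illustration only.
for n in (10, 1000, 10000):
    print(n, expected_failures_per_day(n), round(prob_any_failure(n), 3))
```

Under these assumptions a 1,000-node cluster expects about one failure every day, which is why data is duplicated rather than trusting any single node.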
Distributed Filesystem

The effectiveness of MapReduce, Spark, and other
distributed processing systems is in part simply
due to use of a distributed filesystem!
Distributed Filesystem

Characteristics for Big Data Tasks
Large files (i.e. >100 GB to TBs)
Reads are most common
No need to update in place
(append preferred)
Distributed Filesystem

(e.g. Apache HadoopDFS, GoogleFS, EMRFS)

“Hadoop” was named after a toy elephant belonging to Doug Cutting’s son. Cutting was
one of Hadoop’s creators.
(https://opensource.com/life/14/8/intro-apache-hadoop-big-data)

C, D: Two different files; break into chunks (or "partitions"): C0 to C5 and D0 to D5.
[Figure: the chunks of C and D spread across chunk server 1, chunk server 2, chunk server 3, ..., chunk server n]

(Leskovec et al., 2014; http://www.mmds.org/)
Components of a Distributed Filesystem
Chunk servers (on Data Nodes)
  File is split into contiguous chunks
  Typically each chunk is 16-64MB
  Each chunk replicated (usually 2x or 3x)
  Try to keep replicas in different racks
Name node (aka master node)
  Stores metadata about where files are stored
  Might be replicated or distributed across data nodes.
Client library for file access
  Talks to master to find chunk servers
  Connects directly to chunk servers to access data
(Leskovec et al., 2014; http://www.mmds.org/)
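The three components can be sketched as a toy, in-memory model. This is not the HDFS or GoogleFS implementation or API. The server names, the rack layout, the round-robin placement rule, and the class and function names are all hypothetical; only the 64 MB chunk size, 3x replication, and the "different racks" goal come from the slides.

```python
from typing import Dict, List, Tuple

CHUNK_SIZE = 64 * 1024 * 1024   # 64 MB, the upper end of the 16-64 MB range
REPLICAS = 3                    # "usually 2x or 3x"

# Hypothetical cluster: chunk server name -> rack name.
SERVERS: Dict[str, str] = {
    "cs1": "rack1", "cs2": "rack1",
    "cs3": "rack2", "cs4": "rack2",
    "cs5": "rack3", "cs6": "rack3",
}

def split_into_chunks(file_size: int) -> List[Tuple[int, int]]:
    """Return (offset, length) for each contiguous chunk of the file."""
    return [(off, min(CHUNK_SIZE, file_size - off))
            for off in range(0, file_size, CHUNK_SIZE)]

def place_replicas(chunk_index: int) -> List[str]:
    """Pick REPLICAS servers for a chunk, each in a different rack.

    Assumes REPLICAS does not exceed the number of racks.
    """
    racks = sorted(set(SERVERS.values()))
    chosen = []
    for i in range(REPLICAS):
        rack = racks[(chunk_index + i) % len(racks)]
        in_rack = sorted(s for s, r in SERVERS.items() if r == rack)
        chosen.append(in_rack[chunk_index % len(in_rack)])
    return chosen

class NameNode:
    """Holds only metadata: which servers store each chunk of each file."""

    def __init__(self) -> None:
        self.chunk_map: Dict[str, List[List[str]]] = {}

    def add_file(self, name: str, size: int) -> None:
        chunks = split_into_chunks(size)
        self.chunk_map[name] = [place_replicas(i) for i in range(len(chunks))]

    def locate(self, name: str, chunk_index: int) -> List[str]:
        # A client asks here first, then reads from a chunk server directly.
        return self.chunk_map[name][chunk_index]

master = NameNode()
master.add_file("C", 200 * 1024 * 1024)   # a 200 MB file becomes 4 chunks
print(master.locate("C", 0))
```

The name node never touches file contents. It answers "where is chunk i?", and the client then fetches the bytes from one of the listed chunk servers, which keeps the master from becoming a data bottleneck.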
Distributed Architecture (Cluster)

Challenges for IO Cluster Computing

1. Nodes fail
1 in 1000 nodes fail a day
Response: Duplicate Data (Distributed FS)
2. Network is a bottleneck
Typically 1-10 Gb/s throughput
Response: Bring computation to nodes, rather than
data to nodes.
3. Traditional distributed programming is
often ad-hoc and complicated
Response: Stipulate a
programming system that can easily be
distributed
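A word count is a common way to illustrate a programming system that "can easily be distributed". The sketch below runs in a single process and only mimics the map, shuffle, and reduce phases. It does not use Hadoop's actual API, and the function names and sample `chunks` are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

def map_chunk(chunk: str) -> List[Tuple[str, int]]:
    """Map: would run on the node holding the chunk; emits (word, 1) pairs."""
    return [(word.lower(), 1) for word in chunk.split()]

def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group values by key (in a cluster, the step that crosses the network)."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_counts(groups: Dict[str, List[int]]) -> Dict[str, int]:
    """Reduce: combine all values for a key into one result."""
    return {key: sum(values) for key, values in groups.items()}

# Hypothetical chunks of one file, each imagined to live on a different node.
chunks = ["big data is big", "data in motion", "data at rest"]
mapped = [pair for chunk in chunks for pair in map_chunk(chunk)]
word_counts = reduce_counts(shuffle(mapped))
print(word_counts["data"], word_counts["big"])
```

The programmer writes only `map_chunk` and the reduce logic. Because each map call depends on a single chunk, the calls can run where the chunks are stored, which is the "bring computation to nodes" idea.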
Hadoop - Why?

Need to process huge datasets on large
clusters of computers
Very expensive to build reliability into each
application
Nodes fail every day
  Failure is expected, rather than exceptional
  The number of nodes in a cluster is not constant
Need a common infrastructure
  Efficient, reliable, easy to use
  Open Source, Apache License
Who uses Hadoop?
Amazon/A9
Facebook
Google
New York Times
Veoh
Yahoo!
…. many more
Commodity Hardware
[Figure: an aggregation switch connected to rack switches, each serving a rack of nodes]

Typically in 2 level architecture
 Nodes are commodity PCs
 30-40 nodes/rack
 Uplink from rack is 3-4 gigabit
 Rack-internal is 1 gigabit
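These figures imply that traffic leaving a rack is far more constrained than traffic inside it. The sketch below works through a worst case where every node in a rack sends off-rack at the same time. The worst-case scenario and the function names are assumptions for illustration.

```python
def cross_rack_share_gbps(nodes_per_rack: int, uplink_gbps: float) -> float:
    """Per-node bandwidth if every node in a rack sends off-rack at once."""
    return uplink_gbps / nodes_per_rack

def oversubscription(nodes_per_rack: int, node_gbps: float,
                     uplink_gbps: float) -> float:
    """Ratio of total in-rack link capacity to uplink capacity."""
    return nodes_per_rack * node_gbps / uplink_gbps

# 40 nodes at 1 gigabit each, sharing a 4 gigabit uplink.
print(cross_rack_share_gbps(40, 4.0))     # 0.1
print(oversubscription(40, 1.0, 4.0))     # 10.0
```

With 40 nodes on a 4 gigabit uplink, each node gets a tenth of its in-rack bandwidth when talking to other racks, so it pays to keep computation near the data.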
