Chapter 1: Introduction to Big Data
■ 2 years as an IT Instructor
■ Certificates:
2. Machine Learning
■ What is Data?
■ What is Information?
Introduction
■ What is Data?
– It may be stored and transmitted in the form of electrical signals and recorded on magnetic, optical, or mechanical recording media.
– Big Data is also data, but of enormous size and growing exponentially over time.
– It is so large that none of the traditional data management tools can store or process it efficiently.
Introduction
■ What is Information?
– Information is data that has been processed in a meaningful way according to a given requirement.
– Information is processed, structured, or presented in a given context to make it meaningful and useful.
Data Warehouse
■ What is ETL?
– ETL stands for Extract, Transform, and Load: the process of extracting data from source systems, transforming it, and loading it into a data warehouse.
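A minimal sketch of the ETL flow in Python (the file name, field names, and the SQLite target are illustrative assumptions; a real pipeline would load into an actual data warehouse):

```python
import csv
import sqlite3

# Extract: read raw records from an operational export (hypothetical file).
with open("sales_export.csv", newline="") as f:
    rows = list(csv.DictReader(f))

# Transform: clean and reshape the records into the warehouse format.
cleaned = [
    {"order_id": int(r["order_id"]), "amount_usd": round(float(r["amount"]), 2)}
    for r in rows
    if r.get("amount")  # drop rows with a missing amount
]

# Load: write the transformed records into the warehouse table (SQLite stands in here).
conn = sqlite3.connect("warehouse.db")
conn.execute("CREATE TABLE IF NOT EXISTS fact_sales (order_id INTEGER, amount_usd REAL)")
conn.executemany(
    "INSERT INTO fact_sales (order_id, amount_usd) VALUES (:order_id, :amount_usd)",
    cleaned,
)
conn.commit()
conn.close()
```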
Introduction
1 Kilobyte (KB) = 2^10 Bytes = 1,024 Bytes
1 Megabyte (MB) = 10^3 KB
1 Petabyte (PB) = 10^3 TB
1 Exabyte (EB) = 10^3 PB
1 Zettabyte (ZB) = 10^3 EB
Example: a mobile phone holds about 128 GB; a single photo is about 10-15 MB.
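As a quick sanity check on the figures above, a short Python snippet using the slide's own numbers (128 GB phone, 10-15 MB photos):

```python
# Unit sizes following the slide's convention (1 KB = 1,024 bytes, decimal above that).
KB = 1024
MB = 1000 * KB
GB = 1000 * MB

phone_capacity = 128 * GB   # storage of the example phone
photo_size = 15 * MB        # upper end of the 10-15 MB estimate

print(phone_capacity // photo_size)   # about 8,500 photos fit on the phone
```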
Introduction
• By 2025, it's estimated that 463 exabytes of data will be created each day globally – that's the equivalent of 212,765,957 DVDs per day.
➢ Volume:
– The sheer amount of data generated and stored; with Big Data, sizes are far larger than traditional systems are designed to handle.
➢ Velocity:
– Large amounts of data arrive from transactions with a high refresh rate, resulting in data streams coming at great speed, and the time to act on these streams is often very short. There is a shift from batch processing to real-time streaming.
➢ Variety:
– Data comes in many forms (structured, semi-structured, and unstructured) and from both internal and external data sources.
OUTLINE
Introduction
■ Structured Data
Any data that can be stored, accessed, and processed in a fixed format is termed 'structured' data.
■ Unstructured Data
Any data with an unknown form or structure is classified as unstructured data.
■ Semi-structured Data
Semi-structured data can contain elements of both forms. It appears structured in form, but it is not defined with a fixed schema such as a table definition in a relational DBMS (for example, XML or JSON files).
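A small Python sketch contrasting the three forms (the sample records are made up for illustration):

```python
import csv
import io
import json

# Structured: fixed format, every row follows the same column layout (relational-table style).
structured = io.StringIO("id,name,gpa\n1,Alice,3.9\n2,Bob,3.4\n")
for row in csv.DictReader(structured):
    print(row["name"], row["gpa"])

# Semi-structured: self-describing keys, but no fixed table definition (e.g., a JSON record).
semi_structured = '{"id": 3, "name": "Carol", "courses": ["ML", "Databases"]}'
print(json.loads(semi_structured)["courses"])

# Unstructured: no predefined model at all (free text, images, audio, video).
unstructured = "Lecture transcript: today we introduced Big Data and its three Vs..."
print(len(unstructured.split()), "words")
```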
• Example
The University of Alabama has more than 38,000 students and an ocean of data. In the past, when there were no real solutions to analyze that much data, much of it seemed useless. Now administrators can use analytics and data visualization on this data to draw out patterns among students, revolutionizing the university's operations, recruitment, and retention efforts.
Application and use cases of Big Data
Example
• Wearable devices and sensors have been introduced in the healthcare industry that can provide a real-time feed to a patient's electronic health record.
• One example is Apple, which has come up with Apple HealthKit, CareKit, and ResearchKit. The main goal is to empower iPhone users to store and access their real-time health records on their phones.
• Example
The Food and Drug Administration (FDA), which operates under the jurisdiction of the US federal government, uses big data analysis to identify and examine patterns of food-related illnesses and infections.
Application and use cases of Big Data
■ Social Media
Statistics show that 500+ terabytes of new data are ingested into the databases of the social media site Facebook every day. This data is mainly generated from photo and video uploads, message exchanges, posting comments, etc.
Application and use cases of Big Data
■ Weather Patterns
■ Problem—Schema-On-Write:
– Traditional systems require the schema to be designed and the data validated before it is written (schema-on-write). This means that a lot of work must be done before new data sources can be analyzed.
– Example: Suppose a company wants to start analyzing a new source of data from unstructured or semi-
structured sources. A company will usually spend months (3–6 months) designing schemas and so on to store
the data in a data warehouse. That is 3 to 6 months that the company cannot use the data to make business
decisions. Then when the data warehouse design is completed 6 months later, often the data has changed
again. If you look at data structures from social media, they change on a regular basis. The schema-on-write
environment is too slow and rigid to deal with the dynamics of semi-structured and unstructured data
environments that are changing over a period of time.
■ The other problem with unstructured data is that traditional systems usually use large object (LOB) types to handle it, which are often very inconvenient and difficult to work with.
Limitations of traditional large-scale systems
■ Solution—Schema-On-Read:
– Hadoop systems are schema-on-read, which means any data can be written to the storage system
immediately. Data are not validated until they are read. This enables Hadoop systems to load any type of data
and begin analyzing it quickly.
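A minimal sketch of the schema-on-read idea (HDFS is not used here; a local JSON-lines file stands in, and the field names are illustrative):

```python
import json

# Write step: raw events are stored as-is, with no schema validation up front.
raw_events = [
    '{"user": "u1", "action": "click", "ts": 1700000000}',
    '{"user": "u2", "action": "purchase", "amount": 19.99}',  # different fields are fine
    'not even valid JSON',                                    # malformed records are accepted too
]
with open("events.log", "w") as f:
    f.write("\n".join(raw_events))

# Read step: the expected structure is applied only now, at analysis time.
with open("events.log") as f:
    for line in f:
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # handling of bad records is deferred to read time
        print(event.get("user"), event.get("action"))
```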
Limitations of traditional large-scale systems
■ Problem—Cost of Storage: Traditional systems use shared storage. As organizations start to ingest larger volumes of data, shared storage becomes prohibitively expensive.
■ Solution—Local Storage: Hadoop can use the Hadoop Distributed File System (HDFS), a distributed file
system that leverages local disks on commodity servers. Shared storage is about $1.20/GB, whereas local storage is about $0.04/GB. Hadoop's HDFS creates three replicas by default for high availability, so at $0.12 (12 cents) per GB it is still a fraction of the cost of traditional shared storage.
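The cost comparison works out as follows (a quick check using the per-GB prices quoted above):

```python
shared_storage_per_gb = 1.20     # traditional shared storage, USD per GB
local_storage_per_gb = 0.04      # commodity local disk, USD per GB
hdfs_replication_factor = 3      # HDFS keeps three copies by default

effective_hdfs_cost = local_storage_per_gb * hdfs_replication_factor
print(round(effective_hdfs_cost, 2))                       # 0.12 USD/GB, i.e. 12 cents
print(round(shared_storage_per_gb / effective_hdfs_cost))  # shared storage is ~10x the cost
```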
Limitations of traditional large-scale systems
■ Problem—Cost of Proprietary Hardware: Large proprietary hardware solutions can be cost prohibitive when deployed
to process extremely large volumes of data. Organizations are spending millions of dollars in hardware and software
licensing costs while supporting large data environments. Organizations are often growing their hardware in million
dollar increments to handle the increasing data. New technology from traditional vendors that can grow to petabyte scale with good performance is extremely expensive.
■ Problem—Complexity: When you look at any traditional proprietary solution, it is full of extremely complex silos of
system administrators, DBAs, application server teams, storage teams, and network teams. Often there is one DBA
for every 40 to 50 database servers. Anyone running traditional systems knows that complex systems fail in complex
ways.
■ Solution—Simplicity: Because Hadoop uses commodity hardware and follows the “shared-nothing” architecture, it is
a platform that one person can understand very easily. Numerous organizations running Hadoop have one
administrator for every 1,000 data nodes. With commodity hardware, one person can understand the entire
technology stack.
Limitations of traditional large-scale systems
■ Problem—Causation: Because data is so expensive to store in traditional systems, data is filtered and aggregated,
and large volumes are thrown out because of the cost of storage. Minimizing the data to be analyzed reduces the
accuracy and confidence of the results. Not only are the accuracy and confidence of the results affected, but minimizing the data also limits an organization's ability to identify business opportunities. Atomic data can yield more insights into the
data than aggregated data.
■ Solution—Correlation: Because of the relatively low cost of storage of Hadoop, the detailed records are stored in
Hadoop’s storage system HDFS. Traditional data can then be analyzed with non-traditional data in Hadoop to find
correlation points that can provide much higher accuracy of data analysis. We are moving to a world of correlation
because the accuracy and confidence of the results are factors higher than with traditional systems. Organizations are seeing big data as transformational: companies that used to spend weeks or months building predictive models and customer profiles can now build them in a few days. One company had a data load that took 20 hours to complete, which was not ideal; after moving to Hadoop, the same load took 3 hours.
OUTLINE
Introduction
[Figure: slow vs. fast]
How a distributed way of computing is superior (cost and scale)
■ Scale horizontally (scale out): add more commodity machines to the cluster rather than buying a single bigger machine.
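A toy sketch of the divide-and-conquer style that horizontal scaling enables, using Python's standard multiprocessing pool as a stand-in for a cluster of commodity machines (MapReduce-style word count; the sample documents are made up):

```python
from collections import Counter
from multiprocessing import Pool

def count_words(document: str) -> Counter:
    # Map step: each "node" counts words in its own chunk of the data independently.
    return Counter(document.lower().split())

def merge_counts(partials):
    # Reduce step: partial counts from all nodes are merged into one result.
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total

if __name__ == "__main__":
    documents = [
        "big data needs distributed processing",
        "distributed processing scales horizontally",
        "commodity machines keep the cost low",
    ]
    with Pool(processes=3) as pool:   # three workers stand in for three data nodes
        partial_counts = pool.map(count_words, documents)
    print(merge_counts(partial_counts).most_common(3))
```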
OUTLINE
Introduction
■ Big Data is characterized by heterogeneous data sources such as images, videos, and audio.
■ The way Big Data is stored affects not only cost but also analysis and processing. To meet the service and analysis requirements of Big Data, storage that is reliable, high-performance, highly available, and low-cost needs to be developed.
■ Traditional databases and warehouses are unsatisfactory for processing unstructured and semi-structured data. With Big Data, read/write operations are highly concurrent for a large number of users, and as the size of the database increases, existing algorithms may become insufficient.
Opportunities and challenges with Big Data
■ Big Data enables enhanced discovery, access, availability, exploitation, and provisioning of information
within companies and the supply chain. It can enable the discovery of new data sets that are not yet
being used to drive value.
■ Big Data analytics can enhance customer segmentation, allowing for better scalability and mass personalization. It can improve customer service levels, enhance customer acquisition and sales strategies (through web and social), and enable customization of delivery.
Opportunities and challenges with Big Data
■ A wide variety of data streams can aid innovation and product design. These include product usage data, point-of-sale data, field data from devices, customer data, and supplier suggestions that drive product and process innovation.
■ Big Data can reduce long-term costs, increase ability to invest, and improve understanding of cost drivers
and impacts.
References
1. https://www.erpublication.org/published_paper/IJETR042630.pdf
2. https://www.pearsonitcertification.com/articles/article.aspx?p=2427073&seqNum=2
3. https://intellipaat.com/blog/7-big-data-examples-application-of-big-data-in-real-life/
Thank you