Data Analytics: HDFS with Big Data: Issues and Application
A Lecture Series on Data Analytics
Dr. Chitra A. Dhawale
P.R. Pote College of Engg. and Mgmt.
Data Analytics (MCA19304)
COURSE OUTCOMES
AT THE END OF THE COURSE THE STUDENT SHOULD BE ABLE TO:
1. DEVELOP AND MAINTAIN RELIABLE, SCALABLE SYSTEMS USING APACHE HADOOP
2. WRITE MAPREDUCE-BASED APPLICATIONS
3. DIFFERENTIATE BETWEEN CONVENTIONAL SQL AND NOSQL
4. ANALYZE AND DEVELOP BIG DATA SOLUTIONS USING HIVE AND PIG
Data Analytics (MCA19304)
UNIT I
• DISTRIBUTED FILE SYSTEM AND ITS ISSUES
• INTRODUCTION TO BIG DATA
• BIG DATA CHARACTERISTICS
• TYPES OF BIG DATA
• TRADITIONAL VS. BIG DATA APPROACH
• BIG DATA APPLICATIONS
Distributed file system and its issues
• A single machine with 4 hard disks (4 I/O channels), each reading at about 100 MB/s, holds 1 TB of data. Reading and processing all of it takes about 45 mins.
• For faster processing:
• Divide the data and store it on 5 machines with the same configuration as above. Assuming all machines process their share in parallel, it will take about 45/5 = 9 mins.
• Processing will be about 5 times faster than on a single machine.
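The arithmetic behind this example can be checked with a short sketch. It assumes that each of the 4 I/O channels sustains 100 MB/s and that 1 TB is 1,000,000 MB; both figures are assumptions made for illustration. Under them a single machine needs roughly 42 minutes, which the slide rounds to 45.

```python
# Back-of-the-envelope read time for the example above.
# Assumptions (not measured values): 100 MB/s per I/O channel, 1 TB = 1,000,000 MB.
DATA_MB = 1_000_000
CHANNELS_PER_MACHINE = 4
MB_PER_SEC_PER_CHANNEL = 100


def read_minutes(machines: int) -> float:
    """Minutes to read the whole data set when it is split evenly across machines."""
    throughput = machines * CHANNELS_PER_MACHINE * MB_PER_SEC_PER_CHANNEL
    return DATA_MB / throughput / 60


single = read_minutes(1)
cluster = read_minutes(5)
print(f"1 machine : {single:.1f} min")
print(f"5 machines: {cluster:.1f} min")
print(f"speed-up  : {single / cluster:.0f}x")
```

The speed-up is exactly 5 only because the sketch ignores network transfer and coordination costs, which a real cluster would add.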
Distributed file system and its issues
Each machine has its own local (physical) file system, where you store data, i.e. create folders, subfolders and so on.
A distributed file system is not physical; it is a virtual or logical file system.
Hadoop uses a DFS.
DFS libraries are installed on every machine and run as a separate process on each one.
Together these processes create a virtual layer over the physical file systems beneath them.
This virtual layer is called the distributed file system.
Distributed File System
Distributed file system and its issues
• A virtual file system is software, i.e. a set of programs that exposes a set of commands.
• Ex. dfs -copy <source file> <destination file>
• dfs -copy file1 file2 (in HDFS the equivalent command is hdfs dfs -cp file1 file2)
• This reads file1, which is distributed over 5 machines, say (A, B, C, D, E). The user has no idea where each part of the file is.
• The path is a virtual path; it does not exist as such on any single machine.
• DFS implementations such as Hadoop's follow a master-slave architecture.
DFS
Master Machine
Slave Machines
• The upper machine is the master and the lower 5 are slaves.
• Data is split and stored on the slave machines.
• The master does not store any file data. It stores only metadata.
• The master knows how each file is divided into blocks (file-to-block mapping) and on which slave each block is stored (block-to-slave mapping).
• Data can be located only via the master, as only the master knows the actual location of the data on each slave.
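The two mappings kept by the master can be sketched in a few lines. This is a toy model, not HDFS code: the class names `Master` and `Slave`, the helper functions, the path `/data/file1` and the 4-byte block size are all hypothetical and chosen only to keep the output readable.

```python
from dataclasses import dataclass, field


@dataclass
class Slave:
    """Stores the actual block contents."""
    name: str
    blocks: dict[str, bytes] = field(default_factory=dict)


@dataclass
class Master:
    """Stores metadata only; no file contents live here."""
    file_to_blocks: dict[str, list[str]] = field(default_factory=dict)
    block_to_slave: dict[str, str] = field(default_factory=dict)

    def locate(self, path: str) -> list[tuple[str, str]]:
        """Return (block_id, slave_name) pairs for a virtual path."""
        return [(b, self.block_to_slave[b]) for b in self.file_to_blocks[path]]


def write_file(master: Master, slaves: list[Slave], path: str,
               data: bytes, block_size: int = 4) -> None:
    """Split data into blocks and spread them round-robin over the slaves."""
    block_ids = []
    for offset in range(0, len(data), block_size):
        index = offset // block_size
        block_id = f"{path}#blk{index}"
        slave = slaves[index % len(slaves)]
        slave.blocks[block_id] = data[offset:offset + block_size]
        master.block_to_slave[block_id] = slave.name
        block_ids.append(block_id)
    master.file_to_blocks[path] = block_ids


def read_file(master: Master, slaves: list[Slave], path: str) -> bytes:
    """Ask the master where the blocks are, then fetch them from the slaves."""
    by_name = {s.name: s for s in slaves}
    return b"".join(by_name[name].blocks[blk] for blk, name in master.locate(path))


slaves = [Slave(n) for n in "ABCDE"]
master = Master()
write_file(master, slaves, "/data/file1", b"hello distributed world")
print(master.locate("/data/file1"))
print(read_file(master, slaves, "/data/file1"))
```

The client only ever names the virtual path; which slave holds which block is resolved through the master's metadata.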
HDFS
• While reading data, if any node fails, the client may get only partial data.
• To overcome this, a replication factor is set when HDFS is configured. A replication factor of 2 means that every block is stored in two places, i.e. 2 copies are maintained for each block.
• If one node fails, the block can be read from another node. Data is transmitted to the machine (server) where the program is running.
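A minimal sketch of how replication lets a read survive a node failure is given below. The round-robin placement and the function names `place_replicas` and `read_block` are illustrative assumptions; placement in a real HDFS cluster is more involved.

```python
REPLICATION_FACTOR = 2  # as in the example above


def place_replicas(block_ids, nodes, replication=REPLICATION_FACTOR):
    """Assign each block to `replication` distinct nodes, round-robin.

    Assumes replication <= len(nodes), otherwise copies would share a node.
    """
    placement = {}
    for i, block in enumerate(block_ids):
        placement[block] = [nodes[(i + r) % len(nodes)] for r in range(replication)]
    return placement


def read_block(block, placement, live_nodes):
    """Return the first live node holding the block; fail if every replica is down."""
    for node in placement[block]:
        if node in live_nodes:
            return node
    raise IOError(f"all replicas of {block} are unavailable")


nodes = ["A", "B", "C", "D", "E"]
placement = place_replicas(["blk0", "blk1", "blk2"], nodes)
live = set(nodes) - {"A"}  # node A has failed
print(placement)
print(read_block("blk0", placement, live))
```

With a replication factor of 2 a block becomes unreadable only when both nodes holding it are down at the same time.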
Features of DFS
Transparency :
• Structure transparency –
The client does not need to know the number or locations of the file servers and storage devices.
• Access transparency –
Local and remote files should be accessible in the same manner.
• Naming transparency –
Once a file has been given a name, the name should not change when the file is transferred from one node to another.
Features of DFS
• Replication transparency –
If a file is copied on multiple nodes, both the copies of the file and their locations should be hidden from the clients.
• User mobility :
The user’s home directory is automatically made available on the node where the user logs in.
• Performance :
Performance is measured as the average amount of time needed to satisfy client requests.
This time covers the CPU time + the time taken to access secondary storage + the network access time.
Features of DFS
• Simplicity and ease of use :
The user interface of a file system should be simple and the number of commands should be small.
• High availability :
A distributed file system should be able to continue working in case of partial failures such as a link failure, a node failure, or a storage drive crash.
A highly available and adaptable distributed file system should have different and independent file servers controlling different and independent storage devices.
• Scalability :
Since growing the network by adding new machines or joining two networks together is routine, the distributed system will inevitably grow over time. As a result, a good distributed file system should be built to scale quickly as the number of nodes and users in the system grows. Service should not be substantially disrupted as the number of nodes and users grows.
Features of DFS
• High reliability :
A file system should create backup copies of key files that can be used if the originals are lost.
Many file systems employ stable storage as a high-reliability strategy.
• Data integrity :
Multiple users frequently share a file system.
The integrity of data saved in a shared file must be guaranteed by the file system.
That is, concurrent access requests from many users who are competing for access to the same file must be correctly synchronized using a concurrency control method.
Atomic transactions are a high-level concurrency management mechanism for data integrity that is frequently offered to users by a file system.
Features of DFS
• Security :
To safeguard the information contained in the file system from unwanted and unauthorized access, security mechanisms must be implemented.
A distributed file system should be secure so that its users may trust that their data will be kept private.
• Heterogeneity :
Users of heterogeneous distributed systems have the option of using multiple computer platforms for different purposes.
Issues with DFS
• In a distributed file system, both nodes and connections need to be secured, so security is at stake.
• Messages and data may be lost in the network while moving from one node to another.
• Database connection in a distributed file system is complicated.
• Handling a database is also harder in a distributed file system than in a single-user system.
• Overloading may occur if all nodes try to send data at once.
Factors- Big Data Generation
• Evolution of Technology
• IoT
• Social Media
• Others
What is Big Data?
Characteristics – Big Data
FIVE V’S OF BIG DATA :
1. VOLUME
2. VARIETY
3. VELOCITY
4. VALUE
5. VERACITY
Characteristics of Big Data at a glance
Types of Big Data
• Structured
Structured data includes all data that can be stored in tabular form, i.e. in rows and columns.
Data held in relational databases is an example of structured data.
It is easy to query and make sense of.
Most modern computer systems can process structured data directly.
Types of Big Data
Unstructured
• Unstructured data refers to data that lacks any specific form or structure.
• It cannot be stored in a spreadsheet or fitted into the tables of a relational database.
• Examples of unstructured data include audio, video, and other kinds of data that make up a large share of big data today. The free-text body of an email is another example.
Types of Big Data
Semi-structured
• Semi-structured data has elements of both structured and unstructured data.
• Such data sets carry some structure, for example tags or key-value pairs, but do not follow a fixed tabular schema, so it may not be possible to sort or process them directly with relational tools.
• This type of data includes XML data, JSON files, and others.
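A small sketch shows why JSON counts as semi-structured. The two records below are made-up examples: each carries its own field names, but the records do not share one fixed set of columns, so flattening them into a table leaves gaps.

```python
import json

# Hypothetical records: both are valid JSON, but their fields differ.
records = [
    '{"id": 1, "name": "Asha", "email": "asha@example.com"}',
    '{"id": 2, "name": "Ravi", "tags": ["sensor", "iot"]}',
]

parsed = [json.loads(r) for r in records]

# Build the union of all field names, then lay each record out as a table row.
columns = sorted({key for rec in parsed for key in rec})
rows = [[rec.get(col) for col in columns] for rec in parsed]

print(columns)
for row in rows:
    print(row)
```

Missing fields appear as `None`, and the `tags` field holds a nested list, neither of which fits cleanly into a strict relational column.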
Traditional Vs. Big Data
• 1. Traditional data
• Traditional data is the structured data that is maintained by all types of businesses, from very small to large organizations.
• In a traditional database system, a centralized database architecture is used to store and maintain the data in a fixed format or in fixed fields in a file.
• Structured Query Language (SQL) is used for managing and accessing the data.
• 2. Big data
• Big data deals with data sets that are too large or complex to manage with traditional data-processing application software.
• It deals with large volumes of structured, semi-structured and unstructured data, and is characterized by volume, velocity, variety, veracity and value.
• Big data refers not only to a large amount of data but also to extracting meaningful information by analyzing huge and complex data sets.
S.No. | TRADITIONAL DATA | BIG DATA
01. | Traditional data is generated at the enterprise level. | Big data is generated both outside and at the enterprise level.
02. | Its volume ranges from Gigabytes to Terabytes. | Its volume ranges from Petabytes to Exabytes or Zettabytes.
03. | A traditional database system deals with structured data. | A big data system deals with structured, semi-structured and unstructured data.
04. | Traditional data is generated per hour, per day or less often. | Big data is generated much more frequently, mainly per second.
05. | The traditional data source is centralized and is managed in centralized form. | The big data source is distributed and is managed in distributed form.
06. | Data integration is very easy. | Data integration is very difficult.
07. | A normal system configuration is capable of processing traditional data. | A high system configuration is required to process big data.
08. | The size of the data is comparatively small. | The size is much larger than that of traditional data.
09. | Traditional database tools are sufficient to perform any database operation. | Special kinds of database tools are required to perform any database operation.
10. | Normal functions can manipulate the data. | Special kinds of functions are needed to manipulate the data.
11. | Its data model is strict schema based and static. | Its data model is flat schema based and dynamic.
12. | Traditional data is stable and its inter-relationships are known. | Big data is not stable and its relationships are unknown.
13. | Traditional data is of manageable volume. | Big data is of huge volume, which becomes unmanageable.
14. | It is easy to manage and manipulate the data. | It is difficult to manage and manipulate the data.
15. | Its data sources include ERP transaction data, CRM transaction data, financial data, organizational data, web transaction data etc. | Its data sources include social media, device data, sensor data, video, images, audio etc.
Applications of Big Data
• Big data in retail
• Big data in healthcare
• Big data in education
• Big data in e-commerce
• Big data in media and entertainment
• Big data in finance
• Big data in travel industry
• Big data in telecom
• Big data in automobile
Applications of Big Data
Ad

More Related Content

What's hot (20)

Cp 121 lecture 01
Cp 121 lecture 01Cp 121 lecture 01
Cp 121 lecture 01
ITNet
 
Introduction & history of dbms
Introduction & history of dbmsIntroduction & history of dbms
Introduction & history of dbms
sethu pm
 
Distributed Database
Distributed DatabaseDistributed Database
Distributed Database
Amity University | FMS - DU | IMT | Stratford University | KKMI International Institute | AIMA | DTU
 
Trends in the Database
Trends in the DatabaseTrends in the Database
Trends in the Database
Marlon Jamera
 
Chapter 01 Fundamental of Database Management System (DBMS)
Chapter 01  Fundamental of Database Management System (DBMS)Chapter 01  Fundamental of Database Management System (DBMS)
Chapter 01 Fundamental of Database Management System (DBMS)
Abdurehman Mahmud
 
CS3270 - DATABASE SYSTEM - Lecture (1)
CS3270 - DATABASE SYSTEM -  Lecture (1)CS3270 - DATABASE SYSTEM -  Lecture (1)
CS3270 - DATABASE SYSTEM - Lecture (1)
Dilawar Khan
 
Database assignment
Database assignmentDatabase assignment
Database assignment
HudiKhatib
 
Database Management Systems - Management Information System
Database Management Systems - Management Information SystemDatabase Management Systems - Management Information System
Database Management Systems - Management Information System
Nijaz N
 
1 introduction ddbms
1 introduction ddbms1 introduction ddbms
1 introduction ddbms
amna izzat
 
Database System Architectures
Database System ArchitecturesDatabase System Architectures
Database System Architectures
Information Technology
 
SULTHAN's ICT-2 for UG courses
SULTHAN's ICT-2 for UG coursesSULTHAN's ICT-2 for UG courses
SULTHAN's ICT-2 for UG courses
SULTHAN BASHA
 
Distributed web based systems
Distributed web based systemsDistributed web based systems
Distributed web based systems
Reza Gh
 
Database Systems
Database SystemsDatabase Systems
Database Systems
Sheikh Hussnain
 
Emerging database technology multimedia database
Emerging database technology   multimedia databaseEmerging database technology   multimedia database
Emerging database technology multimedia database
Salama Al Busaidi
 
Lecture 3 multimedia databases
Lecture 3   multimedia databasesLecture 3   multimedia databases
Lecture 3 multimedia databases
Ranjana N Jinde
 
Trends in Database Management
Trends in Database ManagementTrends in Database Management
Trends in Database Management
Marlon Jamera
 
Introduction to Data Management and Sharing
Introduction to Data Management and SharingIntroduction to Data Management and Sharing
Introduction to Data Management and Sharing
Columbia Unviersity Scholarly Communication Program
 
Multimedia Database
Multimedia DatabaseMultimedia Database
Multimedia Database
shaikh2016
 
DBMS FOR STUDENTS MUST DOWNLOAD AND READ
DBMS FOR STUDENTS MUST DOWNLOAD AND READDBMS FOR STUDENTS MUST DOWNLOAD AND READ
DBMS FOR STUDENTS MUST DOWNLOAD AND READ
amitp26
 
Database systems
Database systemsDatabase systems
Database systems
NazmulHossen5
 
Cp 121 lecture 01
Cp 121 lecture 01Cp 121 lecture 01
Cp 121 lecture 01
ITNet
 
Introduction & history of dbms
Introduction & history of dbmsIntroduction & history of dbms
Introduction & history of dbms
sethu pm
 
Trends in the Database
Trends in the DatabaseTrends in the Database
Trends in the Database
Marlon Jamera
 
Chapter 01 Fundamental of Database Management System (DBMS)
Chapter 01  Fundamental of Database Management System (DBMS)Chapter 01  Fundamental of Database Management System (DBMS)
Chapter 01 Fundamental of Database Management System (DBMS)
Abdurehman Mahmud
 
CS3270 - DATABASE SYSTEM - Lecture (1)
CS3270 - DATABASE SYSTEM -  Lecture (1)CS3270 - DATABASE SYSTEM -  Lecture (1)
CS3270 - DATABASE SYSTEM - Lecture (1)
Dilawar Khan
 
Database assignment
Database assignmentDatabase assignment
Database assignment
HudiKhatib
 
Database Management Systems - Management Information System
Database Management Systems - Management Information SystemDatabase Management Systems - Management Information System
Database Management Systems - Management Information System
Nijaz N
 
1 introduction ddbms
1 introduction ddbms1 introduction ddbms
1 introduction ddbms
amna izzat
 
SULTHAN's ICT-2 for UG courses
SULTHAN's ICT-2 for UG coursesSULTHAN's ICT-2 for UG courses
SULTHAN's ICT-2 for UG courses
SULTHAN BASHA
 
Distributed web based systems
Distributed web based systemsDistributed web based systems
Distributed web based systems
Reza Gh
 
Emerging database technology multimedia database
Emerging database technology   multimedia databaseEmerging database technology   multimedia database
Emerging database technology multimedia database
Salama Al Busaidi
 
Lecture 3 multimedia databases
Lecture 3   multimedia databasesLecture 3   multimedia databases
Lecture 3 multimedia databases
Ranjana N Jinde
 
Trends in Database Management
Trends in Database ManagementTrends in Database Management
Trends in Database Management
Marlon Jamera
 
Multimedia Database
Multimedia DatabaseMultimedia Database
Multimedia Database
shaikh2016
 
DBMS FOR STUDENTS MUST DOWNLOAD AND READ
DBMS FOR STUDENTS MUST DOWNLOAD AND READDBMS FOR STUDENTS MUST DOWNLOAD AND READ
DBMS FOR STUDENTS MUST DOWNLOAD AND READ
amitp26
 

Similar to Data Analytics: HDFS with Big Data : Issues and Application (20)

UNIT 5- Other Databases.pdf
UNIT 5- Other Databases.pdfUNIT 5- Other Databases.pdf
UNIT 5- Other Databases.pdf
ShitalGhotekar
 
Unit-1 Introduction to Big Data.pptx
Unit-1 Introduction to Big Data.pptxUnit-1 Introduction to Big Data.pptx
Unit-1 Introduction to Big Data.pptx
AnkitChauhan817826
 
Resources security and protection Distributed operating system
Resources security and protection Distributed operating systemResources security and protection Distributed operating system
Resources security and protection Distributed operating system
jeyashri337
 
Introduction to Data Storage and Cloud Computing
Introduction to Data Storage and Cloud ComputingIntroduction to Data Storage and Cloud Computing
Introduction to Data Storage and Cloud Computing
Rutuja751147
 
advanced database management system by uni
advanced database management system by uniadvanced database management system by uni
advanced database management system by uni
VaibhavSrivastav52
 
chapter-1Introduction to DS,Issues and Architecture.pptx
chapter-1Introduction to DS,Issues and Architecture.pptxchapter-1Introduction to DS,Issues and Architecture.pptx
chapter-1Introduction to DS,Issues and Architecture.pptx
ARULMURUGANRAMU1
 
Database Management system intro.pptx
Database  Management  system  intro.pptxDatabase  Management  system  intro.pptx
Database Management system intro.pptx
sivamathi12
 
Chapter2.pdf
Chapter2.pdfChapter2.pdf
Chapter2.pdf
WasyihunSema2
 
Database-Management-System - Topic Data Models
Database-Management-System - Topic Data ModelsDatabase-Management-System - Topic Data Models
Database-Management-System - Topic Data Models
MadhavSilwal1
 
DBMS basics and normalizations unit.pptx
DBMS basics and normalizations unit.pptxDBMS basics and normalizations unit.pptx
DBMS basics and normalizations unit.pptx
shreyassoni7
 
FILE SYSTEM VS DBMS ppt.pptx
FILE SYSTEM VS DBMS ppt.pptxFILE SYSTEM VS DBMS ppt.pptx
FILE SYSTEM VS DBMS ppt.pptx
SakshiRawat394090
 
An Introduction to Database systems.pptx
An Introduction to Database systems.pptxAn Introduction to Database systems.pptx
An Introduction to Database systems.pptx
niqqaanonymous211
 
Database Systems Lec 1.pptx
Database Systems Lec 1.pptxDatabase Systems Lec 1.pptx
Database Systems Lec 1.pptx
NishaTariq1
 
HDFS_architecture.ppt
HDFS_architecture.pptHDFS_architecture.ppt
HDFS_architecture.ppt
vijayapraba1
 
System Analysis And Design
System Analysis And DesignSystem Analysis And Design
System Analysis And Design
Lijo Stalin
 
Overview of Big Data by Sunny
Overview of Big Data by SunnyOverview of Big Data by Sunny
Overview of Big Data by Sunny
DignitasDigital1
 
DBMS.pptx
DBMS.pptxDBMS.pptx
DBMS.pptx
SIMNchannel
 
Distributed Storage in advanced database.pptx
Distributed Storage in advanced database.pptxDistributed Storage in advanced database.pptx
Distributed Storage in advanced database.pptx
rojansebastian1
 
Chapter-5-DFS.ppt
Chapter-5-DFS.pptChapter-5-DFS.ppt
Chapter-5-DFS.ppt
rameshwarchintamani
 
Santosh Kumar Meher(2105040008) DISTRIBUTED DATABASE.pptx
Santosh Kumar Meher(2105040008) DISTRIBUTED DATABASE.pptxSantosh Kumar Meher(2105040008) DISTRIBUTED DATABASE.pptx
Santosh Kumar Meher(2105040008) DISTRIBUTED DATABASE.pptx
SANTOSH KUMAR MEHER
 
UNIT 5- Other Databases.pdf
UNIT 5- Other Databases.pdfUNIT 5- Other Databases.pdf
UNIT 5- Other Databases.pdf
ShitalGhotekar
 
Unit-1 Introduction to Big Data.pptx
Unit-1 Introduction to Big Data.pptxUnit-1 Introduction to Big Data.pptx
Unit-1 Introduction to Big Data.pptx
AnkitChauhan817826
 
Resources security and protection Distributed operating system
Resources security and protection Distributed operating systemResources security and protection Distributed operating system
Resources security and protection Distributed operating system
jeyashri337
 
Introduction to Data Storage and Cloud Computing
Introduction to Data Storage and Cloud ComputingIntroduction to Data Storage and Cloud Computing
Introduction to Data Storage and Cloud Computing
Rutuja751147
 
advanced database management system by uni
advanced database management system by uniadvanced database management system by uni
advanced database management system by uni
VaibhavSrivastav52
 
chapter-1Introduction to DS,Issues and Architecture.pptx
chapter-1Introduction to DS,Issues and Architecture.pptxchapter-1Introduction to DS,Issues and Architecture.pptx
chapter-1Introduction to DS,Issues and Architecture.pptx
ARULMURUGANRAMU1
 
Database Management system intro.pptx
Database  Management  system  intro.pptxDatabase  Management  system  intro.pptx
Database Management system intro.pptx
sivamathi12
 
Database-Management-System - Topic Data Models
Database-Management-System - Topic Data ModelsDatabase-Management-System - Topic Data Models
Database-Management-System - Topic Data Models
MadhavSilwal1
 
DBMS basics and normalizations unit.pptx
DBMS basics and normalizations unit.pptxDBMS basics and normalizations unit.pptx
DBMS basics and normalizations unit.pptx
shreyassoni7
 
FILE SYSTEM VS DBMS ppt.pptx
FILE SYSTEM VS DBMS ppt.pptxFILE SYSTEM VS DBMS ppt.pptx
FILE SYSTEM VS DBMS ppt.pptx
SakshiRawat394090
 
An Introduction to Database systems.pptx
An Introduction to Database systems.pptxAn Introduction to Database systems.pptx
An Introduction to Database systems.pptx
niqqaanonymous211
 
Database Systems Lec 1.pptx
Database Systems Lec 1.pptxDatabase Systems Lec 1.pptx
Database Systems Lec 1.pptx
NishaTariq1
 
HDFS_architecture.ppt
HDFS_architecture.pptHDFS_architecture.ppt
HDFS_architecture.ppt
vijayapraba1
 
System Analysis And Design
System Analysis And DesignSystem Analysis And Design
System Analysis And Design
Lijo Stalin
 
Overview of Big Data by Sunny
Overview of Big Data by SunnyOverview of Big Data by Sunny
Overview of Big Data by Sunny
DignitasDigital1
 
Distributed Storage in advanced database.pptx
Distributed Storage in advanced database.pptxDistributed Storage in advanced database.pptx
Distributed Storage in advanced database.pptx
rojansebastian1
 
Santosh Kumar Meher(2105040008) DISTRIBUTED DATABASE.pptx
Santosh Kumar Meher(2105040008) DISTRIBUTED DATABASE.pptxSantosh Kumar Meher(2105040008) DISTRIBUTED DATABASE.pptx
Santosh Kumar Meher(2105040008) DISTRIBUTED DATABASE.pptx
SANTOSH KUMAR MEHER
 
Ad

Recently uploaded (20)

Process Mining at Dimension Data - Jan vermeulen
Process Mining at Dimension Data - Jan vermeulenProcess Mining at Dimension Data - Jan vermeulen
Process Mining at Dimension Data - Jan vermeulen
Process mining Evangelist
 
L1_Slides_Foundational Concepts_508.pptx
L1_Slides_Foundational Concepts_508.pptxL1_Slides_Foundational Concepts_508.pptx
L1_Slides_Foundational Concepts_508.pptx
38NoopurPatel
 
Chapter-3-PROBLEM-SOLVING.pdf hhhhhhhhhh
Chapter-3-PROBLEM-SOLVING.pdf hhhhhhhhhhChapter-3-PROBLEM-SOLVING.pdf hhhhhhhhhh
Chapter-3-PROBLEM-SOLVING.pdf hhhhhhhhhh
ChrisjohnAlfiler
 
problem solving.presentation slideshow bsc nursing
problem solving.presentation slideshow bsc nursingproblem solving.presentation slideshow bsc nursing
problem solving.presentation slideshow bsc nursing
vishnudathas123
 
Process Mining and Official Statistics - CBS
Process Mining and Official Statistics - CBSProcess Mining and Official Statistics - CBS
Process Mining and Official Statistics - CBS
Process mining Evangelist
 
50_questions_full.pptxdddddddddddddddddd
50_questions_full.pptxdddddddddddddddddd50_questions_full.pptxdddddddddddddddddd
50_questions_full.pptxdddddddddddddddddd
emir73065
 
GenAI for Quant Analytics: survey-analytics.ai
GenAI for Quant Analytics: survey-analytics.aiGenAI for Quant Analytics: survey-analytics.ai
GenAI for Quant Analytics: survey-analytics.ai
Inspirient
 
FPET_Implementation_2_MA to 360 Engage Direct.pptx
FPET_Implementation_2_MA to 360 Engage Direct.pptxFPET_Implementation_2_MA to 360 Engage Direct.pptx
FPET_Implementation_2_MA to 360 Engage Direct.pptx
ssuser4ef83d
 
新西兰文凭奥克兰理工大学毕业证书AUT成绩单补办
新西兰文凭奥克兰理工大学毕业证书AUT成绩单补办新西兰文凭奥克兰理工大学毕业证书AUT成绩单补办
新西兰文凭奥克兰理工大学毕业证书AUT成绩单补办
Taqyea
 
real illuminati Uganda agent 0782561496/0756664682
real illuminati Uganda agent 0782561496/0756664682real illuminati Uganda agent 0782561496/0756664682
real illuminati Uganda agent 0782561496/0756664682
way to join real illuminati Agent In Kampala Call/WhatsApp+256782561496/0756664682
 
Collibra DQ Installation setup and debug
Collibra DQ Installation setup and debugCollibra DQ Installation setup and debug
Collibra DQ Installation setup and debug
karthikprince20
 
Lagos School of Programming Final Project Updated.pdf
Lagos School of Programming Final Project Updated.pdfLagos School of Programming Final Project Updated.pdf
Lagos School of Programming Final Project Updated.pdf
benuju2016
 
4. Multivariable statistics_Using Stata_2025.pdf
4. Multivariable statistics_Using Stata_2025.pdf4. Multivariable statistics_Using Stata_2025.pdf
4. Multivariable statistics_Using Stata_2025.pdf
axonneurologycenter1
 
chapter3 Central Tendency statistics.ppt
chapter3 Central Tendency statistics.pptchapter3 Central Tendency statistics.ppt
chapter3 Central Tendency statistics.ppt
justinebandajbn
 
Suncorp - Integrating Process Mining at Australia's Largest Insurer
Suncorp - Integrating Process Mining at Australia's Largest InsurerSuncorp - Integrating Process Mining at Australia's Largest Insurer
Suncorp - Integrating Process Mining at Australia's Largest Insurer
Process mining Evangelist
 
Deloitte Analytics - Applying Process Mining in an audit context
Deloitte Analytics - Applying Process Mining in an audit contextDeloitte Analytics - Applying Process Mining in an audit context
Deloitte Analytics - Applying Process Mining in an audit context
Process mining Evangelist
 
spssworksho9035530-lva1-app6891 (1).pptx
spssworksho9035530-lva1-app6891 (1).pptxspssworksho9035530-lva1-app6891 (1).pptx
spssworksho9035530-lva1-app6891 (1).pptx
clarkraal
 
indonesia-gen-z-report-2024 Gen Z (born between 1997 and 2012) is currently t...
indonesia-gen-z-report-2024 Gen Z (born between 1997 and 2012) is currently t...indonesia-gen-z-report-2024 Gen Z (born between 1997 and 2012) is currently t...
indonesia-gen-z-report-2024 Gen Z (born between 1997 and 2012) is currently t...
disnakertransjabarda
 
RAG Chatbot using AWS Bedrock and Streamlit Framework
RAG Chatbot using AWS Bedrock and Streamlit FrameworkRAG Chatbot using AWS Bedrock and Streamlit Framework
RAG Chatbot using AWS Bedrock and Streamlit Framework
apanneer
 
Customer Segmentation using K-Means clustering
Customer Segmentation using K-Means clusteringCustomer Segmentation using K-Means clustering
Customer Segmentation using K-Means clustering
Ingrid Nyakerario
 
Process Mining at Dimension Data - Jan vermeulen
Process Mining at Dimension Data - Jan vermeulenProcess Mining at Dimension Data - Jan vermeulen
Process Mining at Dimension Data - Jan vermeulen
Process mining Evangelist
 
L1_Slides_Foundational Concepts_508.pptx
L1_Slides_Foundational Concepts_508.pptxL1_Slides_Foundational Concepts_508.pptx
L1_Slides_Foundational Concepts_508.pptx
38NoopurPatel
 
Chapter-3-PROBLEM-SOLVING.pdf hhhhhhhhhh
Chapter-3-PROBLEM-SOLVING.pdf hhhhhhhhhhChapter-3-PROBLEM-SOLVING.pdf hhhhhhhhhh
Chapter-3-PROBLEM-SOLVING.pdf hhhhhhhhhh
ChrisjohnAlfiler
 
problem solving.presentation slideshow bsc nursing
problem solving.presentation slideshow bsc nursingproblem solving.presentation slideshow bsc nursing
problem solving.presentation slideshow bsc nursing
vishnudathas123
 
Process Mining and Official Statistics - CBS
Process Mining and Official Statistics - CBSProcess Mining and Official Statistics - CBS
Process Mining and Official Statistics - CBS
Process mining Evangelist
 
50_questions_full.pptxdddddddddddddddddd
50_questions_full.pptxdddddddddddddddddd50_questions_full.pptxdddddddddddddddddd
50_questions_full.pptxdddddddddddddddddd
emir73065
 
GenAI for Quant Analytics: survey-analytics.ai
GenAI for Quant Analytics: survey-analytics.aiGenAI for Quant Analytics: survey-analytics.ai
GenAI for Quant Analytics: survey-analytics.ai
Inspirient
 
FPET_Implementation_2_MA to 360 Engage Direct.pptx
FPET_Implementation_2_MA to 360 Engage Direct.pptxFPET_Implementation_2_MA to 360 Engage Direct.pptx
FPET_Implementation_2_MA to 360 Engage Direct.pptx
ssuser4ef83d
 
新西兰文凭奥克兰理工大学毕业证书AUT成绩单补办
新西兰文凭奥克兰理工大学毕业证书AUT成绩单补办新西兰文凭奥克兰理工大学毕业证书AUT成绩单补办
新西兰文凭奥克兰理工大学毕业证书AUT成绩单补办
Taqyea
 
Collibra DQ Installation setup and debug
Collibra DQ Installation setup and debugCollibra DQ Installation setup and debug
Collibra DQ Installation setup and debug
karthikprince20
 
Lagos School of Programming Final Project Updated.pdf
Lagos School of Programming Final Project Updated.pdfLagos School of Programming Final Project Updated.pdf
Lagos School of Programming Final Project Updated.pdf
benuju2016
 
4. Multivariable statistics_Using Stata_2025.pdf
4. Multivariable statistics_Using Stata_2025.pdf4. Multivariable statistics_Using Stata_2025.pdf
4. Multivariable statistics_Using Stata_2025.pdf
axonneurologycenter1
 
chapter3 Central Tendency statistics.ppt
chapter3 Central Tendency statistics.pptchapter3 Central Tendency statistics.ppt
chapter3 Central Tendency statistics.ppt
justinebandajbn
 
Suncorp - Integrating Process Mining at Australia's Largest Insurer
Suncorp - Integrating Process Mining at Australia's Largest InsurerSuncorp - Integrating Process Mining at Australia's Largest Insurer
Suncorp - Integrating Process Mining at Australia's Largest Insurer
Process mining Evangelist
 
Deloitte Analytics - Applying Process Mining in an audit context
Deloitte Analytics - Applying Process Mining in an audit contextDeloitte Analytics - Applying Process Mining in an audit context
Deloitte Analytics - Applying Process Mining in an audit context
Process mining Evangelist
 
spssworksho9035530-lva1-app6891 (1).pptx
spssworksho9035530-lva1-app6891 (1).pptxspssworksho9035530-lva1-app6891 (1).pptx
spssworksho9035530-lva1-app6891 (1).pptx
clarkraal
 
indonesia-gen-z-report-2024 Gen Z (born between 1997 and 2012) is currently t...
indonesia-gen-z-report-2024 Gen Z (born between 1997 and 2012) is currently t...indonesia-gen-z-report-2024 Gen Z (born between 1997 and 2012) is currently t...
indonesia-gen-z-report-2024 Gen Z (born between 1997 and 2012) is currently t...
disnakertransjabarda
 
RAG Chatbot using AWS Bedrock and Streamlit Framework
RAG Chatbot using AWS Bedrock and Streamlit FrameworkRAG Chatbot using AWS Bedrock and Streamlit Framework
RAG Chatbot using AWS Bedrock and Streamlit Framework
apanneer
 
Customer Segmentation using K-Means clustering
Customer Segmentation using K-Means clusteringCustomer Segmentation using K-Means clustering
Customer Segmentation using K-Means clustering
Ingrid Nyakerario
 
Ad

Data Analytics: HDFS with Big Data : Issues and Application

  • 1. Data Analytics A Lecture Series on Dr.Chitra A.Dhawale P.R.Pote College of Engg.and Mgmt.
  • 3. Data Analytics (MCA19304) COURSE OUTCOMES AT THE END OF COURSE THE STUDENT SHOULD BE ABLE TO : 1. DEVELOP AND MAINTAIN RELIABLE, SCALABLE SYSTEMS USING APACHE, HADOOP 2. WRITE MAP REDUCE BASED APPLICATION 3. DIFFERENTIATE BETWEEN CONVENTIONAL SQL AND NOSQL 4. ANALYZE AND DEVELOP BIG DATA SOLUTIONS USING HIVE AND PIG
  • 4. Data Analytics (MCA19304) UNIT I • DISTRIBUTED FILE SYSTEM AND ITS ISSUES • INTRODUCTION TO BIG DATA, • BIG DATA CHARACTERISTICS • TYPES OF BIG DATA • TRADITIONAL VS. BIG DATA APPROACH • BIG DATA APPLICATIONS
  • 5. Distributed file system and its issues
  • 6. Distributed file system and its issues • A single machine with 4 Hard disks with 1 tb of data (I/O Channel), with 100 Mbps speed. For Processing needs 45 mins. • For Faster Processing : • Divide data and store it on multiple machines with same configuration as above – Assume all machines are processing data in parallel manner then , It will take 45/5 = 9 mins for processing. • Processing will be 5 times faster than a single machine.
  • 7. Distributed file system and its issues
  • 8. Distributed file system and its issues • Each machine has its own local (physical) file system, where you store data, i.e. create folders, subfolders and so on. • A distributed file system is not physical; it is a virtual, or logical, file system. • Hadoop uses a DFS. • The DFS libraries are installed on every machine and run as a separate process on each one. • Together these processes create a virtual layer over the physical file systems beneath them. • This virtual layer is called the distributed file system.
  • 9. Distributed file system and its issues • A virtual file system is software, i.e. a set of programs that offer a set of commands. • Ex. Dfs -copy <source file> <destination file> • Dfs -copy file1 file2 • The command reads file1, which is distributed across 5 machines, say A, B, C, D and E. The user has no idea where each part of the file is. • The path is a virtual path; it does not exist as such on any single machine. • Any DFS follows a master-slave architecture.
  • 10. DFS: Master Machine and Slave Machines • The upper machine is the master and the lower 5 are slaves. • Data is split and stored on the slave machines. • The master does not store any data; it only stores metadata. • Only the master knows how each file is divided into blocks (file-to-block mapping) and how the blocks are distributed over the slave machines (block-to-slave mapping). • Data can only be accessed via the master, as only the master knows the actual location of the data on each slave.
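The two mappings held by the master can be sketched as a pair of dictionaries. This is a toy model, not Hadoop code: the class `Master`, the block ids, the path `/data/file1` and the helper `read_file` are all hypothetical names chosen for illustration.

```python
class Master:
    """Toy master node: holds metadata only, never file contents."""

    def __init__(self):
        self.file_to_blocks = {}   # virtual path -> ordered list of block ids
        self.block_to_slave = {}   # block id -> name of the slave storing it

    def register(self, path, placements):
        """Record a file given as a list of (block_id, slave_name) pairs."""
        self.file_to_blocks[path] = [block_id for block_id, _ in placements]
        for block_id, slave in placements:
            self.block_to_slave[block_id] = slave

    def locate(self, path):
        """Return [(block_id, slave_name), ...] in read order for a virtual path."""
        return [(b, self.block_to_slave[b]) for b in self.file_to_blocks[path]]


# Slaves hold the actual bytes, keyed by block id.
slaves = {
    "A": {"blk_1": b"Hello, "},
    "B": {"blk_2": b"distributed "},
    "C": {"blk_3": b"world"},
}

master = Master()
master.register("/data/file1", [("blk_1", "A"), ("blk_2", "B"), ("blk_3", "C")])


def read_file(path):
    """Client read: ask the master where the blocks are, then fetch them from the slaves."""
    return b"".join(slaves[slave][block_id] for block_id, slave in master.locate(path))


print(read_file("/data/file1"))
```

The client only ever names the virtual path; the block locations stay inside the master's metadata, which is why the master is the single entry point for reads.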
  • 11. HDFS • While reading data, if any node fails, the client may get only partial data. • To overcome this, a replication factor is set when HDFS is configured. With replication factor = 2, every block is stored at two places, i.e. 2 copies are maintained for each block. • If one node fails, the block can be accessed from another node. • Data is transmitted to the machine (server) where the program is running.
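The effect of a replication factor of 2 can be illustrated with a small simulation. This is a minimal sketch under stated assumptions: replicas are placed round-robin on consecutive nodes, which is a simplification chosen for illustration and not HDFS's actual placement policy, and the function names are hypothetical.

```python
def place_replicas(block_ids, nodes, replication_factor=2):
    """Assign each block to `replication_factor` distinct nodes, round-robin."""
    if replication_factor > len(nodes):
        raise ValueError("replication factor cannot exceed the number of nodes")
    placement = {}
    for index, block_id in enumerate(block_ids):
        first = index % len(nodes)
        placement[block_id] = [
            nodes[(first + offset) % len(nodes)] for offset in range(replication_factor)
        ]
    return placement


def read_block(block_id, placement, failed_nodes):
    """Return the first live node holding the block, or raise if every replica is down."""
    for node in placement[block_id]:
        if node not in failed_nodes:
            return node
    raise IOError(f"all replicas of {block_id} are unavailable")


placement = place_replicas(["blk_1", "blk_2", "blk_3"], ["A", "B", "C"], replication_factor=2)

# Node A fails; every block is still readable from its other replica.
sources = {block: read_block(block, placement, failed_nodes={"A"}) for block in placement}
print(placement)
print(sources)
```

With a replication factor of 1 the same failure would make `blk_1` unreadable, which is the partial-data situation the slide describes.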
  • 12. Features of DFS • Transparency: • Structure transparency: The client does not need to know the number or locations of file servers and storage devices. • Access transparency: Both local and remote files should be accessible in the same manner. • Naming transparency: Once a name is given to a file, it should not change when the file is transferred from one node to another.
  • 13. Features of DFS • Replication transparency: If a file is copied on multiple nodes, both the copies of the file and their locations should be hidden from one node to another. • User mobility: The system automatically brings the user's home directory to the node where the user logs in. • Performance: Performance is measured by the average amount of time needed to satisfy client requests. • This time covers CPU time + time taken to access secondary storage + network access time.
  • 14. Features of DFS • Simplicity and ease of use: The user interface of a file system should be simple and the number of commands in the file system should be small. • High availability: A distributed file system should be able to continue operating in case of partial failures such as a link failure, a node failure, or a storage drive crash. A highly reliable and adaptable distributed file system should have different, independent file servers controlling different, independent storage devices. • Scalability: Since growing the network by adding new machines or joining two networks together is routine, the distributed system will inevitably grow over time. As a result, a good distributed file system should be built to scale quickly as the number of nodes and users in the system grows. Service should not be substantially disrupted as the number of nodes and users grows.
  • 15. Features of DFS • High reliability: A file system should create backup copies of key files that can be used if the originals are lost. Many file systems employ stable storage as a high-reliability strategy. • Data integrity: • Multiple users frequently share a file system. • The integrity of data saved in a shared file must be guaranteed by the file system. • That is, concurrent access requests from many users competing for access to the same file must be correctly synchronized using a concurrency control method. • Atomic transactions are a high-level concurrency control mechanism for data integrity that a file system frequently offers to its users.
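One simple concurrency control method is mutual exclusion. The sketch below is an illustration only: an in-memory counter stands in for a shared file, threads stand in for competing users, and `threading.Lock` stands in for whatever locking a real file system provides. It shows plain locking, not atomic transactions, and the class name `SharedFile` is hypothetical.

```python
import threading


class SharedFile:
    """In-memory stand-in for a shared file holding one integer counter."""

    def __init__(self):
        self.value = 0
        self._lock = threading.Lock()

    def increment(self):
        # The lock makes the read-modify-write sequence appear atomic to other threads.
        with self._lock:
            current = self.value
            self.value = current + 1


def worker(shared, times):
    """Simulate one user issuing `times` update requests."""
    for _ in range(times):
        shared.increment()


shared = SharedFile()
threads = [threading.Thread(target=worker, args=(shared, 1000)) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(shared.value)
```

Because every update runs while holding the lock, no update is lost and the final value equals the total number of requests (8 users × 1000 updates).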
  • 16. Features of DFS • Security: • To safeguard the information contained in the file system from unwanted and unauthorized access, security mechanisms must be implemented. • A distributed file system should be secure so that its users can trust that their data will be kept private. • Heterogeneity: Users of heterogeneous distributed systems have the option of using multiple computer platforms for different purposes.
  • 17. Issues with DFS • In a distributed file system, both nodes and connections need to be secured, so security is at stake. • Messages and data may be lost in the network while moving from one node to another. • Database connection in a distributed file system is complicated. • Handling the database is also harder in a distributed file system than in a single-user system. • Overloading may occur if all nodes try to send data at once.
  • 18. Factors in Big Data Generation: Evolution of Technology
  • 19. Factors in Big Data Generation: IoT
  • 20. Factors in Big Data Generation: Social Media
  • 21. Factors in Big Data Generation: Others
  • 22. What is Big Data?
  • 23. Characteristics – Big Data FIVE V’S OF BIG DATA: 1. VOLUME
  • 24. Characteristics – Big Data FIVE V’S OF BIG DATA: 2. VARIETY
  • 25. Characteristics – Big Data FIVE V’S OF BIG DATA: 3. VELOCITY
  • 26. Characteristics – Big Data FIVE V’S OF BIG DATA: 4. VALUE
  • 27. Characteristics – Big Data FIVE V’S OF BIG DATA: 5. VERACITY
  • 28. Characteristics of Big Data at a glance
  • 29. Types of Big Data
  • 30. Types of Big Data: Structured • Structured data includes all data that can be stored in tabular form (rows and columns). • Relational databases are examples of structured data. • Relational databases are easy to make sense of, and most modern computers can process structured data readily.
  • 31. Types of Big Data: Unstructured • Unstructured data is data that lacks any specific form or structure. • It cannot be stored in a spreadsheet or fitted into tabular databases. • Examples of unstructured data include audio, video, and other kinds of data that make up a large share of big data today. Email is another example of unstructured data.
  • 32. Types of Big Data: Semi-structured • Semi-structured data has elements of both structured and unstructured data. • Such data sets have a recognizable structure, but it may still not be possible to sort or process them as tables because of some constraints. • This type of data includes XML data, JSON files, and others.
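A small JSON example makes the semi-structured case concrete. The records below are made up for illustration: each one is self-describing, but the fields differ from record to record, so they do not fit a single fixed table schema.

```python
import json

# Hypothetical records: each is self-describing, but the fields differ per record.
raw = """
[
  {"id": 1, "name": "Asha", "email": "asha@example.com"},
  {"id": 2, "name": "Ravi", "phones": ["555-0100", "555-0101"]},
  {"id": 3, "name": "Mei", "address": {"city": "Pune"}}
]
"""

records = json.loads(raw)

# The union of keys shows why the data does not fit one fixed set of columns.
all_fields = sorted({key for record in records for key in record})
print(all_fields)

# Fields must be read defensively because any record may lack them.
emails = [record.get("email") for record in records]
print(emails)
```

Only `id` and `name` appear in every record; the remaining fields are optional or nested, which is the typical shape of semi-structured data.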
  • 33. Traditional Vs. Big Data • 1. Traditional data: Traditional data is the structured data maintained by businesses of all kinds, from very small firms to big organizations. • In a traditional database system, a centralized database architecture is used to store and maintain the data in a fixed format, or in fixed fields in a file. • Structured Query Language (SQL) is used for managing and accessing the data. • 2. Big data: Big data deals with data sets that are too large or complex to manage with traditional data-processing application software. • It covers large volumes of structured, semi-structured and unstructured data, and is characterized by volume, velocity, variety, veracity and value. • Big data refers not only to a large amount of data but also to extracting meaningful information by analyzing huge, complex data sets.
  • 34. Traditional data vs. big data
      S.No. | TRADITIONAL DATA | BIG DATA
      01 | Generated at the enterprise level. | Generated both outside and at the enterprise level.
      02 | Volume ranges from gigabytes to terabytes. | Volume ranges from petabytes to exabytes or zettabytes.
      03 | Traditional database systems deal with structured data. | Big data systems deal with structured, semi-structured and unstructured data.
      04 | Generated per hour, per day, or less often. | Generated much more frequently, mainly per second.
      05 | The data source is centralized and managed centrally. | The data source is distributed and managed in distributed form.
      06 | Data integration is very easy. | Data integration is very difficult.
      07 | A normal system configuration can process it. | A high-end system configuration is required to process it.
  • 35. Traditional data vs. big data (continued)
      S.No. | TRADITIONAL DATA | BIG DATA
      08 | The size of the data is very small. | The size is larger than that of traditional data.
      09 | Traditional database tools are sufficient for any database operation. | Special kinds of database tools are required for any database operation.
      10 | Normal functions can manipulate the data. | Special kinds of functions are needed to manipulate the data.
      11 | The data model is strict-schema based and static. | The data model is flat-schema based and dynamic.
      12 | The data is stable, with known interrelationships. | The data is not stable, with unknown relationships.
      13 | The volume is manageable. | The volume is huge and becomes unmanageable.
      14 | It is easy to manage and manipulate the data. | It is difficult to manage and manipulate the data.
      15 | Data sources include ERP transaction data, CRM transaction data, financial data, organizational data, web transaction data, etc. | Data sources include social media, device data, sensor data, video, images, audio, etc.
  • 36. Applications of Big Data • Big data in retail • Big data in healthcare • Big data in education • Big data in e-commerce • Big data in media and entertainment • Big data in finance • Big data in travel industry • Big data in telecom • Big data in automobile