SlideShare a Scribd company logo
BigDataProcessing
usingHadoop
What is Big Data?
•Data which is very large in size and yet growing
exponentially with time is called as Big data.
Why we need Big Data?
• For any application that contains limited amount of data we normally
use SQL / PostgreSQL / Oracle / MySQL.
• What in case of large applications like Facebook, Google, YouTube?
• This data is so large and complex that none of the traditional data
management system is able to store and process it.
• Facebook generates 500+ TB data per day.
Where does Big Data come from?
•Social data
•Share Market
•E-commerce site
•Airplane
What is the need for storing such
huge amount of data?
• The main reason behind storing data is analysis.
• Data analysis
• More accurate analyses
Example
• When we search anything on e-commerce websites(eBay,
Amazon), we get some recommendations of product that we
search.
• Similarly, why Facebook stores our images ,videos?
• a) Global marketing
• b) Target marketing
Categories Of Big Data
• Structured Data:
• Example :Relational database management system.
• Unstructured Data:
• Example Text files ,images ,videos ,email ,customer service
interactions ,webpages ,PDF files ,PPT ,social media data etc.
Big Data and Hadoop
What is Hadoop?
• Apache Hadoop is an open-source java framework that is used to
store ,analyze and process Big data in a distributed systems
environment across clusters of computers.
• It is used by Google, Facebook, Yahoo, YouTube, Twitter, LinkedIn and
many more.
• It is inspired by google MapReduce algorithm for running distributed
application.
• Hadoop is written in java programming language with some code in C
and some commands in shell-scripts.
Continued...
•Hadoop is not a database:
Architecture of Hadoop:
• Hadoop follows master-slave architecture
• Two important components of Hadoop are:
• 1.HDFS (Data storage)
• 2.Map-Reduce (Analyzing and Processing)
Big Data and Hadoop
Hadoop Distributed File System
(HDFS)
• Hadoop Distributed File System is distributed file system
used to store very huge amount of data.
• HDFS follows master-slave architecture means there is
one master machine(Name Node) and multiple slave
machines(Data Node).
• The data that you give to hadoop is stored across these
machines in the cluster.
Big Data and Hadoop
Various components of HDFS
• Blocks:
• Name Node:
• Data Node:
• Secondary Name Node:
Map reduce
• MapReduce is a framework and processing technique using which we can
write applications to process huge amounts of data, in parallel, on large
clusters of commodity hardware in a reliable manner.
• Map reduce program is written by default in java but we can use other
language also like pig, apache pig.
• The Map Reduce algorithm contains two important tasks:
• 1) Map
• 2) Reduce
Map
• Mapper takes a set of data and converts it into another set
of data, where individual elements are broken down into
tuples (key/value pairs).
Reduce
• Reducer takes the output from a map as input and
combines those data tuples into a smaller set of tuples.
• The reduce job is always performed after the map job.
Map Reduce Example
Components of MapReduce
• JobTracker:
• TaskTracker:
JobTracker process
• JobTracker receives the requests from the client.
• JobTracker talks to the NameNode to determine the location of the
data.
• JobTracker finds the best TaskTracker nodes to execute tasks.
• The JobTracker submits the work to the chosen TaskTracker nodes.
• The TaskTracker nodes are monitored. If they do not submit heartbeat
signals then work is scheduled on a different TaskTracker.
• When the work is completed, the JobTracker updates its status and
submit back the overall status of job back to the client.
• The JobTracker is a point of failure for the Hadoop MapReduce service.
If it goes down, all running jobs are halted.
TaskTracker process
• The JobTracker submits the work to the TaskTracker nodes.
• TaskTracker run the tasks and report the status of task to
JobTracker.
• It has function of following the orders of the job tracker and
updating the job tracker with its progress status periodically.
• TaskTracker will be in constant communication with the
JobTracker.
• TaskTracker failure is not considered fatal. When a TaskTracker
becomes unresponsive, JobTracker will assign the task to another
node.
Big Data and Hadoop
Sqoop
• Sqoop is an open source framework provided by Apache.
• Used to import/export data to/from relational databases.
Why Sqoop is used?
• For Hadoop developers, the work starts after data is loaded
into HDFS.
• The data residing in the RDBMS need to be transferred to
HDFS and might need to transfer back to RDBMS.
• Sqoop uses MapReduce framework to import and export
the data, which provides parallel mechanism as well as fault
tolerance.
• Sqoop makes developers life easy by providing command
line interface.
• Developers just need to provide basic information like
source, destination and database authentication details in
the sqoop command.

More Related Content

What's hot (20)

Data Platform at Twitter: Enabling Real-time & Batch Analytics at Scale
Data Platform at Twitter: Enabling Real-time & Batch Analytics at ScaleData Platform at Twitter: Enabling Real-time & Batch Analytics at Scale
Data Platform at Twitter: Enabling Real-time & Batch Analytics at Scale
Sriram Krishnan
 
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Spark Summit
 
Lessons Learned from Modernizing USCIS Data Analytics Platform
Lessons Learned from Modernizing USCIS Data Analytics PlatformLessons Learned from Modernizing USCIS Data Analytics Platform
Lessons Learned from Modernizing USCIS Data Analytics Platform
Databricks
 
An introduction to Apache Hadoop Hive
An introduction to Apache Hadoop HiveAn introduction to Apache Hadoop Hive
An introduction to Apache Hadoop Hive
Mike Frampton
 
Reimagining Devon Energy’s Data Estate with a Unified Approach to Integration...
Reimagining Devon Energy’s Data Estate with a Unified Approach to Integration...Reimagining Devon Energy’s Data Estate with a Unified Approach to Integration...
Reimagining Devon Energy’s Data Estate with a Unified Approach to Integration...
Databricks
 
An Online Spark Pipeline: Semi-Supervised Learning and Automatic Retraining w...
An Online Spark Pipeline: Semi-Supervised Learning and Automatic Retraining w...An Online Spark Pipeline: Semi-Supervised Learning and Automatic Retraining w...
An Online Spark Pipeline: Semi-Supervised Learning and Automatic Retraining w...
Databricks
 
Intro to Big Data - Spark
Intro to Big Data - SparkIntro to Big Data - Spark
Intro to Big Data - Spark
Sofian Hadiwijaya
 
Big Data Computing Architecture
Big Data Computing ArchitectureBig Data Computing Architecture
Big Data Computing Architecture
Gang Tao
 
A Production Quality Sketching Library for the Analysis of Big Data
A Production Quality Sketching Library for the Analysis of Big DataA Production Quality Sketching Library for the Analysis of Big Data
A Production Quality Sketching Library for the Analysis of Big Data
Databricks
 
Spark Summit San Francisco 2016 - Matei Zaharia Keynote: Apache Spark 2.0
Spark Summit San Francisco 2016 - Matei Zaharia Keynote: Apache Spark 2.0Spark Summit San Francisco 2016 - Matei Zaharia Keynote: Apache Spark 2.0
Spark Summit San Francisco 2016 - Matei Zaharia Keynote: Apache Spark 2.0
Databricks
 
Spark Summit EU talk by Oscar Castaneda
Spark Summit EU talk by Oscar CastanedaSpark Summit EU talk by Oscar Castaneda
Spark Summit EU talk by Oscar Castaneda
Spark Summit
 
Building an ETL pipeline for Elasticsearch using Spark
Building an ETL pipeline for Elasticsearch using SparkBuilding an ETL pipeline for Elasticsearch using Spark
Building an ETL pipeline for Elasticsearch using Spark
Itai Yaffe
 
project--2 nd review_2
project--2 nd review_2project--2 nd review_2
project--2 nd review_2
Aswini Ashu
 
WHAT IS HADOOP AND ITS COMPONENTS?
WHAT IS HADOOP AND ITS COMPONENTS? WHAT IS HADOOP AND ITS COMPONENTS?
WHAT IS HADOOP AND ITS COMPONENTS?
nakshatraL
 
Introduction to Big Data Technologies: Hadoop/EMR/Map Reduce & Redshift
Introduction to Big Data Technologies:  Hadoop/EMR/Map Reduce & RedshiftIntroduction to Big Data Technologies:  Hadoop/EMR/Map Reduce & Redshift
Introduction to Big Data Technologies: Hadoop/EMR/Map Reduce & Redshift
DataKitchen
 
Atlanta Data Science Meetup | Qubole slides
Atlanta Data Science Meetup | Qubole slidesAtlanta Data Science Meetup | Qubole slides
Atlanta Data Science Meetup | Qubole slides
Qubole
 
GCP Data Engineer cheatsheet
GCP Data Engineer cheatsheetGCP Data Engineer cheatsheet
GCP Data Engineer cheatsheet
Guang Xu
 
2014.02.13 (Strata) Graph Analysis with One Trillion Edges on Apache Giraph
2014.02.13 (Strata) Graph Analysis with One Trillion Edges on Apache Giraph2014.02.13 (Strata) Graph Analysis with One Trillion Edges on Apache Giraph
2014.02.13 (Strata) Graph Analysis with One Trillion Edges on Apache Giraph
Avery Ching
 
Announcing Databricks Cloud (Spark Summit 2014)
Announcing Databricks Cloud (Spark Summit 2014)Announcing Databricks Cloud (Spark Summit 2014)
Announcing Databricks Cloud (Spark Summit 2014)
Databricks
 
Hadoop vs. RDBMS for Advanced Analytics
Hadoop vs. RDBMS for Advanced AnalyticsHadoop vs. RDBMS for Advanced Analytics
Hadoop vs. RDBMS for Advanced Analytics
joshwills
 
Data Platform at Twitter: Enabling Real-time & Batch Analytics at Scale
Data Platform at Twitter: Enabling Real-time & Batch Analytics at ScaleData Platform at Twitter: Enabling Real-time & Batch Analytics at Scale
Data Platform at Twitter: Enabling Real-time & Batch Analytics at Scale
Sriram Krishnan
 
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Spark Summit
 
Lessons Learned from Modernizing USCIS Data Analytics Platform
Lessons Learned from Modernizing USCIS Data Analytics PlatformLessons Learned from Modernizing USCIS Data Analytics Platform
Lessons Learned from Modernizing USCIS Data Analytics Platform
Databricks
 
An introduction to Apache Hadoop Hive
An introduction to Apache Hadoop HiveAn introduction to Apache Hadoop Hive
An introduction to Apache Hadoop Hive
Mike Frampton
 
Reimagining Devon Energy’s Data Estate with a Unified Approach to Integration...
Reimagining Devon Energy’s Data Estate with a Unified Approach to Integration...Reimagining Devon Energy’s Data Estate with a Unified Approach to Integration...
Reimagining Devon Energy’s Data Estate with a Unified Approach to Integration...
Databricks
 
An Online Spark Pipeline: Semi-Supervised Learning and Automatic Retraining w...
An Online Spark Pipeline: Semi-Supervised Learning and Automatic Retraining w...An Online Spark Pipeline: Semi-Supervised Learning and Automatic Retraining w...
An Online Spark Pipeline: Semi-Supervised Learning and Automatic Retraining w...
Databricks
 
Big Data Computing Architecture
Big Data Computing ArchitectureBig Data Computing Architecture
Big Data Computing Architecture
Gang Tao
 
A Production Quality Sketching Library for the Analysis of Big Data
A Production Quality Sketching Library for the Analysis of Big DataA Production Quality Sketching Library for the Analysis of Big Data
A Production Quality Sketching Library for the Analysis of Big Data
Databricks
 
Spark Summit San Francisco 2016 - Matei Zaharia Keynote: Apache Spark 2.0
Spark Summit San Francisco 2016 - Matei Zaharia Keynote: Apache Spark 2.0Spark Summit San Francisco 2016 - Matei Zaharia Keynote: Apache Spark 2.0
Spark Summit San Francisco 2016 - Matei Zaharia Keynote: Apache Spark 2.0
Databricks
 
Spark Summit EU talk by Oscar Castaneda
Spark Summit EU talk by Oscar CastanedaSpark Summit EU talk by Oscar Castaneda
Spark Summit EU talk by Oscar Castaneda
Spark Summit
 
Building an ETL pipeline for Elasticsearch using Spark
Building an ETL pipeline for Elasticsearch using SparkBuilding an ETL pipeline for Elasticsearch using Spark
Building an ETL pipeline for Elasticsearch using Spark
Itai Yaffe
 
project--2 nd review_2
project--2 nd review_2project--2 nd review_2
project--2 nd review_2
Aswini Ashu
 
WHAT IS HADOOP AND ITS COMPONENTS?
WHAT IS HADOOP AND ITS COMPONENTS? WHAT IS HADOOP AND ITS COMPONENTS?
WHAT IS HADOOP AND ITS COMPONENTS?
nakshatraL
 
Introduction to Big Data Technologies: Hadoop/EMR/Map Reduce & Redshift
Introduction to Big Data Technologies:  Hadoop/EMR/Map Reduce & RedshiftIntroduction to Big Data Technologies:  Hadoop/EMR/Map Reduce & Redshift
Introduction to Big Data Technologies: Hadoop/EMR/Map Reduce & Redshift
DataKitchen
 
Atlanta Data Science Meetup | Qubole slides
Atlanta Data Science Meetup | Qubole slidesAtlanta Data Science Meetup | Qubole slides
Atlanta Data Science Meetup | Qubole slides
Qubole
 
GCP Data Engineer cheatsheet
GCP Data Engineer cheatsheetGCP Data Engineer cheatsheet
GCP Data Engineer cheatsheet
Guang Xu
 
2014.02.13 (Strata) Graph Analysis with One Trillion Edges on Apache Giraph
2014.02.13 (Strata) Graph Analysis with One Trillion Edges on Apache Giraph2014.02.13 (Strata) Graph Analysis with One Trillion Edges on Apache Giraph
2014.02.13 (Strata) Graph Analysis with One Trillion Edges on Apache Giraph
Avery Ching
 
Announcing Databricks Cloud (Spark Summit 2014)
Announcing Databricks Cloud (Spark Summit 2014)Announcing Databricks Cloud (Spark Summit 2014)
Announcing Databricks Cloud (Spark Summit 2014)
Databricks
 
Hadoop vs. RDBMS for Advanced Analytics
Hadoop vs. RDBMS for Advanced AnalyticsHadoop vs. RDBMS for Advanced Analytics
Hadoop vs. RDBMS for Advanced Analytics
joshwills
 

Similar to Big Data and Hadoop (20)

Technologies for Data Analytics Platform
Technologies for Data Analytics PlatformTechnologies for Data Analytics Platform
Technologies for Data Analytics Platform
N Masahiro
 
Hadoop and MapReduce Introductort presentation
Hadoop and MapReduce Introductort presentationHadoop and MapReduce Introductort presentation
Hadoop and MapReduce Introductort presentation
ssuserb91a20
 
Chapter 10
Chapter 10Chapter 10
Chapter 10
pavan penugonda
 
MODULE 1: Introduction to Big Data Analytics.pptx
MODULE 1: Introduction to Big Data Analytics.pptxMODULE 1: Introduction to Big Data Analytics.pptx
MODULE 1: Introduction to Big Data Analytics.pptx
NiramayKolalle
 
HADOOP DISTRIBUTED FILE SYSTEM AND MAPREDUCE
HADOOP DISTRIBUTED FILE SYSTEM AND MAPREDUCEHADOOP DISTRIBUTED FILE SYSTEM AND MAPREDUCE
HADOOP DISTRIBUTED FILE SYSTEM AND MAPREDUCE
Harsha Siva Sai
 
Data Analytics and IoT, how to analyze data from IoT
Data Analytics and IoT, how to analyze data from IoTData Analytics and IoT, how to analyze data from IoT
Data Analytics and IoT, how to analyze data from IoT
AmmarHassan80
 
Big Data Analytics With Hadoop
Big Data Analytics With HadoopBig Data Analytics With Hadoop
Big Data Analytics With Hadoop
Umair Shafique
 
Hadoop.pptx
Hadoop.pptxHadoop.pptx
Hadoop.pptx
arslanhaneef
 
Hadoop.pptx
Hadoop.pptxHadoop.pptx
Hadoop.pptx
sonukumar379092
 
List of Engineering Colleges in Uttarakhand
List of Engineering Colleges in UttarakhandList of Engineering Colleges in Uttarakhand
List of Engineering Colleges in Uttarakhand
Roorkee College of Engineering, Roorkee
 
Big data technology
Big data technology Big data technology
Big data technology
omer mohamed abd alrhman
 
the mapreduce programming paradigm in cybersecurity
the  mapreduce programming paradigm in cybersecuritythe  mapreduce programming paradigm in cybersecurity
the mapreduce programming paradigm in cybersecurity
xawomi1686
 
Big data applications
Big data applicationsBig data applications
Big data applications
Juan Pablo Paz Grau, Ph.D., PMP
 
Architecting Your First Big Data Implementation
Architecting Your First Big Data ImplementationArchitecting Your First Big Data Implementation
Architecting Your First Big Data Implementation
Adaryl "Bob" Wakefield, MBA
 
project--2 nd review_2
project--2 nd review_2project--2 nd review_2
project--2 nd review_2
aswini pilli
 
Hadoop and It_s Components_PPT .pptx
Hadoop and It_s Components_PPT      .pptxHadoop and It_s Components_PPT      .pptx
Hadoop and It_s Components_PPT .pptx
ABHIJEETKUMAR632313
 
IoT ppts unit 5.gfajfajfgajfgagfafaffpptx
IoT ppts unit 5.gfajfajfgajfgagfafaffpptxIoT ppts unit 5.gfajfajfgajfgagfafaffpptx
IoT ppts unit 5.gfajfajfgajfgagfafaffpptx
rajukolluri
 
Hadoop Master Class : A concise overview
Hadoop Master Class : A concise overviewHadoop Master Class : A concise overview
Hadoop Master Class : A concise overview
Abhishek Roy
 
Intro to Big Data
Intro to Big DataIntro to Big Data
Intro to Big Data
Zohar Elkayam
 
Hadoop ppt1
Hadoop ppt1Hadoop ppt1
Hadoop ppt1
chariorienit
 
Technologies for Data Analytics Platform
Technologies for Data Analytics PlatformTechnologies for Data Analytics Platform
Technologies for Data Analytics Platform
N Masahiro
 
Hadoop and MapReduce Introductort presentation
Hadoop and MapReduce Introductort presentationHadoop and MapReduce Introductort presentation
Hadoop and MapReduce Introductort presentation
ssuserb91a20
 
MODULE 1: Introduction to Big Data Analytics.pptx
MODULE 1: Introduction to Big Data Analytics.pptxMODULE 1: Introduction to Big Data Analytics.pptx
MODULE 1: Introduction to Big Data Analytics.pptx
NiramayKolalle
 
HADOOP DISTRIBUTED FILE SYSTEM AND MAPREDUCE
HADOOP DISTRIBUTED FILE SYSTEM AND MAPREDUCEHADOOP DISTRIBUTED FILE SYSTEM AND MAPREDUCE
HADOOP DISTRIBUTED FILE SYSTEM AND MAPREDUCE
Harsha Siva Sai
 
Data Analytics and IoT, how to analyze data from IoT
Data Analytics and IoT, how to analyze data from IoTData Analytics and IoT, how to analyze data from IoT
Data Analytics and IoT, how to analyze data from IoT
AmmarHassan80
 
Big Data Analytics With Hadoop
Big Data Analytics With HadoopBig Data Analytics With Hadoop
Big Data Analytics With Hadoop
Umair Shafique
 
the mapreduce programming paradigm in cybersecurity
the  mapreduce programming paradigm in cybersecuritythe  mapreduce programming paradigm in cybersecurity
the mapreduce programming paradigm in cybersecurity
xawomi1686
 
project--2 nd review_2
project--2 nd review_2project--2 nd review_2
project--2 nd review_2
aswini pilli
 
Hadoop and It_s Components_PPT .pptx
Hadoop and It_s Components_PPT      .pptxHadoop and It_s Components_PPT      .pptx
Hadoop and It_s Components_PPT .pptx
ABHIJEETKUMAR632313
 
IoT ppts unit 5.gfajfajfgajfgagfafaffpptx
IoT ppts unit 5.gfajfajfgajfgagfafaffpptxIoT ppts unit 5.gfajfajfgajfgagfafaffpptx
IoT ppts unit 5.gfajfajfgajfgagfafaffpptx
rajukolluri
 
Hadoop Master Class : A concise overview
Hadoop Master Class : A concise overviewHadoop Master Class : A concise overview
Hadoop Master Class : A concise overview
Abhishek Roy
 

Recently uploaded (20)

AI Competitor Analysis: How to Monitor and Outperform Your Competitors
AI Competitor Analysis: How to Monitor and Outperform Your CompetitorsAI Competitor Analysis: How to Monitor and Outperform Your Competitors
AI Competitor Analysis: How to Monitor and Outperform Your Competitors
Contify
 
DPR_Expert_Recruitment_notice_Revised.pdf
DPR_Expert_Recruitment_notice_Revised.pdfDPR_Expert_Recruitment_notice_Revised.pdf
DPR_Expert_Recruitment_notice_Revised.pdf
inmishra17121973
 
computer organization and assembly language.docx
computer organization and assembly language.docxcomputer organization and assembly language.docx
computer organization and assembly language.docx
alisoftwareengineer1
 
Simple_AI_Explanation_English somplr.pptx
Simple_AI_Explanation_English somplr.pptxSimple_AI_Explanation_English somplr.pptx
Simple_AI_Explanation_English somplr.pptx
ssuser2aa19f
 
C++_OOPs_DSA1_Presentation_Template.pptx
C++_OOPs_DSA1_Presentation_Template.pptxC++_OOPs_DSA1_Presentation_Template.pptx
C++_OOPs_DSA1_Presentation_Template.pptx
aquibnoor22079
 
Thingyan is now a global treasure! See how people around the world are search...
Thingyan is now a global treasure! See how people around the world are search...Thingyan is now a global treasure! See how people around the world are search...
Thingyan is now a global treasure! See how people around the world are search...
Pixellion
 
Medical Dataset including visualizations
Medical Dataset including visualizationsMedical Dataset including visualizations
Medical Dataset including visualizations
vishrut8750588758
 
Digilocker under workingProcess Flow.pptx
Digilocker  under workingProcess Flow.pptxDigilocker  under workingProcess Flow.pptx
Digilocker under workingProcess Flow.pptx
satnamsadguru491
 
Stack_and_Queue_Presentation_Final (1).pptx
Stack_and_Queue_Presentation_Final (1).pptxStack_and_Queue_Presentation_Final (1).pptx
Stack_and_Queue_Presentation_Final (1).pptx
binduraniha86
 
Template_A3nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
Template_A3nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnTemplate_A3nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
Template_A3nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
cegiver630
 
CTS EXCEPTIONSPrediction of Aluminium wire rod physical properties through AI...
CTS EXCEPTIONSPrediction of Aluminium wire rod physical properties through AI...CTS EXCEPTIONSPrediction of Aluminium wire rod physical properties through AI...
CTS EXCEPTIONSPrediction of Aluminium wire rod physical properties through AI...
ThanushsaranS
 
Just-In-Timeasdfffffffghhhhhhhhhhj Systems.ppt
Just-In-Timeasdfffffffghhhhhhhhhhj Systems.pptJust-In-Timeasdfffffffghhhhhhhhhhj Systems.ppt
Just-In-Timeasdfffffffghhhhhhhhhhj Systems.ppt
ssuser5f8f49
 
Classification_in_Machinee_Learning.pptx
Classification_in_Machinee_Learning.pptxClassification_in_Machinee_Learning.pptx
Classification_in_Machinee_Learning.pptx
wencyjorda88
 
Secure_File_Storage_Hybrid_Cryptography.pptx..
Secure_File_Storage_Hybrid_Cryptography.pptx..Secure_File_Storage_Hybrid_Cryptography.pptx..
Secure_File_Storage_Hybrid_Cryptography.pptx..
yuvarajreddy2002
 
1. Briefing Session_SEED with Hon. Governor Assam - 27.10.pdf
1. Briefing Session_SEED with Hon. Governor Assam - 27.10.pdf1. Briefing Session_SEED with Hon. Governor Assam - 27.10.pdf
1. Briefing Session_SEED with Hon. Governor Assam - 27.10.pdf
Simran112433
 
Flip flop presenation-Presented By Mubahir khan.pptx
Flip flop presenation-Presented By Mubahir khan.pptxFlip flop presenation-Presented By Mubahir khan.pptx
Flip flop presenation-Presented By Mubahir khan.pptx
mubashirkhan45461
 
03 Daniel 2-notes.ppt seminario escatologia
03 Daniel 2-notes.ppt seminario escatologia03 Daniel 2-notes.ppt seminario escatologia
03 Daniel 2-notes.ppt seminario escatologia
Alexander Romero Arosquipa
 
04302025_CCC TUG_DataVista: The Design Story
04302025_CCC TUG_DataVista: The Design Story04302025_CCC TUG_DataVista: The Design Story
04302025_CCC TUG_DataVista: The Design Story
ccctableauusergroup
 
Day 1 - Lab 1 Reconnaissance Scanning with NMAP, Vulnerability Assessment wit...
Day 1 - Lab 1 Reconnaissance Scanning with NMAP, Vulnerability Assessment wit...Day 1 - Lab 1 Reconnaissance Scanning with NMAP, Vulnerability Assessment wit...
Day 1 - Lab 1 Reconnaissance Scanning with NMAP, Vulnerability Assessment wit...
Abodahab
 
Deloitte Analytics - Applying Process Mining in an audit context
Deloitte Analytics - Applying Process Mining in an audit contextDeloitte Analytics - Applying Process Mining in an audit context
Deloitte Analytics - Applying Process Mining in an audit context
Process mining Evangelist
 
AI Competitor Analysis: How to Monitor and Outperform Your Competitors
AI Competitor Analysis: How to Monitor and Outperform Your CompetitorsAI Competitor Analysis: How to Monitor and Outperform Your Competitors
AI Competitor Analysis: How to Monitor and Outperform Your Competitors
Contify
 
DPR_Expert_Recruitment_notice_Revised.pdf
DPR_Expert_Recruitment_notice_Revised.pdfDPR_Expert_Recruitment_notice_Revised.pdf
DPR_Expert_Recruitment_notice_Revised.pdf
inmishra17121973
 
computer organization and assembly language.docx
computer organization and assembly language.docxcomputer organization and assembly language.docx
computer organization and assembly language.docx
alisoftwareengineer1
 
Simple_AI_Explanation_English somplr.pptx
Simple_AI_Explanation_English somplr.pptxSimple_AI_Explanation_English somplr.pptx
Simple_AI_Explanation_English somplr.pptx
ssuser2aa19f
 
C++_OOPs_DSA1_Presentation_Template.pptx
C++_OOPs_DSA1_Presentation_Template.pptxC++_OOPs_DSA1_Presentation_Template.pptx
C++_OOPs_DSA1_Presentation_Template.pptx
aquibnoor22079
 
Thingyan is now a global treasure! See how people around the world are search...
Thingyan is now a global treasure! See how people around the world are search...Thingyan is now a global treasure! See how people around the world are search...
Thingyan is now a global treasure! See how people around the world are search...
Pixellion
 
Medical Dataset including visualizations
Medical Dataset including visualizationsMedical Dataset including visualizations
Medical Dataset including visualizations
vishrut8750588758
 
Digilocker under workingProcess Flow.pptx
Digilocker  under workingProcess Flow.pptxDigilocker  under workingProcess Flow.pptx
Digilocker under workingProcess Flow.pptx
satnamsadguru491
 
Stack_and_Queue_Presentation_Final (1).pptx
Stack_and_Queue_Presentation_Final (1).pptxStack_and_Queue_Presentation_Final (1).pptx
Stack_and_Queue_Presentation_Final (1).pptx
binduraniha86
 
Template_A3nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
Template_A3nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnTemplate_A3nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
Template_A3nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
cegiver630
 
CTS EXCEPTIONSPrediction of Aluminium wire rod physical properties through AI...
CTS EXCEPTIONSPrediction of Aluminium wire rod physical properties through AI...CTS EXCEPTIONSPrediction of Aluminium wire rod physical properties through AI...
CTS EXCEPTIONSPrediction of Aluminium wire rod physical properties through AI...
ThanushsaranS
 
Just-In-Timeasdfffffffghhhhhhhhhhj Systems.ppt
Just-In-Timeasdfffffffghhhhhhhhhhj Systems.pptJust-In-Timeasdfffffffghhhhhhhhhhj Systems.ppt
Just-In-Timeasdfffffffghhhhhhhhhhj Systems.ppt
ssuser5f8f49
 
Classification_in_Machinee_Learning.pptx
Classification_in_Machinee_Learning.pptxClassification_in_Machinee_Learning.pptx
Classification_in_Machinee_Learning.pptx
wencyjorda88
 
Secure_File_Storage_Hybrid_Cryptography.pptx..
Secure_File_Storage_Hybrid_Cryptography.pptx..Secure_File_Storage_Hybrid_Cryptography.pptx..
Secure_File_Storage_Hybrid_Cryptography.pptx..
yuvarajreddy2002
 
1. Briefing Session_SEED with Hon. Governor Assam - 27.10.pdf
1. Briefing Session_SEED with Hon. Governor Assam - 27.10.pdf1. Briefing Session_SEED with Hon. Governor Assam - 27.10.pdf
1. Briefing Session_SEED with Hon. Governor Assam - 27.10.pdf
Simran112433
 
Flip flop presenation-Presented By Mubahir khan.pptx
Flip flop presenation-Presented By Mubahir khan.pptxFlip flop presenation-Presented By Mubahir khan.pptx
Flip flop presenation-Presented By Mubahir khan.pptx
mubashirkhan45461
 
04302025_CCC TUG_DataVista: The Design Story
04302025_CCC TUG_DataVista: The Design Story04302025_CCC TUG_DataVista: The Design Story
04302025_CCC TUG_DataVista: The Design Story
ccctableauusergroup
 
Day 1 - Lab 1 Reconnaissance Scanning with NMAP, Vulnerability Assessment wit...
Day 1 - Lab 1 Reconnaissance Scanning with NMAP, Vulnerability Assessment wit...Day 1 - Lab 1 Reconnaissance Scanning with NMAP, Vulnerability Assessment wit...
Day 1 - Lab 1 Reconnaissance Scanning with NMAP, Vulnerability Assessment wit...
Abodahab
 
Deloitte Analytics - Applying Process Mining in an audit context
Deloitte Analytics - Applying Process Mining in an audit contextDeloitte Analytics - Applying Process Mining in an audit context
Deloitte Analytics - Applying Process Mining in an audit context
Process mining Evangelist
 

Big Data and Hadoop

  • 2. What is Big Data? •Data which is very large in size and yet growing exponentially with time is called as Big data.
  • 3. Why we need Big Data? • For any application that contains limited amount of data we normally use SQL / PostgreSQL / Oracle / MySQL. • What in case of large applications like Facebook, Google, YouTube? • This data is so large and complex that none of the traditional data management system is able to store and process it. • Facebook generates 500+ TB data per day.
  • 4. Where does Big Data come from? •Social data •Share Market •E-commerce site •Airplane
  • 5. What is the need for storing such huge amount of data? • The main reason behind storing data is analysis. • Data analysis • More accurate analyses
  • 6. Example • When we search anything on e-commerce websites(eBay, Amazon), we get some recommendations of product that we search. • Similarly, why Facebook stores our images ,videos? • a) Global marketing • b) Target marketing
  • 7. Categories Of Big Data • Structured Data: • Example :Relational database management system. • Unstructured Data: • Example Text files ,images ,videos ,email ,customer service interactions ,webpages ,PDF files ,PPT ,social media data etc.
  • 9. What is Hadoop? • Apache Hadoop is an open-source java framework that is used to store ,analyze and process Big data in a distributed systems environment across clusters of computers. • It is used by Google, Facebook, Yahoo, YouTube, Twitter, LinkedIn and many more. • It is inspired by google MapReduce algorithm for running distributed application. • Hadoop is written in java programming language with some code in C and some commands in shell-scripts.
  • 11. Architecture of Hadoop: • Hadoop follows master-slave architecture • Two important components of Hadoop are: • 1.HDFS (Data storage) • 2.Map-Reduce (Analyzing and Processing)
  • 13. Hadoop Distributed File System (HDFS) • Hadoop Distributed File System is distributed file system used to store very huge amount of data. • HDFS follows master-slave architecture means there is one master machine(Name Node) and multiple slave machines(Data Node). • The data that you give to hadoop is stored across these machines in the cluster.
  • 15. Various components of HDFS • Blocks: • Name Node: • Data Node: • Secondary Name Node:
  • 16. Map reduce • MapReduce is a framework and processing technique using which we can write applications to process huge amounts of data, in parallel, on large clusters of commodity hardware in a reliable manner. • Map reduce program is written by default in java but we can use other language also like pig, apache pig. • The Map Reduce algorithm contains two important tasks: • 1) Map • 2) Reduce
  • 17. Map • Mapper takes a set of data and converts it into another set of data, where individual elements are broken down into tuples (key/value pairs).
  • 18. Reduce • Reducer takes the output from a map as input and combines those data tuples into a smaller set of tuples. • The reduce job is always performed after the map job.
  • 20. Components of MapReduce • JobTracker: • TaskTracker:
  • 21. JobTracker process • JobTracker receives the requests from the client. • JobTracker talks to the NameNode to determine the location of the data. • JobTracker finds the best TaskTracker nodes to execute tasks. • The JobTracker submits the work to the chosen TaskTracker nodes. • The TaskTracker nodes are monitored. If they do not submit heartbeat signals then work is scheduled on a different TaskTracker. • When the work is completed, the JobTracker updates its status and submit back the overall status of job back to the client. • The JobTracker is a point of failure for the Hadoop MapReduce service. If it goes down, all running jobs are halted.
  • 22. TaskTracker process • The JobTracker submits the work to the TaskTracker nodes. • TaskTracker run the tasks and report the status of task to JobTracker. • It has function of following the orders of the job tracker and updating the job tracker with its progress status periodically. • TaskTracker will be in constant communication with the JobTracker. • TaskTracker failure is not considered fatal. When a TaskTracker becomes unresponsive, JobTracker will assign the task to another node.
  • 24. Sqoop • Sqoop is an open source framework provided by Apache. • Used to import/export data to/from relational databases.
  • 25. Why Sqoop is used? • For Hadoop developers, the work starts after data is loaded into HDFS. • The data residing in the RDBMS need to be transferred to HDFS and might need to transfer back to RDBMS. • Sqoop uses MapReduce framework to import and export the data, which provides parallel mechanism as well as fault tolerance. • Sqoop makes developers life easy by providing command line interface. • Developers just need to provide basic information like source, destination and database authentication details in the sqoop command.