
DATA SCIENCE PIPELINE AND HADOOP ECOSYSTEM
(ALSO COVERING DABL)
DATA SCIENCE PIPELINE
• In simple words, a pipeline in data science is a set of actions that
transforms raw (and often messy) data from various sources (surveys,
feedback, purchase lists, votes, etc.) into an understandable format, so
that we can store it and use it for analysis.
PROCESS OF THE DATA SCIENCE PIPELINE
• Fetching/obtaining the data.
• Scrubbing/cleaning the data.
• Exploratory data analysis.
• Modelling the data.
• Interpreting the data.
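As a concrete illustration of these five stages, here is a minimal sketch using pandas and scikit-learn. The file name survey_responses.csv, the feature columns, and the target column "satisfied" are hypothetical placeholders, not part of the original slides.

# Minimal sketch of the five pipeline stages with pandas + scikit-learn.
# Assumption: a hypothetical CSV "survey_responses.csv" with numeric
# feature columns and a binary "satisfied" target column.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# 1. Obtain: load raw data from a source (here, a local CSV file).
raw = pd.read_csv("survey_responses.csv")

# 2. Scrub: drop duplicates and rows with missing values.
clean = raw.drop_duplicates().dropna()

# 3. Explore: quick summary statistics and target balance.
print(clean.describe())
print(clean["satisfied"].value_counts())

# 4. Model: fit a simple classifier on a train/test split.
X = clean.drop(columns=["satisfied"])
y = clean["satisfied"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# 5. Interpret: report a metric a non-technical audience can follow.
print("Held-out accuracy:", accuracy_score(y_test, model.predict(X_test)))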
THE OSEMN FRAMEWORK
PROCESS OF THE OSEMN FRAMEWORK
• Obtain the data: gather the data from the different data sources.
• Scrub the data: after obtaining the data, the next step is to scrub it,
that is, to clean and filter it.
• Explore the data: once the data is ready to be used, and before jumping
into AI and machine learning, examine it (exploratory data analysis).
• Model the data: the stage most people find the most interesting, often
described as "where the magic happens".
• Interpret the data: present your findings to a non-technical audience.
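The title slide also promises DABL. As a hedged sketch, assuming the dabl Python package is installed (pip install dabl) and reusing the hypothetical survey data and "satisfied" target from the earlier example, dabl can compress the scrub, explore, and model steps of OSEMN into a few calls:

# Sketch only: dabl shortcuts for the scrub/explore/model steps.
# The data file and target column are hypothetical, as above.
import dabl
import pandas as pd

df = pd.read_csv("survey_responses.csv")

# Scrub: detect column types and apply simple automatic cleaning.
df_clean = dabl.clean(df)

# Explore: automatic plots of each feature against the target.
dabl.plot(df_clean, target_col="satisfied")

# Model: quick baseline classifier to gauge what is achievable.
baseline = dabl.SimpleClassifier().fit(df_clean, target_col="satisfied")

The baseline classifier is meant for a quick first look at the data, not as a production model.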
HADOOP ECOSYSTEM
• The Hadoop ecosystem is a platform, or suite of tools, that provides various
services to solve big data problems.
• It includes Apache projects as well as various commercial tools and
solutions.
• There are four major elements of Hadoop: HDFS, MapReduce, YARN, and
Hadoop Common.
• Most of the other tools or solutions are used to supplement or support these major elements.
• All these tools work together to provide services such as ingestion, analysis, storage and
maintenance of data.
• HDFS: HDFS (Hadoop Distributed File System) is a distributed file system that handles large
data sets running on commodity hardware. It is used to scale a single Apache Hadoop cluster to
hundreds (and even thousands) of nodes.
• MAP-REDUCE: MapReduce is a programming model for writing applications that process big
data in parallel across multiple nodes, providing analytical capabilities for huge volumes of
complex data (a minimal word-count sketch in Python follows this list).
• YARN: YARN (Yet Another Resource Negotiator) is the cluster resource management and job
scheduling layer of Hadoop, sometimes described as a large-scale, distributed operating system
for big data applications. It was introduced as one of the key features of the second generation
of Hadoop, the Apache Software Foundation's open-source distributed processing framework.
• HADOOP COMMON: Hadoop Common refers to the collection of common utilities and libraries
that support the other Hadoop modules. It is an essential module of the Apache Hadoop
framework, alongside the Hadoop Distributed File System (HDFS), Hadoop YARN and Hadoop
MapReduce.
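To make the MapReduce bullet concrete, below is a minimal word-count sketch in the map/reduce style. On a real cluster the mapper and reducer would typically be packaged as separate stdin-to-stdout scripts and launched through Hadoop Streaming, with the framework performing the shuffle/sort between the two phases; here the shuffle is simulated locally with sorted() so the sketch is self-contained, and the sample sentences are illustrative only.

# Word count in the MapReduce style: the mapper emits (word, 1) pairs,
# the reducer sums the counts for each word.
from itertools import groupby
from operator import itemgetter

def mapper(lines):
    # Map phase: emit a (word, 1) pair for every word in every line.
    for line in lines:
        for word in line.strip().lower().split():
            yield word, 1

def reducer(pairs):
    # Reduce phase: pairs arrive sorted by key; sum the counts per word.
    for word, group in groupby(pairs, key=itemgetter(0)):
        yield word, sum(count for _, count in group)

if __name__ == "__main__":
    sample = ["big data needs big clusters", "hadoop processes big data"]
    shuffled = sorted(mapper(sample))  # stands in for Hadoop's shuffle/sort
    for word, total in reducer(shuffled):
        print(word, total)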
