
DATA SCIENCE PIPELINE AND HADOOP ECOSYSTEM
(ALSO COVERING DABL)
DATA SCIENCE PIPELINE
• In simple words, a pipeline in data science is a set of actions that
transforms raw (and often messy) data from various sources (surveys,
feedback, purchase lists, votes, etc.) into an understandable format, so
that we can store it and use it for analysis.
PROCESS OF THE DATA SCIENCE PIPELINE
• Fetching/obtaining the data.
• Scrubbing/cleaning the data.
• Exploratory data analysis.
• Modelling the data.
• Interpreting the data.
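As a concrete illustration of these five stages, here is a minimal sketch using pandas and scikit-learn. The file name survey_responses.csv, the feature columns, and the target column "satisfied" are hypothetical placeholders, not part of the original slides.

# Minimal sketch of the five pipeline stages with pandas + scikit-learn.
# Assumption: a hypothetical CSV "survey_responses.csv" with numeric
# feature columns and a binary "satisfied" target column.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# 1. Obtain: load raw data from a source (here, a local CSV file).
raw = pd.read_csv("survey_responses.csv")

# 2. Scrub: drop duplicates and rows with missing values.
clean = raw.drop_duplicates().dropna()

# 3. Explore: quick summary statistics and target balance.
print(clean.describe())
print(clean["satisfied"].value_counts())

# 4. Model: fit a simple classifier on a train/test split.
X = clean.drop(columns=["satisfied"])
y = clean["satisfied"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# 5. Interpret: report a metric a non-technical audience can follow.
print("Held-out accuracy:", accuracy_score(y_test, model.predict(X_test)))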
THE OSEMN FRAMEWORK
PROCESS OF THE OSEMN FRAMEWORK
• Obtain the data: gather the data from the different data sources.
• Scrub the data: after obtaining the data, the next step is to scrub it,
that is, to clean and filter it.
• Explore the data: once the data is ready to be used, and before jumping
into AI and machine learning, examine it (exploratory data analysis).
• Model the data: the stage most people find the most interesting, often
described as "where the magic happens".
• Interpret the data: present your findings to a non-technical audience.
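The title slide also promises DABL. As a hedged sketch, assuming the dabl Python package is installed (pip install dabl) and reusing the hypothetical survey data and "satisfied" target from the earlier example, dabl can compress the scrub, explore, and model steps of OSEMN into a few calls:

# Sketch only: dabl shortcuts for the scrub/explore/model steps.
# The data file and target column are hypothetical, as above.
import dabl
import pandas as pd

df = pd.read_csv("survey_responses.csv")

# Scrub: detect column types and apply simple automatic cleaning.
df_clean = dabl.clean(df)

# Explore: automatic plots of each feature against the target.
dabl.plot(df_clean, target_col="satisfied")

# Model: quick baseline classifier to gauge what is achievable.
baseline = dabl.SimpleClassifier().fit(df_clean, target_col="satisfied")

The baseline classifier is meant for a quick first look at the data, not as a production model.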
HADOOP ECOSYSTEM
• The Hadoop ecosystem is a platform, or suite of tools, that provides various
services to solve big data problems.
• It includes Apache projects as well as various commercial tools and
solutions.
• There are four major elements of Hadoop: HDFS, MapReduce, YARN, and
Hadoop Common.
• Most of the other tools or solutions are used to supplement or support these major elements.
• All these tools work together to provide services such as ingestion, analysis, storage and
maintenance of data.
• HDFS: HDFS (Hadoop Distributed File System) is a distributed file system that handles large
data sets running on commodity hardware. It is used to scale a single Apache Hadoop cluster to
hundreds (and even thousands) of nodes.
• MAP-REDUCE: MapReduce is a programming model for writing applications that process big
data in parallel across multiple nodes, providing analytical capabilities for huge volumes of
complex data (a minimal word-count sketch in Python follows this list).
• YARN: YARN (Yet Another Resource Negotiator) is the cluster resource management and job
scheduling layer of Hadoop, sometimes described as a large-scale, distributed operating system
for big data applications. It was introduced as one of the key features of the second generation
of Hadoop, the Apache Software Foundation's open-source distributed processing framework.
• HADOOP COMMON: Hadoop Common refers to the collection of common utilities and libraries
that support the other Hadoop modules. It is an essential module of the Apache Hadoop
framework, alongside the Hadoop Distributed File System (HDFS), Hadoop YARN and Hadoop
MapReduce.
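To make the MapReduce bullet concrete, below is a minimal word-count sketch in the map/reduce style. On a real cluster the mapper and reducer would typically be packaged as separate stdin-to-stdout scripts and launched through Hadoop Streaming, with the framework performing the shuffle/sort between the two phases; here the shuffle is simulated locally with sorted() so the sketch is self-contained, and the sample sentences are illustrative only.

# Word count in the MapReduce style: the mapper emits (word, 1) pairs,
# the reducer sums the counts for each word.
from itertools import groupby
from operator import itemgetter

def mapper(lines):
    # Map phase: emit a (word, 1) pair for every word in every line.
    for line in lines:
        for word in line.strip().lower().split():
            yield word, 1

def reducer(pairs):
    # Reduce phase: pairs arrive sorted by key; sum the counts per word.
    for word, group in groupby(pairs, key=itemgetter(0)):
        yield word, sum(count for _, count in group)

if __name__ == "__main__":
    sample = ["big data needs big clusters", "hadoop processes big data"]
    shuffled = sorted(mapper(sample))  # stands in for Hadoop's shuffle/sort
    for word, total in reducer(shuffled):
        print(word, total)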
