The document discusses the data science pipeline and Hadoop ecosystem. It describes the data science pipeline process which includes obtaining data, cleaning data, exploratory data analysis, modeling data, and interpreting data. It also discusses the OSEMN framework which follows a similar process of obtain, scrub, explore, model, and interpret. The document then explains the major components of the Hadoop ecosystem including HDFS for distributed storage, MapReduce for distributed processing, YARN for resource management, and common utilities.
Data Science Pipeline and Hadoop Ecosystem
DATA SCIENCE PIPELINE AND HADOOP
ECOSYSTEM
AND ALSO ABOUT DABL

DATA SCIENCE PIPELINE
• In simple words, a pipeline in data science is "a set of actions that changes the raw (and confusing) data from various sources (surveys, feedback, lists of purchases, votes, etc.) into an understandable format so that we can store it and use it for analysis."

PROCESS OF THE DATA SCIENCE PIPELINE
• Fetching/obtaining the data.
• Scrubbing/cleaning the data.
• Exploratory data analysis.
• Modelling the data.
• Interpreting the data.

THE OSEMN FRAMEWORK

PROCESS OF THE OSEMN FRAMEWORK
• Obtain the data: we obtain the data from different data sources.
• Scrub the data: after obtaining the data, the next step is scrubbing it. In this process we "clean" and filter the data.
• Explore the data: once your data is ready to be used, and right before you jump into AI and machine learning, you have to examine it.
• Model the data: this is the stage most people find interesting; many call it "where the magic happens".
• Interpret the data: interpreting data refers to presenting your results to a non-technical audience.

HADOOP ECOSYSTEM
• The Hadoop ecosystem is a platform, or suite, that provides various services to solve big data problems.
• It includes Apache projects as well as various commercial tools and solutions.
• There are four major elements of Hadoop: HDFS, MapReduce, YARN, and Hadoop Common.
• Most of the other tools and solutions supplement or support these major elements.
• All these tools work collectively to provide services such as ingestion, analysis, storage, and maintenance of data.
• HDFS: HDFS is a distributed file system that handles large data sets running on commodity hardware. It is used to scale a single Apache Hadoop cluster to hundreds (and even thousands) of nodes.
• MAPREDUCE: MapReduce is a programming model for writing applications that process big data in parallel on multiple nodes. MapReduce provides analytical capabilities for analysing huge volumes of complex data.
• YARN: YARN is a large-scale, distributed operating system for big data applications. It is designed for cluster management and is one of the key features of the second generation of Hadoop, the Apache Software Foundation's open-source distributed processing framework.
• HADOOP COMMON: Hadoop Common refers to the collection of common utilities and libraries that support the other Hadoop modules. It is an essential module of the Apache Hadoop framework, alongside HDFS, Hadoop YARN, and Hadoop MapReduce.
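The five OSEMN steps above can be sketched as a toy end-to-end pipeline in Python. The records, field names, and the "model" below are invented purely for illustration; a real pipeline would obtain data from files, APIs, or databases and fit a proper model.

```python
from statistics import mean

# Hypothetical raw survey records; None marks a missing value.
RAW = [{"age": 25, "score": 7}, {"age": None, "score": 9},
       {"age": 31, "score": 6}, {"age": 42, "score": 8}]

def obtain():
    # Obtain: in practice, read from surveys, logs, APIs, databases.
    return RAW

def scrub(records):
    # Scrub: drop records with missing fields.
    return [r for r in records if all(v is not None for v in r.values())]

def explore(records):
    # Explore: summary statistics guide later modelling choices.
    return {"mean_age": mean(r["age"] for r in records),
            "mean_score": mean(r["score"] for r in records)}

def model(records):
    # Model: a toy "model" that predicts the average score for any input.
    avg = mean(r["score"] for r in records)
    return lambda _record: avg

def interpret(summary, predict):
    # Interpret: present the result in plain language for a lay audience.
    return (f"The average respondent is {summary['mean_age']:.0f} years old; "
            f"the model predicts a score of {predict({}):.1f}.")

data = scrub(obtain())
print(interpret(explore(data), model(data)))
```

Each stage takes the previous stage's output, which is the defining property of a pipeline: the steps compose, and any one of them can be swapped out without touching the others.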
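The MapReduce model described above can be illustrated with the classic word-count example. This is a minimal single-process simulation of the three phases (map, shuffle, reduce); in real Hadoop, the map and reduce tasks run in parallel across cluster nodes and the framework performs the shuffle. Function names here are illustrative, not Hadoop APIs.

```python
from collections import defaultdict

def map_phase(lines):
    # Map: each mapper emits a (word, 1) pair for every word in its split.
    for line in lines:
        for word in line.lower().split():
            yield (word, 1)

def shuffle(pairs):
    # Shuffle: group all intermediate values by key, as the framework would.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: each reducer sums the counts for one word.
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["big data needs big tools", "hadoop handles big data"]
counts = reduce_phase(shuffle(map_phase(lines)))
print(counts["big"])   # 3
print(counts["data"])  # 2
```

Because every (word, 1) pair is independent and reducers only ever see the values for their own key, both phases parallelise naturally, which is what lets MapReduce scale the same program from one machine to thousands of nodes.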