Manoj Mishra
November 23, 2017
Data Science Lifecycle
AGENDA
• Introduction
• LifeCycle
• SkillTree
• Questions
DATA SCIENTIST
Effort:
• 60% Organize & clean data
• 19% Collect data / datasets
• 9% Data mining to draw patterns
• 7% Model selection, training and refining
• 5% Other tasks
DATA SCIENCE LIFE CYCLE
• Business Understanding
• Data Acquisition
• Data Preparation
• Hypothesis & Modelling
• Evaluation & Interpretation
• Deployment
• Operations
• Optimization
DATA ACQUISITION
Static
• Feedback system
• CSV Data sets / text files
Live
• Logs data, memory dumps
• Sensors, controllers etc.
Virtual
• Data Virtualization
• Caching, Storing
DATA SAMPLE DATASET INVENTORY
PROJECT - PREDICTING FAILURE – PROACTIVE MAINTENANCE
• Baseline normal operational patterns by modelling the unstructured log data.
• Use domain experts to identify patterns that precede failures.
• Use statistical measurements & machine learning to determine thresholds.
• Identify patterns of activity to anticipate and react to circumstances that might otherwise disrupt operations.
SME - Domain
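One way to read the "statistical measurements to determine thresholds" step is a simple baseline rule: measure a metric during normal operation and flag anything far above it. The sketch below assumes a mean plus k standard deviations rule; the metric (error lines per hour), the sample numbers, and the function names are all hypothetical and not taken from the project.

```python
import statistics


def baseline_threshold(normal_values, k=3.0):
    """Return mean + k * standard deviation of values seen in normal operation."""
    mean = statistics.mean(normal_values)
    std = statistics.pstdev(normal_values)
    return mean + k * std


def flag_anomalies(values, threshold):
    """Return the indices of observations that exceed the threshold."""
    return [i for i, v in enumerate(values) if v > threshold]


# Hypothetical counts of error lines per hour extracted from log files.
normal_hours = [4, 6, 5, 7, 5, 6, 4, 5]
threshold = baseline_threshold(normal_hours)

# New observations to screen against the baseline.
new_hours = [5, 6, 30, 4]
print(flag_anomalies(new_hours, threshold))
```

In practice the value of k, and whether a fixed rule is adequate at all, would be settled with the domain experts mentioned above.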
DATA PREPARATION
• Need for Data Preparation
  • Bad or poor-quality data can reduce accuracy and lead to incorrect insights.
  • Gartner: poor-quality data costs an average organization $13.5M per year.
  • The dataset might contain discrepancies in names or codes.
  • The dataset might contain outliers or errors.
  • The dataset might lack the attributes of interest for analysis.
  • All in all, the dataset offers quantity but not quality.
• Steps Involved
DATA PREPARATION
• Includes steps to explore, preprocess, and condition data
• Create robust environment – analytics sandbox
• Data preparation tends to be the most labor-intensive step in the analytics lifecycle
• Often at least 50–60% of the data science project's time
• The data preparation phase is generally the most iterative and the one that data scientists tend to underestimate most often
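The data problems listed on the previous slide (inconsistent names or codes, outliers, missing attributes) can each be screened with a small check. The following is a minimal sketch using only the Python standard library; the helper names, the 1.5 × IQR outlier rule, and the sample values are assumptions chosen for illustration, not steps prescribed by the slides.

```python
import statistics


def normalize_code(raw):
    """Collapse case and surrounding whitespace so ' jfk ' and 'JFK' match."""
    return raw.strip().upper()


def iqr_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


def missing_fields(record, required):
    """List required attributes that are absent or empty in a record."""
    return [f for f in required if record.get(f) in (None, "")]


# Hypothetical values, for illustration only.
print(normalize_code(" jfk "))
print(iqr_outliers([10, 12, 11, 13, 12, 11, 300]))
print(missing_fields({"tailnum": "", "origin": "JFK"},
                     ["tailnum", "origin", "dest"]))
```

Checks like these are usually rerun after every cleaning pass, which is one reason the phase is so iterative.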
Database :
Understand the data
Understand the Business
Airlines :
NYC FLIGHTS 13
e.g. tailnum : A tail number refers to an identification number painted on an aircraft, frequently on the tail
Goal : Predicting flight delays
Modelling
PREDICTING FLIGHT DELAYS - NYC FLIGHTS 13
• Exploratory Data Analysis of the flight data for inbound and outbound flights for the year 2013 in NYC.
• Find patterns, benchmark, model and find predictors.
• Predict the flight delays for NYC inbound/outbound flights.
• Ref. DataSet
OBSERVATIONS - NYC FLIGHTS 13
REF. DATASET
COMMON TOOLS - FOR DATA PREPARATION
• Alpine Miner provides a graphical user interface for creating analytic workflows
• OpenRefine (formerly Google Refine) is a free, open source tool for working with messy data
• Similar to OpenRefine, Data Wrangler is an interactive tool for data cleansing and transformation
• Alteryx and Informatica are also worth trying.
HYPOTHESIS & MODELLING
• There are three main tasks addressed in this stage:
  • Feature engineering: Create data features from the raw data to facilitate model training.
  • Model training: Find the model that answers the question most accurately by comparing the candidates' success metrics.
  • Production readiness: Determine if your model is suitable for production.
FEATURE SELECTION & ENGINEERING
• Select features
• Research feature relevance
• Experiment & validate
• Change the feature set if required
• Go back to feature selection
FEATURE ENGINEERING
Raw data:
Date         # footfalls in Dubai Mall
01/07/2017   124532
02/07/2017   65434
03/07/2017   12333
04/07/2017   60009
05/07/2017   46567
06/07/2017   98001
07/07/2017   146543
08/07/2017   112345
09/07/2017   76543
With the engineered feature added:
Date         # footfalls in Dubai Mall   IsHoliday?
01/07/2017   124532                      Yes
02/07/2017   65434                       No
03/07/2017   12333                       No
04/07/2017   60009                       No
05/07/2017   46567                       No
06/07/2017   98001                       No
07/07/2017   146543                      Yes
08/07/2017   112345                      Yes
09/07/2017   76543                       No
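The slide does not say how the IsHoliday? column was derived. As a sketch, assume the dates are in day/month/year order and that "holiday" means the Friday and Saturday weekend; under that assumption the flag can be computed from the date alone, and it reproduces the column in the table.

```python
from datetime import datetime

# Assumption: "holiday" here means a Friday or Saturday.
# datetime.weekday() counts Monday as 0, so Friday is 4 and Saturday is 5.
WEEKEND_DAYS = {4, 5}


def is_holiday(date_text):
    """Return True when a dd/mm/yyyy date falls on the assumed weekend."""
    day = datetime.strptime(date_text, "%d/%m/%Y")
    return day.weekday() in WEEKEND_DAYS


# Footfall figures copied from the slide.
footfalls = [
    ("01/07/2017", 124532), ("02/07/2017", 65434), ("03/07/2017", 12333),
    ("04/07/2017", 60009), ("05/07/2017", 46567), ("06/07/2017", 98001),
    ("07/07/2017", 146543), ("08/07/2017", 112345), ("09/07/2017", 76543),
]

# Add the engineered feature as a third column.
enriched = [(d, n, "Yes" if is_holiday(d) else "No") for d, n in footfalls]
for row in enriched:
    print(*row)
```

A real holiday feature would also need a calendar of public holidays, which this sketch leaves out.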
FEATURE ENGINEERING
Seasonality (holiday season): Jun, Jul & Dec account for the highest avg. delays
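A seasonality observation like this one typically comes from averaging delays per month, and it can then be turned into a binary feature. The sketch below shows both steps; the function names are hypothetical and the delay values are made-up toy numbers, not figures from the NYC flights data.

```python
from collections import defaultdict

# Months named on the slide as the holiday season: Jun, Jul and Dec.
HOLIDAY_MONTHS = {6, 7, 12}


def is_holiday_season(month):
    """Binary seasonality feature derived from the month number (1-12)."""
    return month in HOLIDAY_MONTHS


def average_delay_by_month(flights):
    """Average the delay (minutes) per month for (month, delay) pairs."""
    totals = defaultdict(lambda: [0.0, 0])
    for month, delay in flights:
        totals[month][0] += delay
        totals[month][1] += 1
    return {month: total / count for month, (total, count) in totals.items()}


# Toy (month, delay in minutes) pairs, invented for illustration.
toy_flights = [(1, 5.0), (1, 7.0), (6, 20.0), (6, 24.0), (12, 18.0)]
print(average_delay_by_month(toy_flights))
print([is_holiday_season(m) for m, _ in toy_flights])
```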
MODELLING
CREATE YOUR MODEL & EVALUATE
• Split the input data randomly into a training data set and a test data set for modelling.
• Build the models by using the training data set.
• Evaluate the models on the training and the test data sets. Use a series of competing machine-learning algorithms along with their associated tuning parameters (known as a parameter sweep) that are geared toward answering the question of interest with the current data.
• Determine the “best” solution to answer the question by comparing the success metrics between alternative methods.
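The steps above can be sketched end to end in plain Python: a random split, a tiny k-nearest-neighbours classifier, and a sweep over k that keeps the best-scoring value. Everything here is an illustrative assumption, including the one-dimensional toy data, the choice of KNN, and the candidate values of k.

```python
import random


def split(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of the rows and cut it into (train, test)."""
    rows = rows[:]
    random.Random(seed).shuffle(rows)
    cut = int(len(rows) * (1 - test_fraction))
    return rows[:cut], rows[cut:]


def knn_predict(train, x, k):
    """Majority label among the k training points closest to x."""
    nearest = sorted(train, key=lambda row: abs(row[0] - x))[:k]
    votes = [label for _, label in nearest]
    return max(set(votes), key=votes.count)


def accuracy(train, test, k):
    """Fraction of test rows whose label is predicted correctly."""
    hits = sum(knn_predict(train, x, k) == y for x, y in test)
    return hits / len(test)


# Hypothetical data: one numeric feature, label 1 when the value is >= 50.
data = [(v, int(v >= 50)) for v in range(0, 100, 2)]
train, test = split(data)

# Parameter sweep: score each candidate k and keep the best one.
sweep = {k: accuracy(train, test, k) for k in (1, 3, 5, 7)}
best_k = max(sweep, key=sweep.get)
print(sweep, best_k)
```

A common refinement is to run the sweep on a separate validation set or with cross-validation, so that the test set is only used once for the final estimate.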
CREATE YOUR MODEL & EVALUATE
• Supervised Learning
  • Naive Bayes
  • KNN
  • Support Vector Machines (SVM)
  • Linear Regression
• Unsupervised Learning
  • Principal Component Analysis
  • K-Means
• Classification Metrics
  • Accuracy Score
  • Classification Report
  • Confusion Matrix
• Regression Metrics
  • Mean Absolute Error
  • Mean Squared Error
  • R² Score
• Clustering Metrics
  • Adjusted Rand Index
  • Homogeneity
  • V-measure
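To make a few of the listed metrics concrete, here is a sketch that computes a binary confusion matrix, accuracy, mean absolute error, mean squared error and R² by hand. The function names echo common library naming but are defined here in plain Python, and the label and value lists are invented for illustration.

```python
def confusion_matrix(y_true, y_pred):
    """Return (tn, fp, fn, tp) for binary labels 0 and 1."""
    pairs = list(zip(y_true, y_pred))
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    return tn, fp, fn, tp


def accuracy_score(y_true, y_pred):
    """Share of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_absolute_error(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_squared_error(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_score(y_true, y_pred):
    """1 - (residual sum of squares / total sum of squares)."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Invented classification labels and regression values.
labels_true, labels_pred = [1, 0, 1, 1, 0], [1, 0, 0, 1, 1]
values_true, values_pred = [3.0, 5.0, 7.0], [2.0, 5.0, 9.0]

print(confusion_matrix(labels_true, labels_pred))
print(accuracy_score(labels_true, labels_pred))
print(mean_absolute_error(values_true, values_pred))
print(mean_squared_error(values_true, values_pred))
print(r2_score(values_true, values_pred))
```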
DEPLOYMENT
After you have a set of models that perform well, you can operationalize them by exposing them through APIs or other interfaces that various applications can consume, such as:
• Online websites
• Spreadsheets
• Dashboards
• Line-of-business applications
• Back-end applications
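As a sketch of what "operationalizing through an API" can look like, the function below takes a JSON request body and returns a JSON response, which is the contract a website, dashboard or back-end application would consume. The `predict_delay` function is a hypothetical stand-in for a trained model, and its return values are placeholders rather than real predictions.

```python
import json


def predict_delay(features):
    """Hypothetical stand-in for a trained model (placeholder outputs)."""
    return 15.0 if features.get("month") in (6, 7, 12) else 5.0


def handle_request(body):
    """Turn a JSON request body into a JSON prediction response."""
    try:
        features = json.loads(body)
    except json.JSONDecodeError:
        return json.dumps({"error": "invalid JSON"})
    if not isinstance(features, dict):
        return json.dumps({"error": "expected a JSON object"})
    return json.dumps({"predicted_delay_minutes": predict_delay(features)})


# Example calls as a client application might make them.
print(handle_request('{"month": 7}'))
print(handle_request("not json"))
```

Wrapping `handle_request` in a web framework or a scheduled batch job is a separate decision that depends on which of the consumers above needs the predictions.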
Data science life cycle final
THANK YOU !
