00 - Lesson - Data Science Workflow - Jupyter Notebook

Learning Data Science, Part One

Uploaded by

almamalik


Data Science Workflow

Why Data Science?

Motivation to learn Data Science

Smartphones collect data
Data-driven decisions in business
Understand the world

Job opportunities: Data Science with Python

Over 2.5 million jobs in data science and related professions (Burning Glass)
Python is the most popular language for data science (KDnuggets)
Data Science and Analytics professionals have an average starting salary of over USD 80,000 in the US.

Popularity?

The amount of data available today
Data-driven applications have shown high value

How did Data Science start?

Data Analysis Helped Defeat Cholera (https://www.udacity.com/blog/2015/02/how-data-analysis-helped-defeat-cholera.html)
First Modern Weather Forecast (https://en.wikipedia.org/wiki/Weather_forecasting#Modern_methods)

Data Science: Understanding the Problem

Get the right question:
What is the problem we try to solve?
This forms the Data Science problem

Examples:
Sales figures and call center logs: evaluate a new product
Sensor data from multiple sensors: detect equipment failure
Customer data + marketing data: better targeted marketing

Assess situation:
Risks, Benefits, Contingencies, Regulations, Resources, Requirements

Define goal:
What is the objective?
What are the success criteria?

Conclusion:
Defining the problem is key to successful Data Science projects

The Data Science Workflow

Example: Weather forecast

Step 1
Problem: Predict weather tomorrow
Data: Time series on Temperature, Air pressure, Humidity, Rain, Wind speed, Wind direction, etc.
Import: Collect data from sources

Step 2
Explore: Data quality
Visualize: A great way to understand data
Cleaning: Handle missing or faulty data

Step 3
Features: Select features to use in model
Model: Example in predicting rain/no rain (Tutorial + Video + Code + Project (https://www.learnpythonwithrune.org/linear-classifier-from-scratch-explained-on-real-project/))
Analyze: Apply the prediction model

Step 4
Present: Weather forecast
Visualize: Charts, maps, etc.
Credibility: Inaccurate results, too high confidence, not presenting full findings

Step 5
Insights: What to wear, impact on outside events, etc.
Impact: Sales and weather forecast (umbrella, ice cream, etc.)
Main goal: This is what makes Data Science valuable

Data Scientist Skills

Beginner vs. Expert Data Scientists

Hard skills
Math and Statistics
Programming
Domain knowledge
Data visualization

Soft skills
Curiosity
Communication
Storytelling Skills
Structured Thinking

Student Grade Prediction

Predict the final grade of Portuguese high school students (source: Kaggle (https://www.kaggle.com/dipam7/student-grade-prediction))

Features

Feature     Description
school      student's school (binary: 'GP' - Gabriel Pereira or 'MS' - Mousinho da Silveira)
sex         student's sex (binary: 'F' - female or 'M' - male)
age         student's age (numeric: from 15 to 22)
address     student's home address type (binary: 'U' - urban or 'R' - rural)
famsize     family size (binary: 'LE3' - less or equal to 3 or 'GT3' - greater than 3)
Pstatus     parent's cohabitation status (binary: 'T' - living together or 'A' - apart)
Medu        mother's education (numeric: 0: none, 1: primary education (4th grade), 2: 5th to 9th grade, 3: secondary education or 4: higher education)
Fedu        father's education (numeric: 0: none, 1: primary education (4th grade), 2: 5th to 9th grade, 3: secondary education or 4: higher education)
Mjob        mother's job (nominal: 'teacher', 'health' care related, civil 'services' (e.g. administrative or police), 'at_home' or 'other')
Fjob        father's job (nominal: 'teacher', 'health' care related, civil 'services' (e.g. administrative or police), 'at_home' or 'other')
reason      reason to choose this school (nominal: close to 'home', school 'reputation', 'course' preference or 'other')
guardian    student's guardian (nominal: 'mother', 'father' or 'other')
traveltime  home to school travel time (numeric: 1: <15 min., 2: 15-30 min., 3: 30 min. - 1 hour, or 4: >1 hour)
studytime   weekly study time (numeric: 1: <2 hours, 2: 2 to 5 hours, 3: 5 to 10 hours, or 4: >10 hours)
failures    number of past class failures (numeric: n if 1<=n<3, else 4)
schoolsup   extra educational support (binary: yes or no)
famsup      family educational support (binary: yes or no)
paid        extra paid classes within the course subject (Math or Portuguese) (binary: yes or no)
activities  extra-curricular activities (binary: yes or no)
nursery     attended nursery school (binary: yes or no)
higher      wants to take higher education (binary: yes or no)
internet    Internet access at home (binary: yes or no)
romantic    with a romantic relationship (binary: yes or no)
famrel      quality of family relationships (numeric: from 1 - very bad to 5 - excellent)
freetime    free time after school (numeric: from 1 - very low to 5 - very high)
goout       going out with friends (numeric: from 1 - very low to 5 - very high)
Dalc        workday alcohol consumption (numeric: from 1 - very low to 5 - very high)
Walc        weekend alcohol consumption (numeric: from 1 - very low to 5 - very high)
health      current health status (numeric: from 1 - very bad to 5 - very good)
absences    number of school absences (numeric: from 0 to 93)

Targets

Feature     Description
G1          first period grade (numeric: from 0 to 20)
G2          second period grade (numeric: from 0 to 20)
G3          final grade (numeric: from 0 to 20, output target)

Problem: Propose activities to improve G3 grades.

Our Goal

To guide the school on how it can help students get higher grades

Programming Notes:

Libraries used
pandas (https://pandas.pydata.org) - a data analysis and manipulation tool

Functionality and concepts used

CSV (https://en.wikipedia.org/wiki/Comma-separated_values) file (Lecture on CSV (https://youtu.be/LEyojSOg4EI))
read_csv() (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html) Read a comma-separated values (csv) file into a pandas DataFrame.
corr() (https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.corr.html) Compute pairwise correlation of pandas DataFrame columns, excluding NA/null values.
groupby() (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html) Group a pandas DataFrame by a column to apply a function to each group.
mean() (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.mean.html) Return the mean of the values over the requested axis in a pandas DataFrame.
std() (https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.std.html) Return sample standard deviation over requested axis of a pandas DataFrame.
count() (https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.count.html) Count non-NA cells for each column or row of a pandas DataFrame.

Step 1: Acquire
Explore problem
Identify data
Import data

Get the right questions

This forms the data science problem
What is the problem

Examples

Data                                 Problem
Sales figures and call center logs   evaluate a new product
Sensor data from multiple sensors    detect equipment failure
Customer data & marketing data       better targeted marketing

Understand context

Student age?
What is possible?
What is the budget?

Data

Import and read the data

In [1]:
import pandas as pd

In [2]:
data = pd.read_csv('files/student-mat.csv')

In [3]:
len(data)

Out[3]:
395

In [4]:
data.head()

Out[4]:
  school sex  age address famsize Pstatus  Medu  Fedu     Mjob      Fjob  ...  famrel
0     GP   F   18       U     GT3       A     4     4  at_home   teacher  ...       4
1     GP   F   17       U     GT3       T     1     1  at_home     other  ...       5
2     GP   F   15       U     LE3       T     1     1  at_home     other  ...       4
3     GP   F   15       U     GT3       T     4     2   health  services  ...       3
4     GP   F   16       U     GT3       T     3     3    other     other  ...       4

5 rows × 33 columns

In [5]:
data.columns

Out[5]:
Index(['school', 'sex', 'age', 'address', 'famsize', 'Pstatus', 'Medu', 'Fedu',
       'Mjob', 'Fjob', 'reason', 'guardian', 'traveltime', 'studytime',
       'failures', 'schoolsup', 'famsup', 'paid', 'activities', 'nursery',
       'higher', 'internet', 'romantic', 'famrel', 'freetime', 'goout', 'Dalc',
       'Walc', 'health', 'absences', 'G1', 'G2', 'G3'],
      dtype='object')

Step 2: Prepare
Explore data
Visualize ideas
Cleaning data
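As a minimal sketch of what the Prepare step can involve when the data is not clean, the snippet below builds a small made-up frame (not the student data) with one missing grade and one grade outside the 0 to 20 scale, then inspects and cleans it. The column names mirror the dataset, but the values, the range check, and the decision to drop bad rows rather than repair them are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Made-up toy rows for illustration only; not taken from student-mat.csv.
toy = pd.DataFrame({
    "studytime": [2, 3, 1, 2],
    "higher": ["yes", "yes", "no", "yes"],
    "G3": [12.0, np.nan, 8.0, 25.0],  # one missing grade, one above the 0-20 scale
})

# Explore: data types and number of missing values per column.
print(toy.dtypes)
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# Clean: drop rows with a missing grade, then keep only grades on the 0-20 scale.
clean = toy.dropna(subset=["G3"])
clean = clean[clean["G3"].between(0, 20)]
print(len(clean))
```

Dropping rows is the simplest option; with few rows or many gaps, filling values (for example with `fillna`) may be a better fit, depending on the problem.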
Are the data types as expected?

In [6]:
data.dtypes

Out[6]:
school        object
sex           object
age            int64
address       object
famsize       object
Pstatus       object
Medu           int64
Fedu           int64
Mjob          object
Fjob          object
reason        object
guardian      object
traveltime     int64
studytime      int64
failures       int64
schoolsup     object
famsup        object
paid          object
activities    object
nursery       object
higher        object
internet      object
romantic      object
famrel         int64
freetime       int64
goout          int64
Dalc           int64
Walc           int64
health         int64
absences       int64
G1             int64
G2             int64
G3             int64
dtype: object

Are there missing values?

In [7]:
data.isnull().any()

Out[7]:
school        False
sex           False
age           False
address       False
famsize       False
Pstatus       False
Medu          False
Fedu          False
Mjob          False
Fjob          False
reason        False
guardian      False
traveltime    False
studytime     False
failures      False
schoolsup     False
famsup        False
paid          False
activities    False
nursery       False
higher        False
internet      False
romantic      False
famrel        False
freetime      False
goout         False
Dalc          False
Walc          False
health        False
absences      False
G1            False
G2            False
G3            False
dtype: bool

Step 3: Analyze
Feature selection
Model selection
Analyze data*
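One common starting point for feature selection is to rank features by how strongly they correlate with the target. The sketch below does this on made-up rows (not the student data). Binary yes/no columns such as `higher` are stored as text, so the example maps them to 0/1 first; that encoding, the toy values, and ranking by absolute correlation are assumptions for illustration, not steps taken in the notebook.

```python
import pandas as pd

# Made-up toy rows for illustration only; not taken from student-mat.csv.
toy = pd.DataFrame({
    "failures": [0, 0, 1, 2, 3],
    "studytime": [3, 2, 2, 1, 1],
    "higher": ["yes", "yes", "yes", "no", "no"],  # text column
    "G3": [15, 13, 10, 7, 5],
})

# Encode the yes/no column as 1/0 so it can take part in the correlation.
encoded = toy.assign(higher=toy["higher"].map({"no": 0, "yes": 1}))

# Correlation of every feature with the target, excluding the target itself.
corr_with_target = encoded.corr()["G3"].drop("G3")

# Rank features by the strength (absolute value) of the correlation.
ranked = corr_with_target.abs().sort_values(ascending=False)
print(ranked)
```

A high rank only says that a feature moves together with the grade in this sample; it does not say the feature can be changed or that changing it would change the grade.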

Explore correlation

How to interpret correlation

In [10]:
data.corr()['G3']

Out[10]:
age          -0.161579
Medu          0.217147
Fedu          0.152457
traveltime   -0.117142
studytime     0.097820
failures     -0.360415
famrel        0.051363
freetime      0.011307
goout        -0.132791
Dalc         -0.054660
Walc         -0.051939
health       -0.061335
absences      0.034247
G1            0.801468
G2            0.904868
G3            1.000000
Name: G3, dtype: float64

Note that the correlation is computed only for numeric features. In pandas 2.0 and later this requires data.corr(numeric_only=True), because text columns are no longer dropped by default.

In [11]:
data.columns

Out[11]:
Index(['school', 'sex', 'age', 'address', 'famsize', 'Pstatus', 'Medu', 'Fedu',
       'Mjob', 'Fjob', 'reason', 'guardian', 'traveltime', 'studytime',
       'failures', 'schoolsup', 'famsup', 'paid', 'activities', 'nursery',
       'higher', 'internet', 'romantic', 'famrel', 'freetime', 'goout', 'Dalc',
       'Walc', 'health', 'absences', 'G1', 'G2', 'G3'],
      dtype='object')

Find high-impact features - but what can you do about them? Medu, Mjob

In [12]:
data.groupby('higher')['G3'].mean()

Out[12]:
higher
no      6.800
yes    10.608
Name: G3, dtype: float64

In [13]:
data.groupby('higher')['G3'].count()

Out[13]:
higher
no      20
yes    375
Name: G3, dtype: int64

Standard deviation

In [14]:
data.groupby('higher')['G3'].std()

Out[14]:
higher
no     4.829732
yes    4.493422
Name: G3, dtype: float64

Step 4: Report
Present findings
Visualize results
Credibility counts
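When presenting findings, it can help to show a group's mean next to its size and spread, so the reader can judge how much weight the mean deserves. The sketch below collects all three in one table with `agg` on made-up rows (not the student data); the toy values are assumptions for illustration.

```python
import pandas as pd

# Made-up toy rows for illustration only; not taken from student-mat.csv.
toy = pd.DataFrame({
    "higher": ["yes", "yes", "yes", "no", "no"],
    "G3": [14, 10, 12, 6, 8],
})

# One table with mean, group size, and sample standard deviation per group.
summary = toy.groupby("higher")["G3"].agg(["mean", "count", "std"])
print(summary)
```

On the real data the same call would put the small 'no' group (20 students) directly beside its mean, which makes the limited sample size visible in the report.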

Credibility

The impact we expect

In [15]:

data['G3'].mean()

Out[15]:

10.415189873417722

In [16]:

data.groupby('higher')['G3'].mean()

Out[16]:

higher
no 6.800
yes 10.608
Name: G3, dtype: float64

In [17]:

data.groupby('higher')['G3'].count()

Out[17]:

higher
no 20
yes 375
Name: G3, dtype: int64

The impact is not that large
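A back-of-the-envelope calculation shows why the expected impact is limited. It uses only the group sizes and means from the outputs above, and it assumes the best case in which every student in the 'no' group would reach the 'yes' group mean. That best case is an assumption for the estimate, not a finding.

```python
# Figures copied from the notebook outputs above.
n_no, n_yes = 20, 375
mean_no, mean_yes = 6.800, 10.608

# Overall mean reconstructed from the group means, weighted by group size.
overall = (n_no * mean_no + n_yes * mean_yes) / (n_no + n_yes)

# Best-case assumption: every 'no' student moves up to the 'yes' group mean,
# so the overall mean would equal the 'yes' group mean.
best_case = mean_yes
max_gain = best_case - overall

print(round(overall, 3))   # 10.415
print(round(max_gain, 3))  # 0.193
```

Because only 20 of 395 students are in the 'no' group, even this best case raises the average final grade by about 0.19 points on the 0 to 20 scale.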

Step 5: Actions
Use insights
Measure impact
Main goal

Impact measures

Same data after effort

Measure whether fewer students answer 'no' to 'higher'
Did it influence G3?
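The two questions above can be sketched as a comparison of the same measures on two snapshots of the data. Both frames below are made-up (no follow-up data exists in this lesson), and the helper name `impact_summary` is hypothetical.

```python
import pandas as pd

def impact_summary(df: pd.DataFrame) -> dict:
    """Share of students answering 'no' to higher, and the mean final grade."""
    return {
        "share_no_higher": (df["higher"] == "no").mean(),
        "mean_G3": df["G3"].mean(),
    }

# Made-up snapshots for illustration only: before and after the effort.
before = pd.DataFrame({"higher": ["no", "no", "yes", "yes"], "G3": [6, 8, 11, 11]})
after = pd.DataFrame({"higher": ["no", "yes", "yes", "yes"], "G3": [7, 9, 11, 13]})

b, a = impact_summary(before), impact_summary(after)
change_in_share = a["share_no_higher"] - b["share_no_higher"]
change_in_mean = a["mean_G3"] - b["mean_G3"]

print(change_in_share)  # negative means fewer 'no' answers
print(change_in_mean)   # positive means a higher average final grade
```

A drop in the 'no' share together with a rise in G3 is consistent with the effort having helped, but other changes between the two snapshots could explain it as well.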
