00 - Lesson - Data Science Workflow - Jupyter Notebook

Learning Data Science, Part One

Uploaded by

almamalik
Copyright
© All Rights Reserved

Data Science Workflow

Why Data Science?

Motivation to learn Data Science
- Smartphones collect data
- Data-driven decisions in business
- Understand the world

Job opportunities: Data Science with Python
- Over 2.5 million jobs in data science and related professions (Burning Glass)
- Python is the most popular language for data science (KDnuggets)
- Data Science and Analytics professionals have an average starting salary of over USD 80,000 in the US

Popularity?
- The amount of data available today
- Data-driven applications have shown high value

How did Data Science start?
- Data Analysis Helped Defeat Cholera (https://www.udacity.com/blog/2015/02/how-data-analysis-helped-defeat-cholera.html)
- First Modern Weather Forecast (https://en.wikipedia.org/wiki/Weather_forecasting#Modern_methods)

Data Science: Understanding the Problem

Get the right question:
- What is the problem we try to solve?
- This forms the Data Science problem
- Examples
  - Sales figures and call center logs: evaluate a new product
  - Sensor data from multiple sensors: detect equipment failure
  - Customer data + marketing data: better targeted marketing

Assess situation
- Risks, Benefits, Contingencies, Regulations, Resources, Requirements

Define goal
- What is the objective?
- What are the success criteria?

Conclusion
- Defining the problem is key to successful Data Science projects

The Data Science Workflow

Example: Weather forecast

Step 1
- Problem: Predict weather tomorrow
- Data: Time series on Temperature, Air pressure, Humidity, Rain, Wind speed, Wind direction, etc.
- Import: Collect data from sources

Step 2
- Explore: Data quality
- Visualize: A great way to understand data
- Cleaning: Handle missing or faulty data

Step 3
- Features: Select features to use in model
- Model: Example in predicting rain/no rain (Tutorial + Video + Code + Project (https://www.learnpythonwithrune.org/linear-classifier-from-scratch-explained-on-real-project/))
- Analyze: Apply the prediction model

Step 4
- Present: Weather forecast
- Visualize: Charts, maps, etc.
- Credibility: Inaccurate results, too high confidence, not presenting full findings

Step 5
- Insights: What to wear, impact on outside events, etc.
- Impact: Sales and weather forecast (umbrella, ice cream, etc.)
- Main goal: This is what makes Data Science valuable

Data Scientist Skills

Beginners vs Experts Data Scientists


Hard skills
- Math and Statistics
- Programming
- Domain knowledge
- Data visualization

Soft skills
- Curiosity
- Communication
- Storytelling Skills
- Structured Thinking

Student Grade Prediction
Predict the final grade of Portuguese high school students (source: Kaggle (https://www.kaggle.com/dipam7/student-grade-prediction))

Features

| Feature | Description |
|---|---|
| school | student's school (binary: 'GP' - Gabriel Pereira or 'MS' - Mousinho da Silveira) |
| sex | student's sex (binary: 'F' - female or 'M' - male) |
| age | student's age (numeric: from 15 to 22) |
| address | student's home address type (binary: 'U' - urban or 'R' - rural) |
| famsize | family size (binary: 'LE3' - less or equal to 3 or 'GT3' - greater than 3) |
| Pstatus | parent's cohabitation status (binary: 'T' - living together or 'A' - apart) |
| Medu | mother's education (numeric: 0: none, 1: primary education (4th grade), 2: 5th to 9th grade, 3: secondary education or 4: higher education) |
| Fedu | father's education (numeric: 0: none, 1: primary education (4th grade), 2: 5th to 9th grade, 3: secondary education or 4: higher education) |
| Mjob | mother's job (nominal: 'teacher', 'health' care related, civil 'services' (e.g. administrative or police), 'at_home' or 'other') |
| Fjob | father's job (nominal: 'teacher', 'health' care related, civil 'services' (e.g. administrative or police), 'at_home' or 'other') |
| reason | reason to choose this school (nominal: close to 'home', school 'reputation', 'course' preference or 'other') |
| guardian | student's guardian (nominal: 'mother', 'father' or 'other') |
| traveltime | home to school travel time (numeric: 1: <15 min., 2: 15-30 min., 3: 30 min. - 1 hour, or 4: >1 hour) |
| studytime | weekly study time (numeric: 1: <2 hours, 2: 2 to 5 hours, 3: 5 to 10 hours, or 4: >10 hours) |
| failures | number of past class failures (numeric: n if 1<=n<3, else 4) |
| schoolsup | extra educational support (binary: yes or no) |
| famsup | family educational support (binary: yes or no) |
| paid | extra paid classes within the course subject (Math or Portuguese) (binary: yes or no) |
| activities | extra-curricular activities (binary: yes or no) |
| nursery | attended nursery school (binary: yes or no) |
| higher | wants to take higher education (binary: yes or no) |
| internet | Internet access at home (binary: yes or no) |
| romantic | with a romantic relationship (binary: yes or no) |
| famrel | quality of family relationships (numeric: from 1 - very bad to 5 - excellent) |
| freetime | free time after school (numeric: from 1 - very low to 5 - very high) |
| goout | going out with friends (numeric: from 1 - very low to 5 - very high) |
| Dalc | workday alcohol consumption (numeric: from 1 - very low to 5 - very high) |
| Walc | weekend alcohol consumption (numeric: from 1 - very low to 5 - very high) |
| health | current health status (numeric: from 1 - very bad to 5 - very good) |
| absences | number of school absences (numeric: from 0 to 93) |

Targets

| Feature | Description |
|---|---|
| G1 | first period grade (numeric: from 0 to 20) |
| G2 | second period grade (numeric: from 0 to 20) |
| G3 | final grade (numeric: from 0 to 20, output target) |

Problem: Propose activities to improve G3 grades.

Our Goal
To guide the school on how it can help students get higher grades

Programming Notes:
Libraries used
- pandas (https://pandas.pydata.org) - a data analysis and manipulation tool

Functionality and concepts used
- CSV (https://en.wikipedia.org/wiki/Comma-separated_values) file (Lecture on CSV (https://youtu.be/LEyojSOg4EI))
- read_csv() (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html) Read a comma-separated values (csv) file into a pandas DataFrame.
- corr() (https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.corr.html) Compute pairwise correlation of pandas DataFrame columns, excluding NA/null values.
- groupby() (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html) Group a pandas DataFrame by a column to apply a function to each group.
- mean() (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.mean.html) Return the mean of the values over the requested axis in a pandas DataFrame.
- std() (https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.std.html) Return sample standard deviation over requested axis of a pandas DataFrame.
- count() (https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.count.html) Count non-NA cells for each column or row of a pandas DataFrame.

Step 1: Acquire
- Explore problem
- Identify data
- Import data

Get the right questions
- This forms the data science problem
- What is the problem

Examples

| Data | Problem |
|---|---|
| Sales figures and call center logs | evaluate a new product |
| Sensor data from multiple sensors | detect equipment failure |
| Customer data & marketing data | better targeted marketing |

Understand context
- Student age?
- What is possible?
- What is the budget?

Data
Import and read the data

In [1]:
import pandas as pd

In [2]:
data = pd.read_csv('files/student-mat.csv')

In [3]:
len(data)

Out[3]:
395

In [4]:
data.head()

Out[4]:
  school sex  age address famsize Pstatus  Medu  Fedu     Mjob      Fjob  ...  famrel  fre
0     GP   F   18       U     GT3       A     4     4  at_home   teacher  ...       4
1     GP   F   17       U     GT3       T     1     1  at_home     other  ...       5
2     GP   F   15       U     LE3       T     1     1  at_home     other  ...       4
3     GP   F   15       U     GT3       T     4     2   health  services  ...       3
4     GP   F   16       U     GT3       T     3     3    other     other  ...       4

5 rows × 33 columns

In [5]:
data.columns

Out[5]:
Index(['school', 'sex', 'age', 'address', 'famsize', 'Pstatus', 'Medu', 'Fedu',
       'Mjob', 'Fjob', 'reason', 'guardian', 'traveltime', 'studytime',
       'failures', 'schoolsup', 'famsup', 'paid', 'activities', 'nursery',
       'higher', 'internet', 'romantic', 'famrel', 'freetime', 'goout', 'Dalc',
       'Walc', 'health', 'absences', 'G1', 'G2', 'G3'],
      dtype='object')

Step 2: Prepare
- Explore data
- Visualize ideas
- Cleaning data
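The Prepare activities can be sketched on a small made-up table. This is only an illustration: the `raw` DataFrame and its values are hypothetical and not taken from `student-mat.csv`, and filling gaps with the median or an `'unknown'` label is one possible choice among several.

```python
import pandas as pd

# Made-up records for illustration only; not taken from student-mat.csv.
raw = pd.DataFrame({
    "age": [15, 16, None, 18],
    "higher": ["yes", "no", "yes", None],
    "G3": [10, 6, 14, 11],
})

# Explore: count missing values per column.
missing_per_column = raw.isnull().sum()

# Clean, option 1: drop every row that has at least one missing value.
dropped = raw.dropna()

# Clean, option 2: fill numeric gaps with the column median and
# categorical gaps with an explicit 'unknown' label.
filled = raw.copy()
filled["age"] = filled["age"].fillna(filled["age"].median())
filled["higher"] = filled["higher"].fillna("unknown")

print(missing_per_column)
print(len(dropped), "rows remain after dropna()")
print(filled)
```

Dropping rows is the simplest option but discards data; filling keeps every row at the cost of introducing assumed values, so the choice depends on the problem defined in Step 1.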
Are the data types as expected?

In [6]:
data.dtypes

Out[6]:
school        object
sex           object
age            int64
address       object
famsize       object
Pstatus       object
Medu           int64
Fedu           int64
Mjob          object
Fjob          object
reason        object
guardian      object
traveltime     int64
studytime      int64
failures       int64
schoolsup     object
famsup        object
paid          object
activities    object
nursery       object
higher        object
internet      object
romantic      object
famrel         int64
freetime       int64
goout          int64
Dalc           int64
Walc           int64
health         int64
absences       int64
G1             int64
G2             int64
G3             int64
dtype: object

Are there missing values?

In [7]:
data.isnull().any()

Out[7]:
school        False
sex           False
age           False
address       False
famsize       False
Pstatus       False
Medu          False
Fedu          False
Mjob          False
Fjob          False
reason        False
guardian      False
traveltime    False
studytime     False
failures      False
schoolsup     False
famsup        False
paid          False
activities    False
nursery       False
higher        False
internet      False
romantic      False
famrel        False
freetime      False
goout         False
Dalc          False
Walc          False
health        False
absences      False
G1            False
G2            False
G3            False
dtype: bool

Step 3: Analyze
- Feature selection
- Model selection
- Analyze data*

Explore correlation

How to interpret correlation

In [10]:
# On pandas 2.0 or later, use data.corr(numeric_only=True)['G3'];
# older versions skip non-numeric columns automatically.
data.corr()['G3']

Out[10]:
age          -0.161579
Medu          0.217147
Fedu          0.152457
traveltime   -0.117142
studytime     0.097820
failures     -0.360415
famrel        0.051363
freetime      0.011307
goout        -0.132791
Dalc         -0.054660
Walc         -0.051939
health       -0.061335
absences      0.034247
G1            0.801468
G2            0.904868
G3            1.000000
Name: G3, dtype: float64

Notice that correlation is computed only for the numeric features.

In [11]:
data.columns

Out[11]:
Index(['school', 'sex', 'age', 'address', 'famsize', 'Pstatus', 'Medu', 'Fedu',
       'Mjob', 'Fjob', 'reason', 'guardian', 'traveltime', 'studytime',
       'failures', 'schoolsup', 'famsup', 'paid', 'activities', 'nursery',
       'higher', 'internet', 'romantic', 'famrel', 'freetime', 'goout', 'Dalc',
       'Walc', 'health', 'absences', 'G1', 'G2', 'G3'],
      dtype='object')

Find high impact features - but what can you do about it? Medu, Mjob

In [12]:
data.groupby('higher')['G3'].mean()

Out[12]:
higher
no      6.800
yes    10.608
Name: G3, dtype: float64

In [13]:
data.groupby('higher')['G3'].count()

Out[13]:
higher
no      20
yes    375
Name: G3, dtype: int64

Standard deviation

In [14]:
data.groupby('higher')['G3'].std()

Out[14]:
higher
no     4.829732
yes    4.493422
Name: G3, dtype: float64

Step 4: Report
- Present findings
- Visualize results
- Credibility counts
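As one way to present the correlation findings, the values printed in Out[10] can be ranked by absolute strength. The sketch below copies those values by hand; leaving out G1 and G2 (earlier grades, listed as targets) is an assumption made for this illustration, not a step from the lesson.

```python
# Correlations with G3, copied from Out[10] (G3 itself omitted).
corr_with_g3 = {
    "age": -0.161579, "Medu": 0.217147, "Fedu": 0.152457,
    "traveltime": -0.117142, "studytime": 0.097820, "failures": -0.360415,
    "famrel": 0.051363, "freetime": 0.011307, "goout": -0.132791,
    "Dalc": -0.054660, "Walc": -0.051939, "health": -0.061335,
    "absences": 0.034247, "G1": 0.801468, "G2": 0.904868,
}

# Assumption for this sketch: G1 and G2 are earlier grades rather than
# something the school can act on, so they are excluded from the ranking.
candidates = {k: v for k, v in corr_with_g3.items() if k not in ("G1", "G2")}

# Sort by absolute correlation, strongest first.
ranked = sorted(candidates.items(), key=lambda kv: abs(kv[1]), reverse=True)

for name, r in ranked[:5]:
    print(f"{name:<10} {r:+.3f}")
```

Ranking by absolute value keeps strong negative relationships such as `failures` visible next to positive ones such as `Medu`. Correlation alone does not show whether changing a feature would change G3.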

Credibility
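A rough credibility check can be made from the group statistics already printed in Out[12], Out[13] and Out[14]. The sketch below computes the standard error of each group mean (standard deviation divided by the square root of the group size); treating the two groups as independent samples is an assumption, and this is a back-of-the-envelope check rather than a formal test.

```python
import math

# Group statistics for G3, copied from Out[12], Out[13] and Out[14].
groups = {
    "no":  {"mean": 6.800,  "std": 4.829732, "n": 20},
    "yes": {"mean": 10.608, "std": 4.493422, "n": 375},
}

def standard_error(std, n):
    """Standard error of a sample mean: std / sqrt(n)."""
    return std / math.sqrt(n)

se = {k: standard_error(g["std"], g["n"]) for k, g in groups.items()}

# Difference between the group means and its standard error,
# assuming the two groups are independent samples.
diff = groups["yes"]["mean"] - groups["no"]["mean"]
se_diff = math.sqrt(se["no"] ** 2 + se["yes"] ** 2)

print(f"SE(no)  = {se['no']:.3f}")
print(f"SE(yes) = {se['yes']:.3f}")
print(f"difference = {diff:.3f} +/- {se_diff:.3f} (one standard error)")
```

The 'no' group has only 20 students, so its mean is much less certain than the mean of the 375 'yes' students, which is worth stating when the finding is reported.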

The impact we expect


In [15]:

data['G3'].mean()

Out[15]:

10.415189873417722

In [16]:

data.groupby('higher')['G3'].mean()

Out[16]:

higher
no 6.800
yes 10.608
Name: G3, dtype: float64

In [17]:

data.groupby('higher')['G3'].count()

Out[17]:

higher
no 20
yes 375
Name: G3, dtype: int64

The impact is not that large
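The size of the possible impact can be bounded with the figures from Out[15], Out[16] and Out[17]. The sketch below assumes the most optimistic outcome, that every 'no' student ends up with the current 'yes' average; that scenario is an assumption made for illustration and is not a result from the lesson.

```python
# Figures copied from Out[15], Out[16] and Out[17].
n_no, n_yes = 20, 375
mean_no, mean_yes = 6.800, 10.608

# The overall mean is the count-weighted average of the group means.
total = n_no + n_yes
current_mean = (n_no * mean_no + n_yes * mean_yes) / total

# Optimistic assumption: every 'no' student reaches the 'yes' average,
# so the overall mean becomes the 'yes' mean.
best_case_mean = mean_yes
gain = best_case_mean - current_mean

print(f"current overall mean : {current_mean:.3f}")
print(f"best-case mean       : {best_case_mean:.3f}")
print(f"maximum gain         : {gain:.3f} grade points")
```

Because only 20 of the 395 students answered 'no', even this best case moves the overall G3 average by less than a fifth of a grade point on the 0 to 20 scale.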

Step 5: Actions
Use insights
Measure impact
Main goal

Impact measures

Same data after effort


Measure whether fewer students answer 'no' to 'higher'
Did it influence G3?
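A minimal sketch of such an impact measurement, assuming the same columns are collected again after the effort. The `before` and `after` DataFrames below are made up for illustration, and `summarize` is a hypothetical helper, not part of the lesson or of pandas.

```python
import pandas as pd

def summarize(df):
    """Return the share answering 'no' to 'higher', the overall mean G3,
    and per-group G3 statistics."""
    share_no = (df["higher"] == "no").mean()
    per_group = df.groupby("higher")["G3"].agg(["mean", "std", "count"])
    return share_no, df["G3"].mean(), per_group

# Made-up 'before' and 'after' samples for illustration only.
before = pd.DataFrame({"higher": ["no", "no", "yes", "yes", "yes"],
                       "G3": [6, 8, 10, 11, 12]})
after = pd.DataFrame({"higher": ["no", "yes", "yes", "yes", "yes"],
                      "G3": [7, 9, 10, 11, 13]})

share_before, g3_before, _ = summarize(before)
share_after, g3_after, groups_after = summarize(after)

print(f"share answering 'no': {share_before:.2f} -> {share_after:.2f}")
print(f"mean G3: {g3_before:.2f} -> {g3_after:.2f}")
print(groups_after)
```

Comparing both the share of 'no' answers and the G3 averages separates the two questions: whether the effort changed attitudes, and whether grades moved with them.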

In [ ]:
