00 - Lesson - Data Science Workflow - Jupyter Notebook
age student's age (numeric: from 15 to 22)
address student's home address type (binary: 'U' - urban or 'R' - rural)
famsize family size (binary: 'LE3' - less or equal to 3 or 'GT3' - greater than 3)
Pstatus parent's cohabitation status (binary: 'T' - living together or 'A' - apart)
Medu mother's education (numeric: 0: none, 1: primary education (4th grade), 2: 5th to 9th grade, 3: secondary education or 4: higher education)
Fedu father's education (numeric: 0: none, 1: primary education (4th grade), 2: 5th to 9th grade, 3: secondary education or 4: higher education)
Mjob mother's job (nominal: 'teacher', 'health' care related, civil 'services' (e.g. administrative or police), 'at_home' or 'other')
Fjob father's job (nominal: 'teacher', 'health' care related, civil 'services' (e.g. administrative or police), 'at_home' or 'other')
reason reason to choose this school (nominal: close to 'home', school 'reputation', 'course' preference or 'other')
traveltime home to school travel time (numeric: 1: <15 min., 2: 15-30 min., 3: 30 min. - 1 hour, or 4: >1 hour)
studytime weekly study time (numeric: 1: <2 hours, 2: 2 to 5 hours, 3: 5 to 10 hours, or 4: >10 hours)
paid extra paid classes within the course subject (Math or Portuguese) (binary: yes or no)
Walc weekend alcohol consumption (numeric: from 1 - very low to 5 - very high)
health current health status (numeric: from 1 - very bad to 5 - very good)
Structured Thinking
Problem: Propose activities to improve G3 grades.
Our Goal
To guide the school on how to help students achieve higher grades.
Programming Notes:
Libraries used
pandas (https://pandas.pydata.org) - a data analysis and manipulation tool
Functionality and concepts used
CSV (https://en.wikipedia.org/wiki/Comma-separated_values) file (Lecture on CSV (https://youtu.be/LEyojSOg4EI))
read_csv() (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html) Read a comma-separated values (csv) file into pandas DataFrame.
corr() (https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.corr.html) Compute pairwise correlation of pandas DataFrame columns, excluding NA/null values.
groupby() (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html) Grouping a pandas DataFrame column to apply a function on it.
mean() (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.mean.html) Return the mean of the values over the requested axis in a pandas DataFrame.
std() (https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.std.html) Return sample standard deviation over requested axis of a pandas DataFrame.
count() (https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.count.html) Count non-NA cells for each column or row of a pandas DataFrame.
Step 1: Acquire
Explore problem
Identify data
Import data
Examples
Data                                      Problem
Sales figure and call center logs         evaluate a new product
Data
Import and read the data
In [1]:
import pandas as pd
In [2]:
data = pd.read_csv('files/student-mat.csv')
In [3]:
len(data)
Out[3]:
395
In [4]:
data.head()
Out[4]:
   school sex age address famsize Pstatus Medu Fedu Mjob    Fjob    ... famrel fre
0  GP     F   18  U       GT3     A       4    4    at_home teacher ... 4
1  GP     F   17  U       GT3     T       1    1    at_home other   ... 5
2  GP     F   15  U       LE3     T       1    1    at_home other   ... 4
In [5]:
data.columns
Step 2: Prepare
Explore data
Visualize ideas
Cleaning data
Explore correlation
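A minimal sketch of exploring correlation with the target column, using corr() as listed in the Programming Notes. The toy DataFrame and its values are hypothetical stand-ins for student-mat.csv (only the column names come from the dataset), and the numeric_only argument assumes pandas 1.5 or newer.

```python
import pandas as pd

# Hypothetical toy data (made-up values) standing in for student-mat.csv.
toy = pd.DataFrame({
    "studytime": [1, 2, 2, 3, 4, 1],
    "failures": [3, 1, 0, 0, 0, 2],
    "higher": ["no", "yes", "yes", "yes", "yes", "no"],
    "G3": [5, 10, 12, 14, 17, 6],
})

# numeric_only=True restricts the correlation to numeric columns,
# so the text column 'higher' is skipped.
corr_with_g3 = toy.corr(numeric_only=True)["G3"].drop("G3")

# Rank features by the absolute strength of their correlation with G3.
ranked = corr_with_g3.sort_values(key=lambda s: s.abs(), ascending=False)
print(ranked)
```

A strong correlation only flags a feature worth a closer look; it does not show that changing the feature would change G3.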
In [10]:
# numeric_only=True is needed in pandas 2.0+, where corr() raises an error on text columns
data.corr(numeric_only=True)['G3']
Out[10]:
In [11]:
data.columns
Out[11]:
Index(['school', 'sex', 'age', 'address', 'famsize', 'Pstatus', 'Medu', 'Fedu',
       'Mjob', 'Fjob', 'reason', 'guardian', 'traveltime', 'studytime',
       'failures', 'schoolsup', 'famsup', 'paid', 'activities', 'nursery',
       'higher', 'internet', 'romantic', 'famrel', 'freetime', 'goout', 'Dalc',
       'Walc', 'health', 'absences', 'G1', 'G2', 'G3'],
      dtype='object')
In [12]:
data.groupby('higher')['G3'].mean()
Out[12]:
higher
no 6.800
yes 10.608
Name: G3, dtype: float64
In [13]:
data.groupby('higher')['G3'].count()
In [14]:
data.groupby('higher')['G3'].std()
Out[14]:
higher
no 4.829732
yes 4.493422
Name: G3, dtype: float64
Step 4: Report
Present findings
Visualize results
Find high-impact features, but ask what the school can actually do about them (e.g. Medu, Mjob)
Credibility counts
Credibility
In [15]:
data['G3'].mean()
Out[15]:
10.415189873417722
In [16]:
data.groupby('higher')['G3'].mean()
Out[16]:
higher
no 6.800
yes 10.608
Name: G3, dtype: float64
In [17]:
data.groupby('higher')['G3'].count()
Out[17]:
higher
no 20
yes 375
Name: G3, dtype: int64
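The credibility check above (group mean next to group size) can be sketched in a single call. This is an illustrative variant, not the notebook's own code: the toy DataFrame is hypothetical, and the standard error column is an added assumption about how one might quantify how far to trust each group mean.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (made-up values), not the real student-mat.csv.
toy = pd.DataFrame({
    "higher": ["no", "no", "yes", "yes", "yes", "yes"],
    "G3": [6, 8, 10, 12, 9, 13],
})

# Mean, sample standard deviation, and group size in one table.
summary = toy.groupby("higher")["G3"].agg(["mean", "std", "count"])

# Standard error of the mean: small groups give less reliable means.
summary["sem"] = summary["std"] / np.sqrt(summary["count"])
print(summary)
```

Applied to the figures shown in this lesson (std 4.829732 with 20 students for 'no', std 4.493422 with 375 students for 'yes'), the same formula gives a standard error of roughly 1.08 for 'no' and 0.23 for 'yes', so the mean of the small 'no' group is considerably less certain.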
Step 5: Actions
Use insights
Measure impact
Main goal
Impact measures