Dad Regression Using Python Statsmodel Formula PDF
Kumar Rahul
We will be using the DAD hospital data in this exercise. Refer to Exhibit 1 to understand the feature list. Use the DAD hospital data to answer the questions below.
1. Load the dataset in Jupyter Notebook using pandas.
2. Build a correlation matrix between all the numeric features in the dataset. Report the features that are correlated at a cut-off of 0.70. What actions will you take on the features that are highly correlated?
3. Build a new feature named BMI using body height and body weight. Include this as a part of the data frame created in step 1.
4. Past medical history code has 175 instances of missing value (NaN). Impute ‘None’ as a label wherever the value is NaN for this feature.
5. Create a new data frame with the numeric features and categorical features as dummy variable coded features. Which features will you include for model building and why?
6. Split the data into a training set and a test set. Use 80% of the data for model training and 20% for model testing.
7. Build a model using age as the independent variable and cost of treatment as the dependent variable.
8. Build a model with statsmodels.api to estimate the total cost to hospital. How do you interpret the model outcome? Report the model performance on the test set.
9. Build a model with statsmodels.formula.api to estimate the total cost to hospital and report the model performance on the test set. What difference do you observe between the model built here and the one built in step 8?
10. Build a model using the sklearn package to estimate the total cost to hospital. What difference do you observe in this model compared to the models built in steps 8 and 9?
11. Build a model using lasso, ridge and elastic net regression. What differences do you observe?
12. Build a model using gradient descent to get an intuition about the inner working of optimization algorithms.
13. Build a model using gradient descent with regularization to get an intuition about the inner working of optimization algorithms.
PS: Not all the questions are answered in this notebook. You are encouraged to answer the questions you find missing.
Exhibit 1
No.  Feature                      Description
13   Past Medical History Code    Code given to the past medical history of the patient
17   Key Complaints Code          Codes given to the key complaints faced by the patient
23   Cost of Implant              Total cost of all the implants done on the patient, if any
In [1]: import sys
sys.executable
Out[1]: '/Users/Rahul/anaconda3/bin/python'
import warnings
warnings.filterwarnings("ignore")
We use the libraries listed below for data import, processing and visualization. As we progress, we will use other specific libraries for model building and evaluation.
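The import cell itself did not survive in this copy. A minimal sketch of what it might look like follows; the aliases pd and np are inferred from the pd. and np. calls that appear later in the notebook, and the plotting library is left out because it is not visible anywhere in the text.

```python
# Assumed import cell: only the aliases that later cells actually use
import numpy as np   # vectorized operations such as np.where
import pandas as pd  # data import and data-frame processing

print(pd.__version__, np.__version__)
```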
Modify the ast_node_interactivity kernel option to see the value of multiple statements at once.
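One common way to set this option is through the IPython InteractiveShell class, as sketched below. The try/except guard is an addition for illustration so that the snippet also runs where IPython is not installed; inside a notebook only the two lines in the try body are needed.

```python
# Show the value of every top-level expression in a cell, not only the last one
try:
    from IPython.core.interactiveshell import InteractiveShell
    InteractiveShell.ast_node_interactivity = "all"
    configured = True
except ImportError:
    # Plain Python without IPython: nothing to configure
    configured = False

print("option set:", configured)
```

With the option set to "all", a cell such as the In [5] cell below displays the result of each statement that returns a value.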
In [5]: pd.options.display.max_columns
pd.set_option('display.max_columns', None)
Out[5]: 20
pandas resolves a relative file path against the current working directory, which for a notebook is usually the folder the notebook is in. To reach data stored elsewhere, you can move up from the current directory with '..'.
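The path handling can be sketched as follows. The folder layout and the file name dad_sample.csv are hypothetical stand-ins (the real data file name is not shown in this copy), and the temporary folders exist only to make the example self-contained.

```python
import os
import tempfile
import pandas as pd

# Build a tiny stand-in CSV in a temporary "data" folder (hypothetical names)
root = tempfile.mkdtemp()
data_dir = os.path.join(root, "data")
work_dir = os.path.join(root, "notebooks")
os.makedirs(data_dir)
os.makedirs(work_dir)
pd.DataFrame({"age": [10.0, 58.0], "gender": ["M", "F"]}).to_csv(
    os.path.join(data_dir, "dad_sample.csv"), index=False)

# Pretend the notebook runs inside "notebooks"; '..' climbs to the parent folder
original_cwd = os.getcwd()
os.chdir(work_dir)
try:
    raw_df = pd.read_csv(os.path.join("..", "data", "dad_sample.csv"))
finally:
    os.chdir(original_cwd)

print(raw_df.shape)
```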
Out[6]: [Table output; column header: sl no, age, gender, marital_status, key_complaints__code, body_weight, body_height, hr_pulse, bp__high, bp_low, rr, past_medical_history_code, hb, urea, ... (header truncated)]
In [7]: #?pd.read_csv
Drop the SL No column, as it will not be used for any analysis or model building.
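A minimal sketch of the drop step is shown below. The column name sl_no is an assumption (the exact name is garbled in the output above), and the three-row frame is made-up stand-in data.

```python
import pandas as pd

# Stand-in for the raw data; "sl_no" is an assumed name for the serial-number column
raw_df = pd.DataFrame({
    "sl_no": [1, 2, 3],
    "age": [10.0, 58.0, 3.0],
    "gender": ["M", "F", "M"],
})

# drop returns a new frame, so reassign it to keep the result
raw_df = raw_df.drop(columns=["sl_no"])
print(list(raw_df.columns))
```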
In [8]: #?raw_df.drop()
raw_df.head()
Out[9]: [Table output; column header: age, gender, marital_status, key_complaints__code, body_weight, body_height, hr_pulse, bp__high, bp_low, rr, past_medical_history_code, hb, urea, ... (header truncated)]
In [10]: raw_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 163 entries, 0 to 162
Data columns (total 26 columns):
age 163 non-null float64
gender 163 non-null object
marital_status 163 non-null object
key_complaints__code 163 non-null object
body_weight 163 non-null int64
body_height 163 non-null int64
hr_pulse 163 non-null int64
bp__high 163 non-null int64
bp_low 163 non-null int64
rr 163 non-null int64
past_medical_history_code 163 non-null object
hb 163 non-null int64
urea 163 non-null int64
creatinine 163 non-null float64
mode_of_arrival 163 non-null object
state_at_the_time_of_arrival 163 non-null object
type_of_admsn 163 non-null object
total_cost_to_hospital 163 non-null float64
total_amount_billed_to_the_patient 163 non-null int64
concession 163 non-null int64
actual_receivable_amount 163 non-null int64
total_length_of_stay 163 non-null int64
length_of_stay___icu 163 non-null int64
length_of_stay__ward 163 non-null int64
implant_used 163 non-null object
cost_of_implant 163 non-null int64
dtypes: float64(3), int64(15), object(8)
memory usage: 33.2+ KB
In [11]: raw_df.describe(include='all').transpose()
#raw_df.describe().transpose()
Out[11]:
count unique top freq mean std min 25% 50% 75% max
gender 163 2 M 110 NaN NaN NaN NaN NaN NaN NaN
marital_status 163 2 UNMARRIED 85 NaN NaN NaN NaN NaN NaN NaN
key_complaints__code 163 13 other- heart 42 NaN NaN NaN NaN NaN NaN NaN
body_height 163 NaN NaN NaN 133.607 38.1152 19 110.5 151 162 185
bp__high 163 NaN NaN NaN 113.767 23.228 70 100 110 130 215
past_medical_history_code 163 7 None 105 NaN NaN NaN NaN NaN NaN NaN
creatinine 163 NaN NaN NaN 0.718405 0.461912 0.1 0.3 0.6 1 2.5
mode_of_arrival 163 3 WALKED IN 138 NaN NaN NaN NaN NaN NaN NaN
state_at_the_time_of_arrival 163 1 ALERT 163 NaN NaN NaN NaN NaN NaN NaN
type_of_admsn 163 2 ELECTIVE 138 NaN NaN NaN NaN NaN NaN NaN
total_cost_to_hospital 163 NaN NaN NaN 213398 136952 46093 137288 169951 247792 887350
total_amount_billed_to_the_patient 163 NaN NaN NaN 191525 116874 43641 150000 150000 209675 944819
concession 163 NaN NaN NaN 16100.8 20383.7 0 0 2196 37500 123132
actual_receivable_amount 163 NaN NaN NaN 178482 123157 31000 112500 137000 203722 848397
implant_used 163 2 N 126 NaN NaN NaN NaN NaN NaN NaN
Get the numeric features from the data and find the correlation among them.
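The cells that produced the outputs below are missing from this copy. One way to obtain such lists and the correlation matrix is sketched here; the names numerical_features_set and categorical_features_set are taken from a later cell, while the use of select_dtypes and the toy data are assumptions.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the hospital data
filter_df = pd.DataFrame({
    "age": [10.0, 58.0, 3.0, 35.0],
    "body_weight": [20, 60, 12, 50],
    "gender": ["M", "F", "M", "F"],
})

# Numeric columns by dtype; everything else is treated as categorical
numerical_features_set = list(filter_df.select_dtypes(include=[np.number]).columns)
categorical_features_set = list(filter_df.select_dtypes(exclude=[np.number]).columns)

# Pairwise Pearson correlation between the numeric columns
corr_df = filter_df[numerical_features_set].corr()
print(numerical_features_set)
print(corr_df)
```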
Out[12]: ['age',
'body_weight',
'body_height',
'hr_pulse',
'bp__high',
'bp_low',
'rr',
'hb',
'urea',
'creatinine',
'total_cost_to_hospital',
'total_amount_billed_to_the_patient',
'concession',
'actual_receivable_amount',
'total_length_of_stay',
'length_of_stay___icu',
'length_of_stay__ward',
'cost_of_implant']
Out[13]:
age body_weight body_height hr_pulse bp__high bp_low rr hb urea creatinine total_cost_to
age 1.000000 0.843398 0.722565 -0.451244 0.586568 0.465456 -0.234808 -0.218499 0.285690 0.708491
body_weight 0.843398 1.000000 0.846622 -0.534041 0.593387 0.482086 -0.307728 -0.147971 0.222444 0.714043
body_height 0.722565 0.846622 1.000000 -0.484088 0.511962 0.434896 -0.295007 -0.062932 0.225537 0.620786
hr_pulse -0.451244 -0.534041 -0.484088 1.000000 -0.291634 -0.207449 0.373234 0.099655 -0.024116 -0.334538
bp__high 0.586568 0.593387 0.511962 -0.291634 1.000000 0.772989 -0.083097 -0.083930 0.096395 0.443001
bp_low 0.465456 0.482086 0.434896 -0.207449 0.772989 1.000000 -0.015695 0.034689 0.043500 0.319224
rr -0.234808 -0.307728 -0.295007 0.373234 -0.083097 -0.015695 1.000000 0.035520 0.063190 -0.158310
hb -0.218499 -0.147971 -0.062932 0.099655 -0.083930 0.034689 0.035520 1.000000 -0.096701 -0.227718
urea 0.285690 0.222444 0.225537 -0.024116 0.096395 0.043500 0.063190 -0.096701 1.000000 0.639180
creatinine 0.708491 0.714043 0.620786 -0.334538 0.443001 0.319224 -0.158310 -0.227718 0.639180 1.000000
total_cost_to_hospital 0.499186 0.449536 0.390078 -0.060195 0.217561 0.211650 0.045726 -0.094229 0.280680 0.516058
total_amount_billed_to_the_patient 0.499330 0.446368 0.418448 -0.057116 0.226300 0.199455 0.069940 -0.101410 0.283243 0.499464
concession -0.387066 -0.429783 -0.309462 0.199744 -0.294828 -0.265444 0.195671 0.173086 -0.073098 -0.274000
actual_receivable_amount 0.549550 0.524614 0.473479 -0.103888 0.281007 0.262555 0.039106 -0.118508 0.283019 0.523746
total_length_of_stay 0.345171 0.178323 0.114701 0.009433 0.121619 0.107979 0.170249 -0.024840 0.236011 0.354600
length_of_stay___icu 0.494728 0.382562 0.277546 -0.080921 0.189863 0.141541 0.051388 -0.131131 0.254400 0.486857
length_of_stay__ward -0.013214 -0.133412 -0.116847 0.097868 -0.025814 0.007834 0.195577 0.104414 0.083921 0.016657
cost_of_implant 0.148869 0.277878 0.299271 -0.044194 -0.016220 0.061073 0.051949 -0.070642 0.247417 0.198562
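In the visible part of the matrix above, several pairs exceed the 0.70 cut-off, for example age and body_weight (0.843398), body_weight and body_height (0.846622), and bp__high and bp_low (0.772989). Picking such pairs out by eye is error-prone, so the sketch below shows one possible way to list them programmatically. The toy data and the variable names are illustrative assumptions, not the notebook's own code.

```python
import numpy as np
import pandas as pd

# Toy data: body_weight tracks age closely, hb does not
df = pd.DataFrame({
    "age": [1, 2, 3, 4, 5],
    "body_weight": [2, 4, 6, 8, 11],
    "hb": [5, 1, 4, 2, 3],
})
corr_df = df.corr()
cutoff = 0.70

# Keep only the upper triangle so each pair appears once and the diagonal is dropped
upper = corr_df.where(np.triu(np.ones(corr_df.shape, dtype=bool), k=1))

# Long format: one row per (feature_1, feature_2) pair
pairs = upper.stack()

# Masked-out cells are NaN and fail the comparison, so they are filtered out here
high_corr = pairs[pairs.abs() >= cutoff]
print(high_corr)
```

A typical action on such pairs is to keep one feature from each pair, or to combine them into a derived feature, before model building.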
Out[14]: ['gender',
'marital_status',
'key_complaints__code',
'past_medical_history_code',
'mode_of_arrival',
'state_at_the_time_of_arrival',
'type_of_admsn',
'implant_used']
In [15]: list(filter_df.columns)
Out[15]: ['age',
'gender',
'marital_status',
'key_complaints__code',
'body_weight',
'body_height',
'hr_pulse',
'bp__high',
'bp_low',
'rr',
'past_medical_history_code',
'hb',
'urea',
'creatinine',
'mode_of_arrival',
'state_at_the_time_of_arrival',
'type_of_admsn',
'total_cost_to_hospital',
'total_amount_billed_to_the_patient',
'concession',
'actual_receivable_amount',
'total_length_of_stay',
'length_of_stay___icu',
'length_of_stay__ward',
'implant_used',
'cost_of_implant']
In [16]: filter_df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 163 entries, 0 to 162
Data columns (total 26 columns):
age 163 non-null float64
gender 163 non-null object
marital_status 163 non-null object
key_complaints__code 163 non-null object
body_weight 163 non-null int64
body_height 163 non-null int64
hr_pulse 163 non-null int64
bp__high 163 non-null int64
bp_low 163 non-null int64
rr 163 non-null int64
past_medical_history_code 163 non-null object
hb 163 non-null int64
urea 163 non-null int64
creatinine 163 non-null float64
mode_of_arrival 163 non-null object
state_at_the_time_of_arrival 163 non-null object
type_of_admsn 163 non-null object
total_cost_to_hospital 163 non-null float64
total_amount_billed_to_the_patient 163 non-null int64
concession 163 non-null int64
actual_receivable_amount 163 non-null int64
total_length_of_stay 163 non-null int64
length_of_stay___icu 163 non-null int64
length_of_stay__ward 163 non-null int64
implant_used 163 non-null object
cost_of_implant 163 non-null int64
dtypes: float64(3), int64(15), object(8)
memory usage: 34.4+ KB
The unique labels in key_complaints__code is ['other- heart' 'CAD-DVD' 'CAD-TVD' 'RHD' 'CAD-SVD' 'other- respiratory'
'ACHD' 'other-tertalogy' 'other-nervous' 'PM-VSD' 'OS-ASD' 'CAD-VSD'
'other-general']
In [18]: # Merge the hypertension sub-codes into one 'hypertension' label
filter_df['past_medical_history_code'] = np.where(
    filter_df['past_medical_history_code'].isin(
        ['hypertension1', 'hypertension2', 'hypertension3']),
    'hypertension', filter_df['past_medical_history_code'])
# Merge the diabetes sub-codes into one 'diabetes' label
filter_df['past_medical_history_code'] = np.where(
    filter_df['past_medical_history_code'].isin(['Diabetes1', 'Diabetes2']),
    'diabetes', filter_df['past_medical_history_code'])
# Pool the listed key-complaint codes into a single 'others' label
filter_df['key_complaints__code'] = np.where(
    filter_df['key_complaints__code'].isin(
        ['other- respiratory', 'PM-VSD', 'CAD-SVD', 'CAD-VSD',
         'other-nervous', 'other-general']),
    'others', filter_df['key_complaints__code'])
#filter_df.past_medical_history_code.value_counts()
We will use the pandas groupby function to summarize the numeric features within each level of a categorical feature.
Call the function defined above to group the numeric values by gender and marital_status.
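The helper function itself is not preserved in this copy. A sketch of what such a groupby-based summary could look like is given below; the function name group_summary, its parameters, and the toy data are hypothetical.

```python
import pandas as pd

def group_summary(df, by, numeric_cols, stat="mean"):
    """Aggregate the numeric columns of df within each level of one categorical column."""
    return df.groupby(by)[numeric_cols].agg(stat)

# Toy stand-in for the hospital data
filter_df = pd.DataFrame({
    "gender": ["F", "M", "F", "M"],
    "age": [20.0, 30.0, 30.0, 40.0],
    "hb": [12, 13, 14, 15],
})

mean_by_gender = group_summary(filter_df, "gender", ["age", "hb"], "mean")
std_by_gender = group_summary(filter_df, "gender", ["age", "hb"], "std")
print(mean_by_gender)
print(std_by_gender)
```

Calling the helper once with "std" and once with "mean" gives two tables indexed by the categorical levels, similar in shape to the gender tables shown below.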
Out[20]:
age body_weight body_height hr_pulse bp__high bp_low rr hb urea creatinine total_cost_to_hospital total_amount_b
gender
F 23.770187 18.770461 39.486444 19.231781 21.427181 15.639002 2.957242 3.479256 21.327497 0.387486 103066.628024
M 27.320604 23.770195 36.038124 19.539624 23.704886 15.520658 4.119321 2.916763 16.152084 0.479554 149136.095813
Out[20]:
age body_weight body_height hr_pulse bp__high bp_low rr hb urea creatinine total_cost_to_hospital total_amount
gender
F 24.45283 31.301887 120.886792 92.150943 108.377358 67.867925 23.283019 13.169811 29.056604 0.571698 185361.458113
M 35.05300 43.527273 139.736364 90.318182 116.363636 73.300000 23.200000 13.227273 28.190909 0.789091 226907.024273
Calculating BMI
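The BMI cell is missing here. A minimal sketch follows, assuming body_weight is in kilograms and body_height is in centimetres (the summary statistics above are consistent with this, but the units are not stated). The column name bmi matches the one that appears in the later new_df.info() output; the two rows of data are made up.

```python
import pandas as pd

# Toy rows; units assumed to be kg and cm
filter_df = pd.DataFrame({"body_weight": [49, 60], "body_height": [160, 170]})

# BMI = weight (kg) / height (m) squared, so convert cm to m first
height_m = filter_df["body_height"] / 100.0
filter_df["bmi"] = filter_df["body_weight"] / height_m ** 2
print(filter_df)
```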
for c in categorical_features_set:
    for n in numerical_features_set:
        hist_plot(filter_df[n], filter_df[c], n, c)
In [25]: #dir(smf)
new_df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 163 entries, 0 to 162
Data columns (total 17 columns):
age 163 non-null float64
gender 163 non-null object
marital_status 163 non-null object
key_complaints__code 163 non-null object
hr_pulse 163 non-null int64
bp__high 163 non-null int64
bp_low 163 non-null int64
rr 163 non-null int64
past_medical_history_code 163 non-null object
hb 163 non-null int64
urea 163 non-null int64
mode_of_arrival 163 non-null object
type_of_admsn 163 non-null object
total_cost_to_hospital 163 non-null float64
implant_used 163 non-null object
cost_of_implant 163 non-null int64
bmi 163 non-null float64
dtypes: float64(3), int64(7), object(7)
memory usage: 22.9+ KB
Write the formula with the set of variables to be used in model building. The formula takes the form Y ~ X, with the dependent variable on the left of the tilde and the independent variables on the right.
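The formula and fitting cells are not preserved, so the following is only a sketch on synthetic data with two predictors. The names train_df and regression_model and the C(...) wrapper for categorical features are taken from the outputs further below; the 80/20 split via sample, the coefficients used to generate the data, and the reduced formula are assumptions.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in data with a known linear relationship
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "age": rng.uniform(1, 80, n),
    "implant_used": rng.choice(["N", "Y"], n),
})
df["total_cost_to_hospital"] = (
    50000 + 2000 * df["age"]
    + 40000 * (df["implant_used"] == "Y")
    + rng.normal(0, 5000, n)
)

# 80% of the rows for training, the remaining 20% for testing
train_df = df.sample(frac=0.8, random_state=42)
test_df = df.drop(train_df.index)

# Y ~ X: C(...) marks a feature as categorical so it is dummy coded automatically
formula = "total_cost_to_hospital ~ age + C(implant_used)"
regression_model = smf.ols(formula=formula, data=train_df).fit()
print(regression_model.params)
```

The fitted parameter labels follow the same pattern as the list in Out[32], for example C(implant_used)[T.Y] for the dummy variable that contrasts level Y against the reference level N.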
Out[30]:
OLS Regression Results
Df Model: 25
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 5.83e+05. This might indicate that there are
strong multicollinearity or other numerical problems.
Out[32]: ['C(key_complaints__code)[T.CAD-DVD]',
'C(key_complaints__code)[T.CAD-TVD]',
'C(implant_used)[T.Y]',
'age']
Model Evaluation
For the model where dummy variable coding is done explicitly, we need to add the constant term to the test set. For the model where dummy variable coding is carried out automatically, there is no need to add the constant term to the test set.
Here is the output for the model with no explicit dummy variable coding:
predict_train_df = regression_model.get_prediction(train_df)
predict_train_df.predicted_mean[0:5]
Out[33]: 84 135892.598214
2 333510.462816
94 183522.853799
45 216244.232684
42 268984.100132
dtype: float64
Out[38]:
actual predicted
55 220519.0 255575.535725
95 140545.0 126739.185201
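Model performance on the test set can be summarized from such an actual versus predicted table. The sketch below uses only the two rows visible above (the remaining test rows are not shown in this copy), so the resulting numbers illustrate the calculation rather than the model's real test performance. The metric choice (RMSE, MAE, MAPE) and the name compare_df are assumptions.

```python
import numpy as np
import pandas as pd

# The two rows visible in Out[38]; the rest of the test set is not shown
compare_df = pd.DataFrame(
    {"actual": [220519.0, 140545.0],
     "predicted": [255575.535725, 126739.185201]},
    index=[55, 95],
)

residuals = compare_df["actual"] - compare_df["predicted"]

rmse = float(np.sqrt((residuals ** 2).mean()))                       # root mean squared error
mae = float(residuals.abs().mean())                                  # mean absolute error
mape = float((residuals.abs() / compare_df["actual"]).mean() * 100)  # mean absolute % error

print(f"RMSE={rmse:.2f}  MAE={mae:.2f}  MAPE={mape:.2f}%")
```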