Dad Regression Using Python Statsmodel Formula PDF

The document discusses building regression models to predict the total cost of treatment using patient data from a hospital. It describes loading and preprocessing the data, including handling missing values, creating new features, and splitting the data into training and test sets. Various regression models are built using statsmodels, Scikit-learn, and gradient descent to estimate treatment costs and compare their performance on the test set. Key steps include adding a BMI feature, imputing missing values, and using different regression techniques like lasso, ridge, and elastic net.

Uploaded by

Shivesh Kumar

OLS Regression using Python (statsmodels.formula.api):

Kumar Rahul
We will be using the DAD hospital data in this exercise. Refer to Exhibit 1 for the feature list. Use the DAD hospital data to answer the questions below.
1. Load the dataset in Jupyter Notebook using pandas
2. Build a correlation matrix between all the numeric features in the dataset. Report the features that are correlated at a cut-off of 0.70. What actions will you take on the highly correlated features?
3. Build a new feature named BMI using body height and body weight. Include this as a part of the data frame created in step 1.
4. Past medical history code has 175 instances of missing value (NaN). Impute ‘None’ as a label wherever the value is NaN for this feature.
5. Create a new data frame with the numeric features and categorical features as dummy variable coded features. Which features will you include for model
building and why?
6. Split the data into training set and test set. Use 80% of data for model training and 20% for model testing.
7. Build a model using age as independent variable and cost of treatment as dependent variable.

Is age a significant feature in this model?


What inferences can be drawn from this model?

8. Build a model with statsmodels.api to estimate the total cost to hospital. How do you interpret the model outcome? Report the model performance on the test set.
9. Build a model with statsmodels.formula.api to estimate the total cost to hospital and report the model performance on the test set. What difference do you observe between the model built here and the one built in step 8?
10. Build a model using the sklearn package to estimate the total cost to hospital. What difference do you observe in this model compared to the models built in steps 8 and 9?
11. Build a model using lasso, ridge and elastic net regression. What differences do you observe?
12. Build a model using gradient descent to get an intuition about the inner workings of optimization algorithms.
13. Build a model using gradient descent with regularization to get an intuition about the inner workings of optimization algorithms.

PS: Not all the questions are answered in this notebook. You are encouraged to answer the ones you find missing.

Exhibit 1

Sl.No.  Variable                       Description
1       Age                            Age of the patient in years
2       Body Weight                    Weight of the patient in Kilograms
3       Body Height                    Height of the patient in cm
4       HR Pulse                       Pulse of patient at the time of admission
5       BP-High                        High BP of patient (Systolic)
6       BP-Low                         Low BP of patient (Diastolic)
7       RR                             Respiratory rate of patient
8       HB                             Hemoglobin count of patient
9       Urea                           Urea levels of patient
10      Creatinine                     Creatinine levels of patient
11      Marital Status                 Marital status of the patient
12      Gender                         Gender code for patient
13      Past Medical History Code      Code given to the past medical history of the patient
14      Mode of Arrival                Way in which the patient arrived at the hospital
15      State at the Time of Arrival   State in which the patient arrived
16      Type of Admission              Type of admission for the patient
17      Key Complaints Code            Codes given to the key complaints faced by the patient
18      Total Cost to Hospital         Actual cost incurred by the hospital
19      Total Length of Stay           Number of days patient stayed in the hospital
20      Length of Stay - ICU           Number of days patient stayed in the ICU
21      Length of Stay - Ward          Number of days patient stayed in the ward
22      Implant used (Y/N)             Any implant done on the patient
23      Cost of Implant                Total cost of all the implants done on the patient, if any

Code starts here


Check which Python environment the kernel is running in

In [1]: import sys, os

sys.executable

Out[1]: '/Users/Rahul/anaconda3/bin/python'

Suppress the warnings

In [2]: import warnings

warnings.filterwarnings("ignore")

We are going to use the libraries below for data import, processing and visualization. As we progress, we will use other, more specific libraries for model building and evaluation.

In [3]: import pandas as pd
import numpy as np
import seaborn as sn  # visualization library based on matplotlib
import matplotlib.pyplot as plt

# display the output of plotting commands inline within the Jupyter notebook
%matplotlib inline

Data Import and Manipulation

1. Importing a data set

Modify the ast_node_interactivity option of the interactive shell to display the value of every statement in a cell, not just the last one.

In [4]: from IPython.core.interactiveshell import InteractiveShell


InteractiveShell.ast_node_interactivity = "all"

Change the display settings for columns

In [5]: pd.options.display.max_columns
pd.set_option('display.max_columns', None)

Out[5]: 20

pandas resolves a relative path against the current working directory (for a notebook, normally the folder that contains the notebook file). You can therefore move from the current directory to where your data is located with '..'

The single period . means the current working directory

The double period .. means the parent of the current working directory
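As a minimal sketch of how '.' and '..' resolve, the standard-library pathlib module can print the absolute locations they point to. The folder names in data_path simply mirror the relative path used in the next cell; whether that folder exists on your machine is an assumption.

```python
from pathlib import Path

# Relative paths are resolved against the current working directory.
cwd = Path.cwd().resolve()

here = Path(".").resolve()      # "." is the current working directory
parent = Path("..").resolve()   # ".." is its parent

# Relative location of the data folder, written the same way as in the notebook
# (the folder itself may not exist on your machine).
data_path = (Path("..") / "DAD_hospital" / "data").resolve()

print(here == cwd)            # "." and the working directory are the same place
print(parent == cwd.parent)   # ".." is one level up
print(data_path)
```

If read_csv raises a FileNotFoundError, printing data_path like this is a quick way to see where pandas is actually looking.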

In [6]: raw_df = pd.read_csv("../DAD_hospital/data/DAD_Case_Data_Corrected.csv",
                     sep=',', na_values=['', ' '])

# lower-case the column names and replace literal dots with underscores
raw_df.columns = raw_df.columns.str.lower().str.replace('.', '_', regex=False)

raw_df.head()

Out[6]:
sl no age gender marital_status key_complaints__code body_weight body_height hr_pulse bp__high bp_low rr past_medical_history_code hb urea c

0 1 58.0 M MARRIED other- heart 49 160 118 100 80 32 None 11 33

1 2 59.0 M MARRIED CAD-DVD 41 155 78 70 50 28 None 11 95

2 3 82.0 M MARRIED CAD-TVD 47 164 100 110 80 20 Diabetes2 12 15

3 4 46.0 M MARRIED CAD-DVD 80 173 122 110 80 24 hypertension1 12 74

4 5 60.0 M MARRIED CAD-DVD 58 175 72 180 100 18 Diabetes2 10 48

In [7]: #?pd.read_csv

Drop the 'sl no' column, since it will not be used for any analysis or model building.

In [8]: #?raw_df.drop()

In [9]: if set(['sl no']).issubset(raw_df.columns):
    raw_df.drop(['sl no'], axis=1, inplace=True)

raw_df.head()

Out[9]:
age gender marital_status key_complaints__code body_weight body_height hr_pulse bp__high bp_low rr past_medical_history_code hb urea creati

0 58.0 M MARRIED other- heart 49 160 118 100 80 32 None 11 33

1 59.0 M MARRIED CAD-DVD 41 155 78 70 50 28 None 11 95

2 82.0 M MARRIED CAD-TVD 47 164 100 110 80 20 Diabetes2 12 15

3 46.0 M MARRIED CAD-DVD 80 173 122 110 80 24 hypertension1 12 74

4 60.0 M MARRIED CAD-DVD 58 175 72 180 100 18 Diabetes2 10 48

2. Structure of the dataset

In [10]: raw_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 163 entries, 0 to 162
Data columns (total 26 columns):
age 163 non-null float64
gender 163 non-null object
marital_status 163 non-null object
key_complaints__code 163 non-null object
body_weight 163 non-null int64
body_height 163 non-null int64
hr_pulse 163 non-null int64
bp__high 163 non-null int64
bp_low 163 non-null int64
rr 163 non-null int64
past_medical_history_code 163 non-null object
hb 163 non-null int64
urea 163 non-null int64
creatinine 163 non-null float64
mode_of_arrival 163 non-null object
state_at_the_time_of_arrival 163 non-null object
type_of_admsn 163 non-null object
total_cost_to_hospital 163 non-null float64
total_amount_billed_to_the_patient 163 non-null int64
concession 163 non-null int64
actual_receivable_amount 163 non-null int64
total_length_of_stay 163 non-null int64
length_of_stay___icu 163 non-null int64
length_of_stay__ward 163 non-null int64
implant_used 163 non-null object
cost_of_implant 163 non-null int64
dtypes: float64(3), int64(15), object(8)
memory usage: 33.2+ KB

In [11]: raw_df.describe(include='all').transpose()
#raw_df.describe().transpose()

Out[11]:
count unique top freq mean std min 25% 50% 75% max

age 163 NaN NaN NaN 31.6063 26.6156 0.83 6 21 58 88

gender 163 2 M 110 NaN NaN NaN NaN NaN NaN NaN

marital_status 163 2 UNMARRIED 85 NaN NaN NaN NaN NaN NaN NaN

key_complaints__code 163 13 other- heart 42 NaN NaN NaN NaN NaN NaN NaN

body_weight 163 NaN NaN NaN 39.5521 22.9404 3 16.5 43 59.5 85

body_height 163 NaN NaN NaN 133.607 38.1152 19 110.5 151 162 185

hr_pulse 163 NaN NaN NaN 90.9141 19.3998 58 76 90 102 140

bp__high 163 NaN NaN NaN 113.767 23.228 70 100 110 130 215

bp_low 163 NaN NaN NaN 71.5337 15.7195 40 60 70 80 140

rr 163 NaN NaN NaN 23.227 3.77173 12 22 24 24 42

past_medical_history_code 163 7 None 105 NaN NaN NaN NaN NaN NaN NaN

hb 163 NaN NaN NaN 13.2086 3.10009 8 11 13 14.5 26

urea 163 NaN NaN NaN 28.4724 17.9362 15 18 24 32 143

creatinine 163 NaN NaN NaN 0.718405 0.461912 0.1 0.3 0.6 1 2.5

mode_of_arrival 163 3 WALKED IN 138 NaN NaN NaN NaN NaN NaN NaN

state_at_the_time_of_arrival 163 1 ALERT 163 NaN NaN NaN NaN NaN NaN NaN

type_of_admsn 163 2 ELECTIVE 138 NaN NaN NaN NaN NaN NaN NaN

total_cost_to_hospital 163 NaN NaN NaN 213398 136952 46093 137288 169951 247792 887350

total_amount_billed_to_the_patient 163 NaN NaN NaN 191525 116874 43641 150000 150000 209675 944819

concession 163 NaN NaN NaN 16100.8 20383.7 0 0 2196 37500 123132

actual_receivable_amount 163 NaN NaN NaN 178482 123157 31000 112500 137000 203722 848397

total_length_of_stay 163 NaN NaN NaN 12.0061 5.46763 3 8 11 14 41

length_of_stay___icu 163 NaN NaN NaN 3.84049 3.85049 0 1 2 5 22

length_of_stay__ward 163 NaN NaN NaN 8.18405 3.80908 0 6 7 10 22

implant_used 163 2 N 126 NaN NaN NaN NaN NaN NaN NaN

cost_of_implant 163 NaN NaN NaN 10137.3 24193.8 0 0 0 0 196848

Get the numeric features from the data and find the correlation among them

In [12]: numerical_features = [x for x in raw_df.select_dtypes(include=[np.number])]


numerical_features

Out[12]: ['age',
'body_weight',
'body_height',
'hr_pulse',
'bp__high',
'bp_low',
'rr',
'hb',
'urea',
'creatinine',
'total_cost_to_hospital',
'total_amount_billed_to_the_patient',
'concession',
'actual_receivable_amount',
'total_length_of_stay',
'length_of_stay___icu',
'length_of_stay__ward',
'cost_of_implant']

In [13]: numerical_features_df = raw_df.select_dtypes(include=[np.number])


numerical_features_df.corr()

Out[13]:
age body_weight body_height hr_pulse bp__high bp_low rr hb urea creatinine total_cost_to

age 1.000000 0.843398 0.722565 -0.451244 0.586568 0.465456 -0.234808 -0.218499 0.285690 0.708491

body_weight 0.843398 1.000000 0.846622 -0.534041 0.593387 0.482086 -0.307728 -0.147971 0.222444 0.714043

body_height 0.722565 0.846622 1.000000 -0.484088 0.511962 0.434896 -0.295007 -0.062932 0.225537 0.620786

hr_pulse -0.451244 -0.534041 -0.484088 1.000000 -0.291634 -0.207449 0.373234 0.099655 -0.024116 -0.334538

bp__high 0.586568 0.593387 0.511962 -0.291634 1.000000 0.772989 -0.083097 -0.083930 0.096395 0.443001

bp_low 0.465456 0.482086 0.434896 -0.207449 0.772989 1.000000 -0.015695 0.034689 0.043500 0.319224

rr -0.234808 -0.307728 -0.295007 0.373234 -0.083097 -0.015695 1.000000 0.035520 0.063190 -0.158310

hb -0.218499 -0.147971 -0.062932 0.099655 -0.083930 0.034689 0.035520 1.000000 -0.096701 -0.227718

urea 0.285690 0.222444 0.225537 -0.024116 0.096395 0.043500 0.063190 -0.096701 1.000000 0.639180

creatinine 0.708491 0.714043 0.620786 -0.334538 0.443001 0.319224 -0.158310 -0.227718 0.639180 1.000000

total_cost_to_hospital 0.499186 0.449536 0.390078 -0.060195 0.217561 0.211650 0.045726 -0.094229 0.280680 0.516058

total_amount_billed_to_the_patient 0.499330 0.446368 0.418448 -0.057116 0.226300 0.199455 0.069940 -0.101410 0.283243 0.499464

concession -0.387066 -0.429783 -0.309462 0.199744 -0.294828 -0.265444 0.195671 0.173086 -0.073098 -0.274000

actual_receivable_amount 0.549550 0.524614 0.473479 -0.103888 0.281007 0.262555 0.039106 -0.118508 0.283019 0.523746

total_length_of_stay 0.345171 0.178323 0.114701 0.009433 0.121619 0.107979 0.170249 -0.024840 0.236011 0.354600

length_of_stay___icu 0.494728 0.382562 0.277546 -0.080921 0.189863 0.141541 0.051388 -0.131131 0.254400 0.486857

length_of_stay__ward -0.013214 -0.133412 -0.116847 0.097868 -0.025814 0.007834 0.195577 0.104414 0.083921 0.016657

cost_of_implant 0.148869 0.277878 0.299271 -0.044194 -0.016220 0.061073 0.051949 -0.070642 0.247417 0.198562
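Among the columns displayed above, the pairs above the 0.70 cut-off are age/body_weight (0.843), age/body_height (0.723), age/creatinine (0.708), body_weight/body_height (0.847), body_weight/creatinine (0.714) and bp__high/bp_low (0.773). Reading these off by eye is error-prone, so here is a minimal sketch of one way to list such pairs. The helper name high_corr_pairs and the toy data are assumptions made for illustration; they are not part of the original notebook.

```python
import numpy as np
import pandas as pd

def high_corr_pairs(df, cutoff=0.70):
    """Return feature pairs whose absolute Pearson correlation is >= cutoff."""
    corr = df.corr().abs()
    # Keep only the strict upper triangle so that each pair appears once
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    return pairs[pairs >= cutoff].sort_values(ascending=False)

# Toy data, made up for illustration (not the DAD data)
rng = np.random.default_rng(0)
x = rng.normal(size=200)
toy_df = pd.DataFrame({
    "a": x,
    "b": x + rng.normal(scale=0.1, size=200),  # nearly a copy of "a"
    "c": rng.normal(size=200),                 # independent noise
})

result = high_corr_pairs(toy_df)
print(result)
```

On the notebook's data the call would be high_corr_pairs(numerical_features_df). A common follow-up is to keep only one feature from each highly correlated pair, or to combine them into a single feature.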

In [14]: # np.object was removed in NumPy 1.24; the string 'object' selects the same columns
categorical_features = [x for x in raw_df.select_dtypes(include=['object'])]


categorical_features

Out[14]: ['gender',
'marital_status',
'key_complaints__code',
'past_medical_history_code',
'mode_of_arrival',
'state_at_the_time_of_arrival',
'type_of_admsn',
'implant_used']

2. Summarizing the dataset


Create a new data frame so that the raw data stays intact for further manipulation if needed. The dropna() function is used here for row-wise deletion of missing values: axis=0 drops rows that contain a missing value, axis=1 drops columns that contain one.

In [15]: # thresh is omitted: recent pandas versions raise a TypeError when both how and thresh are passed
filter_df = raw_df.dropna(axis=0, how='any', subset=None, inplace=False)

list(filter_df.columns)

Out[15]: ['age',
'gender',
'marital_status',
'key_complaints__code',
'body_weight',
'body_height',
'hr_pulse',
'bp__high',
'bp_low',
'rr',
'past_medical_history_code',
'hb',
'urea',
'creatinine',
'mode_of_arrival',
'state_at_the_time_of_arrival',
'type_of_admsn',
'total_cost_to_hospital',
'total_amount_billed_to_the_patient',
'concession',
'actual_receivable_amount',
'total_length_of_stay',
'length_of_stay___icu',
'length_of_stay__ward',
'implant_used',
'cost_of_implant']
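Since this dataset has no missing values, the effect of the axis argument is not visible in the output above. The following sketch uses a made-up three-row frame (an assumption for illustration, not the DAD data) to show the difference between row-wise and column-wise deletion.

```python
import numpy as np
import pandas as pd

# Made-up frame with a single missing value in the "age" column
toy_df = pd.DataFrame({
    "age": [58.0, np.nan, 82.0],
    "gender": ["M", "M", "F"],
    "urea": [33, 95, 15],
})

rows_dropped = toy_df.dropna(axis=0, how="any")  # drops the row that contains NaN
cols_dropped = toy_df.dropna(axis=1, how="any")  # drops the column that contains NaN

print(rows_dropped.shape)  # one row fewer, all columns kept
print(cols_dropped.shape)  # all rows kept, "age" column removed
```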

In [16]: filter_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 163 entries, 0 to 162
Data columns (total 26 columns):
age 163 non-null float64
gender 163 non-null object
marital_status 163 non-null object
key_complaints__code 163 non-null object
body_weight 163 non-null int64
body_height 163 non-null int64
hr_pulse 163 non-null int64
bp__high 163 non-null int64
bp_low 163 non-null int64
rr 163 non-null int64
past_medical_history_code 163 non-null object
hb 163 non-null int64
urea 163 non-null int64
creatinine 163 non-null float64
mode_of_arrival 163 non-null object
state_at_the_time_of_arrival 163 non-null object
type_of_admsn 163 non-null object
total_cost_to_hospital 163 non-null float64
total_amount_billed_to_the_patient 163 non-null int64
concession 163 non-null int64
actual_receivable_amount 163 non-null int64
total_length_of_stay 163 non-null int64
length_of_stay___icu 163 non-null int64
length_of_stay__ward 163 non-null int64
implant_used 163 non-null object
cost_of_implant 163 non-null int64
dtypes: float64(3), int64(15), object(8)
memory usage: 34.4+ KB

We will first start by printing the unique labels in categorical features

In [17]: for f in categorical_features:
    print("\nThe unique labels in {} is {}\n".format(f, filter_df[f].unique()))
    print("The values in {} is \n{}\n".format(f, filter_df[f].value_counts()))

The unique labels in gender is ['M' 'F']

The values in gender is


M 110
F 53
Name: gender, dtype: int64

The unique labels in marital_status is ['MARRIED' 'UNMARRIED']

The values in marital_status is


UNMARRIED 85
MARRIED 78
Name: marital_status, dtype: int64

The unique labels in key_complaints__code is ['other- heart' 'CAD-DVD' 'CAD-TVD' 'RHD' 'CAD-SVD' 'other- respiratory'
'ACHD' 'other-tertalogy' 'other-nervous' 'PM-VSD' 'OS-ASD' 'CAD-VSD'
'other-general']

The values in key_complaints__code is


other- heart 42
CAD-DVD 26
CAD-TVD 22
RHD 19
ACHD 16
OS-ASD 13
other-tertalogy 11
other- respiratory 5
PM-VSD 4
CAD-SVD 2
other-nervous 1
other-general 1
CAD-VSD 1
Name: key_complaints__code, dtype: int64

The unique labels in past_medical_history_code is ['None' 'Diabetes2' 'hypertension1' 'hypertension3' 'hypertension2'


'Diabetes1' 'other']

The values in past_medical_history_code is


None 105
hypertension1 16
other 14
Diabetes1 9
Diabetes2 9
hypertension2 7
hypertension3 3
Name: past_medical_history_code, dtype: int64

The unique labels in mode_of_arrival is ['AMBULANCE' 'WALKED IN' 'TRANSFERRED']

The values in mode_of_arrival is


WALKED IN 138
AMBULANCE 23
TRANSFERRED 2
Name: mode_of_arrival, dtype: int64

The unique labels in state_at_the_time_of_arrival is ['ALERT']

The values in state_at_the_time_of_arrival is


ALERT 163
Name: state_at_the_time_of_arrival, dtype: int64

The unique labels in type_of_admsn is ['EMERGENCY' 'ELECTIVE']

The values in type_of_admsn is


ELECTIVE 138
EMERGENCY 25
Name: type_of_admsn, dtype: int64

The unique labels in implant_used is ['Y' 'N']

The values in implant_used is


N 126
Y 37
Name: implant_used, dtype: int64

Clubbing some of the feature labels together

In [18]: filter_df['past_medical_history_code']=np.where(
(filter_df['past_medical_history_code'] =='hypertension1') |
(filter_df['past_medical_history_code'] =='hypertension2') |
(filter_df['past_medical_history_code'] =='hypertension3'),
'hypertension', filter_df['past_medical_history_code'])

filter_df['past_medical_history_code']=np.where(
(filter_df['past_medical_history_code'] =='Diabetes1') |
(filter_df['past_medical_history_code'] =='Diabetes2'),
'diabetes', filter_df['past_medical_history_code'])

filter_df['key_complaints__code']=np.where(
(filter_df['key_complaints__code'] =='other- respiratory') |
(filter_df['key_complaints__code'] =='PM-VSD') |
(filter_df['key_complaints__code'] =='CAD-SVD') |
(filter_df['key_complaints__code'] =='CAD-VSD') |
(filter_df['key_complaints__code'] =='other-nervous') |
(filter_df['key_complaints__code'] =='other-general'),
'others', filter_df['key_complaints__code'])

#filter_df.past_medical_history_code.value_counts()

We will use groupby function of pandas to summarize numerical features by each categorical feature.

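In [19]: def group_by(categorical_features):
    # numeric_only=True restricts the summary to the numeric columns;
    # recent pandas versions raise an error on text columns otherwise
    std = filter_df.groupby(categorical_features).std(numeric_only=True)
    mean = filter_df.groupby(categorical_features).mean(numeric_only=True)
    return std, mean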

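Call the above function to summarize the numeric features by gender; the same call works for marital_status. The first output below is the standard deviation (s), the second is the mean (m).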

In [20]: s,m =group_by('gender')


s
m

Out[20]:
age body_weight body_height hr_pulse bp__high bp_low rr hb urea creatinine total_cost_to_hospital total_amount_b

gender

F 23.770187 18.770461 39.486444 19.231781 21.427181 15.639002 2.957242 3.479256 21.327497 0.387486 103066.628024

M 27.320604 23.770195 36.038124 19.539624 23.704886 15.520658 4.119321 2.916763 16.152084 0.479554 149136.095813

Out[20]:
age body_weight body_height hr_pulse bp__high bp_low rr hb urea creatinine total_cost_to_hospital total_amount

gender

F 24.45283 31.301887 120.886792 92.150943 108.377358 67.867925 23.283019 13.169811 29.056604 0.571698 185361.458113

M 35.05300 43.527273 139.736364 90.318182 116.363636 73.300000 23.200000 13.227273 28.190909 0.789091 226907.024273

Calculating BMI

In [21]: filter_df['bmi'] = filter_df.body_weight/(np.power((filter_df.body_height/100),2))
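The formula used above is BMI = weight in kg / (height in m)^2, and dividing body_height by 100 converts centimetres to metres. As a small worked sketch, the helper below (the name bmi is an assumption for illustration) applies the same arithmetic to the first row of the data shown earlier.

```python
def bmi(weight_kg, height_cm):
    """Body mass index: weight in kg divided by squared height in metres."""
    height_m = height_cm / 100   # centimetres to metres
    return weight_kg / height_m ** 2

# First row of the data shown earlier: body_weight = 49, body_height = 160
first_row_bmi = bmi(49, 160)
print(round(first_row_bmi, 2))   # 49 / 1.6^2, roughly 19.14
```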

3. Visualizing the Data


Plots can be drawn using the plotting functions of

1. pandas library (http://pandas.pydata.org/pandas-docs/stable/visualization.html)
2. matplotlib library (https://matplotlib.org/) or
3. seaborn library (https://seaborn.pydata.org/), which is based on matplotlib and provides an interface for drawing attractive statistical graphics.

3a. Visualizing the Data using pandas

In [22]: def hist_plot(data, group_by, xlabel, ylabel):
    pd.crosstab(data, group_by).plot(kind='hist')
    plt.xlabel(xlabel, size = 14)
    plt.ylabel(ylabel, size = 14)
    plt.title('Plot', size = 18)
    plt.grid(True)
    x1,x2,y1,y2 = plt.axis()
    plt.axis((0,x2,y1,y2))
    plt.show()
    #plt.subplot(1, 2)

In [23]: numerical_features_set = ['age','rr']
categorical_features_set = ['gender','marital_status']

for c in categorical_features_set:
    for n in numerical_features_set:
        hist_plot(filter_df[n], filter_df[c], n, c)

Model Approach 2: Without dummy variable coding


In [24]: import statsmodels.formula.api as smf

To list all the names (including the model classes) available in a library, use dir()

In [25]: #dir(smf)

In [26]: X_features = [x for x in filter_df if x not in ['body_weight','body_height',
                        'creatinine','state_at_the_time_of_arrival',
                        'total_amount_billed_to_the_patient','concession',
                        'actual_receivable_amount','total_length_of_stay',
                        'length_of_stay___icu','length_of_stay__ward']]

In [27]: new_df = filter_df.filter(X_features, axis =1)

new_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 163 entries, 0 to 162
Data columns (total 17 columns):
age 163 non-null float64
gender 163 non-null object
marital_status 163 non-null object
key_complaints__code 163 non-null object
hr_pulse 163 non-null int64
bp__high 163 non-null int64
bp_low 163 non-null int64
rr 163 non-null int64
past_medical_history_code 163 non-null object
hb 163 non-null int64
urea 163 non-null int64
mode_of_arrival 163 non-null object
type_of_admsn 163 non-null object
total_cost_to_hospital 163 non-null float64
implant_used 163 non-null object
cost_of_implant 163 non-null int64
bmi 163 non-null float64
dtypes: float64(3), int64(7), object(7)
memory usage: 22.9+ KB

In [28]: from sklearn.model_selection import train_test_split

train_df, test_df = train_test_split( new_df, test_size = 0.2, random_state = 42)
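With 163 rows and test_size = 0.2, the split gives 33 test rows and 130 training rows, which matches the 130 observations reported in the regression summary further down. The sketch below reproduces only the size arithmetic and the idea of shuffling before splitting. It is not scikit-learn's implementation, the helper name split_indices is an assumption, and the rows it selects will differ from those chosen by train_test_split.

```python
import math
import random

def split_indices(n_samples, test_size=0.2, seed=42):
    """Shuffle row positions and split them into train and test parts."""
    n_test = math.ceil(test_size * n_samples)   # the test share is rounded up
    n_train = n_samples - n_test
    positions = list(range(n_samples))
    random.Random(seed).shuffle(positions)      # fixed seed for repeatability
    return positions[:n_train], positions[n_train:]

train_idx, test_idx = split_indices(163)
print(len(train_idx), len(test_idx))   # 130 33
```

Fixing the seed (random_state in scikit-learn) keeps the split repeatable, so model comparisons across runs use the same test rows.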

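Write the formula with the set of variables to be used in model building. The formula takes the form Y ~ X, with the dependent variable on the left and the independent variables, joined by +, on the right. C() marks a feature as categorical.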

In [29]: pass_formula = 'total_cost_to_hospital ~ \
C(gender) + \
C(marital_status) + \
C(key_complaints__code) + \
C(past_medical_history_code) + \
C(mode_of_arrival) + \
C(type_of_admsn)+ \
C(implant_used) + \
age + hr_pulse + bp__high + bp_low + \
rr +hb + urea + cost_of_implant + bmi'
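Each C() term in the formula is dummy-coded automatically, with the first level in sorted order acting as the reference category. To get a feel for what that coding looks like, here is a rough equivalent using pandas get_dummies on a made-up frame. This is a sketch of the idea under those assumptions, not what the formula interface runs internally.

```python
import pandas as pd

# Made-up rows that reuse two of the categorical features and their labels
toy_df = pd.DataFrame({
    "gender": ["M", "F", "M", "F"],
    "mode_of_arrival": ["WALKED IN", "AMBULANCE", "TRANSFERRED", "WALKED IN"],
})

# drop_first=True drops the first (alphabetically sorted) level of each
# feature; that level becomes the reference category.
dummies = pd.get_dummies(toy_df, drop_first=True)
print(list(dummies.columns))
```

The remaining columns correspond to the terms that appear in the regression summary, for example C(gender)[T.M] and C(mode_of_arrival)[T.WALKED IN], with F and AMBULANCE as the reference levels.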

In [30]: regression_model = smf.ols(formula=pass_formula, data=train_df).fit()


regression_model.summary()

Out[30]:
OLS Regression Results

Dep. Variable: total_cost_to_hospital R-squared: 0.567

Model: OLS Adj. R-squared: 0.463

Method: Least Squares F-statistic: 5.447

Date: Tue, 21 Jul 2020 Prob (F-statistic): 3.14e-10

Time: 11:38:56 Log-Likelihood: -1666.9

No. Observations: 130 AIC: 3386.

Df Residuals: 104 BIC: 3460.

Df Model: 25

Covariance Type: nonrobust

coef std err t P>|t| [0.025 0.975]

Intercept -2.238e+05 1.66e+05 -1.345 0.182 -5.54e+05 1.06e+05

C(gender)[T.M] 1.296e+04 2.16e+04 0.599 0.550 -2.99e+04 5.58e+04

C(marital_status)[T.UNMARRIED] 6.486e+04 4.41e+04 1.470 0.145 -2.27e+04 1.52e+05

C(key_complaints__code)[T.CAD-DVD] 1.433e+05 4.78e+04 2.999 0.003 4.85e+04 2.38e+05

C(key_complaints__code)[T.CAD-TVD] 1.238e+05 4.93e+04 2.508 0.014 2.59e+04 2.22e+05

C(key_complaints__code)[T.OS-ASD] 4.156e+04 4.4e+04 0.946 0.347 -4.56e+04 1.29e+05

C(key_complaints__code)[T.RHD] -5.186e+04 5e+04 -1.037 0.302 -1.51e+05 4.73e+04

C(key_complaints__code)[T.other- heart] 4.871e+04 3.52e+04 1.384 0.169 -2.11e+04 1.19e+05

C(key_complaints__code)[T.other-tertalogy] 5.896e+04 4.68e+04 1.259 0.211 -3.39e+04 1.52e+05

C(key_complaints__code)[T.others] 7333.3118 4.48e+04 0.164 0.870 -8.15e+04 9.61e+04

C(past_medical_history_code)[T.diabetes] 3.587e+04 4.13e+04 0.869 0.387 -4.6e+04 1.18e+05

C(past_medical_history_code)[T.hypertension] -3.264e+04 3.3e+04 -0.989 0.325 -9.81e+04 3.28e+04

C(past_medical_history_code)[T.other] -3.735e+04 3.56e+04 -1.048 0.297 -1.08e+05 3.33e+04

C(mode_of_arrival)[T.TRANSFERRED] 1.257e+05 1.57e+05 0.799 0.426 -1.86e+05 4.38e+05

C(mode_of_arrival)[T.WALKED IN] 1.14e+05 1.1e+05 1.032 0.304 -1.05e+05 3.33e+05

C(type_of_admsn)[T.EMERGENCY] 1.394e+05 1.08e+05 1.287 0.201 -7.55e+04 3.54e+05

C(implant_used)[T.Y] 1.316e+05 5.1e+04 2.580 0.011 3.05e+04 2.33e+05

age 2013.2379 973.038 2.069 0.041 83.668 3942.808

hr_pulse 937.4968 567.707 1.651 0.102 -188.288 2063.281

bp__high 319.9341 741.341 0.432 0.667 -1150.174 1790.042

bp_low -1010.1327 1001.742 -1.008 0.316 -2996.625 976.359

rr 3040.9463 2717.108 1.119 0.266 -2347.181 8429.074

hb -636.4407 3220.833 -0.198 0.844 -7023.474 5750.592

urea 393.4563 653.097 0.602 0.548 -901.660 1688.572

cost_of_implant 1.2582 0.978 1.286 0.201 -0.681 3.198

bmi -91.9391 249.000 -0.369 0.713 -585.716 401.838

Omnibus: 82.903 Durbin-Watson: 1.987

Prob(Omnibus): 0.000 Jarque-Bera (JB): 626.472

Skew: 2.096 Prob(JB): 9.19e-137

Kurtosis: 12.904 Cond. No. 5.83e+05

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 5.83e+05. This might indicate that there are
strong multicollinearity or other numerical problems.

Find the significant variables


In [31]: def get_significant_vars (modelobject):
var_p_vals_df = pd.DataFrame(modelobject.pvalues)
var_p_vals_df['vars'] = var_p_vals_df.index
var_p_vals_df.columns = ['pvals', 'vars']
return list(var_p_vals_df[var_p_vals_df.pvals <= 0.05]['vars'])

In [32]: significant_vars = get_significant_vars(regression_model)


significant_vars

Out[32]: ['C(key_complaints__code)[T.CAD-DVD]',
'C(key_complaints__code)[T.CAD-TVD]',
'C(implant_used)[T.Y]',
'age']
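The helper above builds an intermediate data frame, but the pvalues attribute of a model fitted through the formula interface is already a pandas Series indexed by term name, so the same filter can be written as a single boolean selection. The sketch below shows this on a Series filled with a few p-values copied from the summary above, so it runs without fitting a model.

```python
import pandas as pd

# A few p-values copied from the regression summary above
pvalues = pd.Series({
    "C(implant_used)[T.Y]": 0.011,
    "age": 0.041,
    "hr_pulse": 0.102,
    "bmi": 0.713,
})

# Keep the terms whose p-value is at or below the 5% significance level
significant = pvalues[pvalues <= 0.05].index.tolist()
print(significant)   # ['C(implant_used)[T.Y]', 'age']
```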

Model Evaluation

1. The prediction on train data.

There are two ways to predict the outcome on the train set:

Use the predict function of the model object

Use the get_prediction function of the model object

For a model where the dummy variable coding was done explicitly, we need to add the constant term to the test set. For a model where the dummy variable coding is carried out automatically (the formula interface used here), there is no need to add the constant term to the test set.

Here is the output for the model without explicit dummy variable coding

In [33]: predict_train_df = regression_model.predict((train_df))


predict_train_df.head()

predict_train_df = regression_model.get_prediction(train_df)
predict_train_df.predicted_mean[0:5]

Out[33]: 84 135892.598214
2 333510.462816
94 183522.853799
45 216244.232684
42 268984.100132
dtype: float64

Out[33]: array([135892.59821423, 333510.46281619, 183522.85379855, 216244.23268421,


268984.10013156])

2. Model Evaluation - heteroscedasticity

In [34]: pred_val = regression_model.fittedvalues.copy()


true_val = train_df['total_cost_to_hospital'].values.copy()
residual = true_val - pred_val

In [35]: plt.scatter(pred_val, residual)  # fitted values on the x-axis, residuals on the y-axis

Out[35]: <matplotlib.collections.PathCollection at 0x7fec3e704ef0>

3. Model Evaluation - Test for Normality

In [36]: import statsmodels.api as sm


normality_plot = sm.qqplot(residual,line = 'r')

4. The prediction on test data.


The prediction can also be wrapped in a function. Below, one such function is defined and used for prediction.

In [37]: def get_predictions(test_actual, model, test_data):
    y_pred_df = pd.DataFrame({'actual': test_actual,
                              'predicted': model.get_prediction(test_data).predicted_mean})
    return y_pred_df

In [38]: predict_test_df = get_predictions( test_df.total_cost_to_hospital, regression_model, test_df)


predict_test_df.head()

Out[38]:
actual predicted

135 73218.0 174963.121210

115 144134.0 264900.398145

131 109117.0 96844.282918

55 220519.0 255575.535725

95 140545.0 126739.185201
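The exercise asks for the model performance on the test set, which the notebook stops short of computing. A minimal sketch of two common error measures, RMSE and MAE, follows. It uses only the five test rows displayed above, so the numbers it prints describe those five rows and not the full test set.

```python
import numpy as np

# The five test rows displayed above (not the full test set)
actual = np.array([73218.0, 144134.0, 109117.0, 220519.0, 140545.0])
predicted = np.array([174963.121210, 264900.398145, 96844.282918,
                      255575.535725, 126739.185201])

errors = actual - predicted
rmse = np.sqrt(np.mean(errors ** 2))   # root mean squared error
mae = np.mean(np.abs(errors))          # mean absolute error

print(f"RMSE: {rmse:.0f}, MAE: {mae:.0f}")
```

For the full test set, the same two lines can be applied to predict_test_df.actual and predict_test_df.predicted.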
