Dad Regression Using Python Statsmodel Formula PDF

The document discusses building regression models to predict the total cost of treatment using patient data from a hospital. It describes loading and preprocessing the data, including handling missing values, creating new features, and splitting the data into training and test sets. Various regression models are built using statsmodels, Scikit-learn, and gradient descent to estimate treatment costs and compare their performance on the test set. Key steps include adding a BMI feature, imputing missing values, and using different regression techniques like lasso, ridge, and elastic net.

Uploaded by

Shivesh Kumar


OLS Regression using Python (statsmodels.formula.api):

Kumar Rahul
We will be using the DAD hospital data in this exercise. Refer to Exhibit 1 to understand the feature list. Use the DAD hospital data to answer the questions below.
1. Load the dataset in a Jupyter notebook using pandas.
2. Build a correlation matrix between all the numeric features in the dataset. Report the features that are correlated at a cut-off of 0.70. What actions will you take on the highly correlated features?
3. Build a new feature named BMI using body height and body weight. Include this as a part of the data frame created in step 1.
4. Past medical history code has 175 instances of missing value (NaN). Impute ‘None’ as a label wherever the value is NaN for this feature.
5. Create a new data frame with the numeric features and with the categorical features coded as dummy variables. Which features will you include for model building, and why?
6. Split the data into a training set and a test set. Use 80% of the data for model training and 20% for model testing.
7. Build a model using age as the independent variable and cost of treatment as the dependent variable.

Is age a significant feature in this model?

What inferences can be drawn from this model?

8. Build a model with statsmodels.api to estimate the total cost to hospital. How do you interpret the model outcome? Report the model performance on the test set.
9. Build a model with statsmodels.formula.api to estimate the total cost to hospital and report the model performance on the test set. What difference do you observe between the model built here and the one built in step 8?
10. Build a model using the sklearn package to estimate the total cost to hospital. What difference do you observe in this model compared to the models built in steps 8 and 9?
11. Build a model using lasso, ridge and elastic net regression. What differences do you observe?
12. Build a model using gradient descent to get an intuition about the inner workings of optimization algorithms.
13. Build a model using gradient descent with regularization to get an intuition about the inner workings of optimization algorithms.

PS: Not all of the questions are answered in this notebook. You are encouraged to answer the missing ones yourself.

Exhibit 1

Sl.No.  Variable                      Description
1       Age                           Age of the patient in years
2       Body Weight                   Weight of the patient in Kilograms
3       Body Height                   Height of the patient in cm
4       HR Pulse                      Pulse of patient at the time of admission
5       BP-High                       High BP of patient (Systolic)
6       BP-Low                        Low BP of patient (Diastolic)
7       RR                            Respiratory rate of patient
8       HB                            Hemoglobin count of patient
9       Urea                          Urea levels of patient
10      Creatinine                    Creatinine levels of patient
11      Marital Status                Marital status of the patient
12      Gender                        Gender code for patient
13      Past Medical History Code     Code given to the past medical history of the Patient
14      Mode of Arrival               Way in which the patient arrived the hospital
15      State at the Time of Arrival  State in which the patient arrived
16      Type of Admission             Type of admission for the patient
17      Key Complaints Code           Codes given to the key complaints faced by the patient
18      Total Cost to Hospital        Actual cost incurred by the hospital
19      Total Length of Stay          Number of days patient stayed in the hospital
20      Length of Stay - ICU          Number of days patient stayed in the ICU
21      Length of Stay - Ward         Number of days patient stayed in the ward
22      Implant used (Y/N)            Any implant done on the patient
23      Cost of Implant               Total cost of all the implants done on the patient, if any

Code starts here

To identify the Python environment used by the kernel

In [1]: import sys, os

sys.executable

Out[1]: '/Users/Rahul/anaconda3/bin/python'

Suppress the warnings

In [2]: import warnings

warnings.filterwarnings("ignore")

We are going to use the libraries below for data import, processing and visualization. As we progress, we will use other, more specific libraries for model building and evaluation.

In [3]: import pandas as pd

import numpy as np
import seaborn as sn # visualization library based on matplotlib
import matplotlib.pyplot as plt # pyplot is the recommended plotting interface (pylab is discouraged)

#the output of plotting commands is displayed inline within Jupyter notebook

%matplotlib inline

Data Import and Manipulation

1. Importing a data set

Modify the ast_node_interactivity option of the IPython shell to display the value of every statement in a cell, not just the last one.

In [4]: from IPython.core.interactiveshell import InteractiveShell

InteractiveShell.ast_node_interactivity = "all"

Change the display settings for columns

In [5]: pd.options.display.max_columns
pd.set_option('display.max_columns', None)

Out[5]: 20

pandas resolves a relative path against the current working directory, which for a notebook is normally the folder that contains the notebook file. You can therefore reach the data folder from the current directory with '..'

The single period . means the current working directory

The double period .. means the parent of the current working directory
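As a small illustration of how these relative paths resolve, the sketch below uses Python's standard pathlib module. It prints whatever your own working directory happens to be; nothing in it is specific to the DAD data.

```python
from pathlib import Path

# '.' is the current working directory
here = Path('.').resolve()

# '..' is the parent of the current working directory
parent = Path('..').resolve()

print("current working directory:", here)
print("parent directory:         ", parent)
```

If the path passed to pd.read_csv cannot be found, printing these two values is a quick way to see where the relative path actually starts.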

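In [6]: raw_df = pd.read_csv("../DAD_hospital/data/DAD_Case_Data_Corrected.csv",
                     sep=',', na_values=['', ' '])

# lower-case the column names and replace every literal '.' with '_';
# regex=False makes sure '.' is not interpreted as a regex wildcard
raw_df.columns = raw_df.columns.str.lower().str.replace('.', '_', regex=False)

raw_df.head()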

Out[6]:
sl no age gender marital_status key_complaints__code body_weight body_height hr_pulse bp__high bp_low rr past_medical_history_code hb urea c

0 1 58.0 M MARRIED other- heart 49 160 118 100 80 32 None 11 33

1 2 59.0 M MARRIED CAD-DVD 41 155 78 70 50 28 None 11 95

2 3 82.0 M MARRIED CAD-TVD 47 164 100 110 80 20 Diabetes2 12 15

3 4 46.0 M MARRIED CAD-DVD 80 173 122 110 80 24 hypertension1 12 74

4 5 60.0 M MARRIED CAD-DVD 58 175 72 180 100 18 Diabetes2 10 48
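The column clean-up in In [6] can be tried on its own. The sketch below uses hypothetical raw headers (the real CSV headers are not shown in this notebook, so these are assumptions chosen to reproduce names such as key_complaints__code and bp__high).

```python
import pandas as pd

# Hypothetical raw headers; the real CSV headers are not shown in the notebook
raw_columns = pd.Index(['AGE', 'KEY.COMPLAINTS..CODE', 'BP..HIGH', 'BP.LOW'])

# regex=False treats '.' as a literal period rather than "any character"
clean_columns = raw_columns.str.lower().str.replace('.', '_', regex=False)

print(list(clean_columns))
```

Each period becomes one underscore, which would explain the double underscores in some of the cleaned names.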

In [7]: #?pd.read_csv

Drop the sl no column, as it will not be used for any analysis or model building.

In [8]: #?raw_df.drop()

In [9]: if set(['sl no']).issubset(raw_df.columns):
    raw_df.drop(['sl no'], axis=1, inplace=True)

raw_df.head()

Out[9]:
age gender marital_status key_complaints__code body_weight body_height hr_pulse bp__high bp_low rr past_medical_history_code hb urea creati

0 58.0 M MARRIED other- heart 49 160 118 100 80 32 None 11 33

1 59.0 M MARRIED CAD-DVD 41 155 78 70 50 28 None 11 95

2 82.0 M MARRIED CAD-TVD 47 164 100 110 80 20 Diabetes2 12 15

3 46.0 M MARRIED CAD-DVD 80 173 122 110 80 24 hypertension1 12 74

4 60.0 M MARRIED CAD-DVD 58 175 72 180 100 18 Diabetes2 10 48

2. Structure of the dataset

In [10]: raw_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 163 entries, 0 to 162
Data columns (total 26 columns):
age 163 non-null float64
gender 163 non-null object
marital_status 163 non-null object
key_complaints__code 163 non-null object
body_weight 163 non-null int64
body_height 163 non-null int64
hr_pulse 163 non-null int64
bp__high 163 non-null int64
bp_low 163 non-null int64
rr 163 non-null int64
past_medical_history_code 163 non-null object
hb 163 non-null int64
urea 163 non-null int64
creatinine 163 non-null float64
mode_of_arrival 163 non-null object
state_at_the_time_of_arrival 163 non-null object
type_of_admsn 163 non-null object
total_cost_to_hospital 163 non-null float64
total_amount_billed_to_the_patient 163 non-null int64
concession 163 non-null int64
actual_receivable_amount 163 non-null int64
total_length_of_stay 163 non-null int64
length_of_stay___icu 163 non-null int64
length_of_stay__ward 163 non-null int64
implant_used 163 non-null object
cost_of_implant 163 non-null int64
dtypes: float64(3), int64(15), object(8)
memory usage: 33.2+ KB

In [11]: raw_df.describe(include='all').transpose()
#raw_df.describe().transpose()

Out[11]:
count unique top freq mean std min 25% 50% 75% max

age 163 NaN NaN NaN 31.6063 26.6156 0.83 6 21 58 88

gender 163 2 M 110 NaN NaN NaN NaN NaN NaN NaN

marital_status 163 2 UNMARRIED 85 NaN NaN NaN NaN NaN NaN NaN

key_complaints__code 163 13 other- heart 42 NaN NaN NaN NaN NaN NaN NaN

body_weight 163 NaN NaN NaN 39.5521 22.9404 3 16.5 43 59.5 85

body_height 163 NaN NaN NaN 133.607 38.1152 19 110.5 151 162 185

hr_pulse 163 NaN NaN NaN 90.9141 19.3998 58 76 90 102 140

bp__high 163 NaN NaN NaN 113.767 23.228 70 100 110 130 215

bp_low 163 NaN NaN NaN 71.5337 15.7195 40 60 70 80 140

rr 163 NaN NaN NaN 23.227 3.77173 12 22 24 24 42

past_medical_history_code 163 7 None 105 NaN NaN NaN NaN NaN NaN NaN

hb 163 NaN NaN NaN 13.2086 3.10009 8 11 13 14.5 26

urea 163 NaN NaN NaN 28.4724 17.9362 15 18 24 32 143

creatinine 163 NaN NaN NaN 0.718405 0.461912 0.1 0.3 0.6 1 2.5

mode_of_arrival 163 3 WALKED IN 138 NaN NaN NaN NaN NaN NaN NaN

state_at_the_time_of_arrival 163 1 ALERT 163 NaN NaN NaN NaN NaN NaN NaN

type_of_admsn 163 2 ELECTIVE 138 NaN NaN NaN NaN NaN NaN NaN

total_cost_to_hospital 163 NaN NaN NaN 213398 136952 46093 137288 169951 247792 887350

total_amount_billed_to_the_patient 163 NaN NaN NaN 191525 116874 43641 150000 150000 209675 944819

concession 163 NaN NaN NaN 16100.8 20383.7 0 0 2196 37500 123132

actual_receivable_amount 163 NaN NaN NaN 178482 123157 31000 112500 137000 203722 848397

total_length_of_stay 163 NaN NaN NaN 12.0061 5.46763 3 8 11 14 41

length_of_stay___icu 163 NaN NaN NaN 3.84049 3.85049 0 1 2 5 22

length_of_stay__ward 163 NaN NaN NaN 8.18405 3.80908 0 6 7 10 22

implant_used 163 2 N 126 NaN NaN NaN NaN NaN NaN NaN

cost_of_implant 163 NaN NaN NaN 10137.3 24193.8 0 0 0 0 196848

Get the numeric features from the data and find the correlation among them

In [12]: numerical_features = [x for x in raw_df.select_dtypes(include=[np.number])]

numerical_features

Out[12]: ['age',
'body_weight',
'body_height',
'hr_pulse',
'bp__high',
'bp_low',
'rr',
'hb',
'urea',
'creatinine',
'total_cost_to_hospital',
'total_amount_billed_to_the_patient',
'concession',
'actual_receivable_amount',
'total_length_of_stay',
'length_of_stay___icu',
'length_of_stay__ward',
'cost_of_implant']

In [13]: numerical_features_df = raw_df.select_dtypes(include=[np.number])

numerical_features_df.corr()

Out[13]:
age body_weight body_height hr_pulse bp__high bp_low rr hb urea creatinine total_cost_to

age 1.000000 0.843398 0.722565 -0.451244 0.586568 0.465456 -0.234808 -0.218499 0.285690 0.708491

body_weight 0.843398 1.000000 0.846622 -0.534041 0.593387 0.482086 -0.307728 -0.147971 0.222444 0.714043

body_height 0.722565 0.846622 1.000000 -0.484088 0.511962 0.434896 -0.295007 -0.062932 0.225537 0.620786

hr_pulse -0.451244 -0.534041 -0.484088 1.000000 -0.291634 -0.207449 0.373234 0.099655 -0.024116 -0.334538

bp__high 0.586568 0.593387 0.511962 -0.291634 1.000000 0.772989 -0.083097 -0.083930 0.096395 0.443001

bp_low 0.465456 0.482086 0.434896 -0.207449 0.772989 1.000000 -0.015695 0.034689 0.043500 0.319224

rr -0.234808 -0.307728 -0.295007 0.373234 -0.083097 -0.015695 1.000000 0.035520 0.063190 -0.158310

hb -0.218499 -0.147971 -0.062932 0.099655 -0.083930 0.034689 0.035520 1.000000 -0.096701 -0.227718

urea 0.285690 0.222444 0.225537 -0.024116 0.096395 0.043500 0.063190 -0.096701 1.000000 0.639180

creatinine 0.708491 0.714043 0.620786 -0.334538 0.443001 0.319224 -0.158310 -0.227718 0.639180 1.000000

total_cost_to_hospital 0.499186 0.449536 0.390078 -0.060195 0.217561 0.211650 0.045726 -0.094229 0.280680 0.516058

total_amount_billed_to_the_patient 0.499330 0.446368 0.418448 -0.057116 0.226300 0.199455 0.069940 -0.101410 0.283243 0.499464

concession -0.387066 -0.429783 -0.309462 0.199744 -0.294828 -0.265444 0.195671 0.173086 -0.073098 -0.274000

actual_receivable_amount 0.549550 0.524614 0.473479 -0.103888 0.281007 0.262555 0.039106 -0.118508 0.283019 0.523746

total_length_of_stay 0.345171 0.178323 0.114701 0.009433 0.121619 0.107979 0.170249 -0.024840 0.236011 0.354600

length_of_stay___icu 0.494728 0.382562 0.277546 -0.080921 0.189863 0.141541 0.051388 -0.131131 0.254400 0.486857

length_of_stay__ward -0.013214 -0.133412 -0.116847 0.097868 -0.025814 0.007834 0.195577 0.104414 0.083921 0.016657

cost_of_implant 0.148869 0.277878 0.299271 -0.044194 -0.016220 0.061073 0.051949 -0.070642 0.247417 0.198562
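To report the feature pairs that cross the 0.70 cut-off without reading the matrix by eye, something like the following sketch could be used. The helper name high_corr_pairs is hypothetical, and the small matrix is copied from five of the rows and columns of Out[13] so that the example runs without the CSV file.

```python
import numpy as np
import pandas as pd

# Subset of the correlation matrix shown in Out[13]
cols = ['age', 'body_weight', 'body_height', 'bp__high', 'bp_low']
corr = pd.DataFrame(
    [[1.000000, 0.843398, 0.722565, 0.586568, 0.465456],
     [0.843398, 1.000000, 0.846622, 0.593387, 0.482086],
     [0.722565, 0.846622, 1.000000, 0.511962, 0.434896],
     [0.586568, 0.593387, 0.511962, 1.000000, 0.772989],
     [0.465456, 0.482086, 0.434896, 0.772989, 1.000000]],
    index=cols, columns=cols)


def high_corr_pairs(corr_matrix, cutoff=0.70):
    """Return (feature_1, feature_2, r) for each pair with |r| >= cutoff."""
    # Keep only the upper triangle so that each pair is reported once
    upper = np.triu(np.ones(corr_matrix.shape, dtype=bool), k=1)
    pairs = corr_matrix.where(upper).stack()
    pairs = pairs[pairs.abs() >= cutoff]
    return [(a, b, r) for (a, b), r in pairs.items()]


for a, b, r in high_corr_pairs(corr):
    print(f"{a} ~ {b}: {r:.3f}")
```

On the full data the same helper can be called as high_corr_pairs(numerical_features_df.corr()).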

In [14]: categorical_features = [x for x in raw_df.select_dtypes(include=['object'])] # np.object was removed in NumPy 1.24

categorical_features

Out[14]: ['gender',
'marital_status',
'key_complaints__code',
'past_medical_history_code',
'mode_of_arrival',
'state_at_the_time_of_arrival',
'type_of_admsn',
'implant_used']

2. Summarizing the dataset

Create a new data frame, filter_df, from the raw data so that raw_df stays intact for further manipulation if needed. The dropna() function deletes entries that contain missing values: axis=0 drops rows, axis=1 drops columns. All 163 rows survive here, so no row of this data set has a missing value.

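In [15]: filter_df = raw_df.dropna(axis=0, how='any').copy()
# how='any' drops a row if any of its values is missing; thresh is omitted because
# newer pandas versions raise a TypeError when how and thresh are both passed.
# .copy() makes filter_df independent of raw_df for the column assignments below.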

list(filter_df.columns )

Out[15]: ['age',
'gender',
'marital_status',
'key_complaints__code',
'body_weight',
'body_height',
'hr_pulse',
'bp__high',
'bp_low',
'rr',
'past_medical_history_code',
'hb',
'urea',
'creatinine',
'mode_of_arrival',
'state_at_the_time_of_arrival',
'type_of_admsn',
'total_cost_to_hospital',
'total_amount_billed_to_the_patient',
'concession',
'actual_receivable_amount',
'total_length_of_stay',
'length_of_stay___icu',
'length_of_stay__ward',
'implant_used',
'cost_of_implant']
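The effect of the axis argument of dropna() is easiest to see on a toy frame. The data below is made up for illustration and is not taken from the DAD file.

```python
import numpy as np
import pandas as pd

# Toy frame with one missing value; not taken from the DAD data
toy_df = pd.DataFrame({'age': [58.0, np.nan, 82.0],
                       'gender': ['M', 'M', 'F']})

rows_dropped = toy_df.dropna(axis=0, how='any')   # removes the row holding the NaN
cols_dropped = toy_df.dropna(axis=1, how='any')   # removes the 'age' column instead

print(rows_dropped.shape)
print(cols_dropped.shape)
print(toy_df.shape)   # the original is unchanged because inplace defaults to False
```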

In [16]: filter_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 163 entries, 0 to 162
Data columns (total 26 columns):
age 163 non-null float64
gender 163 non-null object
marital_status 163 non-null object
key_complaints__code 163 non-null object
body_weight 163 non-null int64
body_height 163 non-null int64
hr_pulse 163 non-null int64
bp__high 163 non-null int64
bp_low 163 non-null int64
rr 163 non-null int64
past_medical_history_code 163 non-null object
hb 163 non-null int64
urea 163 non-null int64
creatinine 163 non-null float64
mode_of_arrival 163 non-null object
state_at_the_time_of_arrival 163 non-null object
type_of_admsn 163 non-null object
total_cost_to_hospital 163 non-null float64
total_amount_billed_to_the_patient 163 non-null int64
concession 163 non-null int64
actual_receivable_amount 163 non-null int64
total_length_of_stay 163 non-null int64
length_of_stay___icu 163 non-null int64
length_of_stay__ward 163 non-null int64
implant_used 163 non-null object
cost_of_implant 163 non-null int64
dtypes: float64(3), int64(15), object(8)
memory usage: 34.4+ KB

We will first start by printing the unique labels in categorical features

In [17]: for f in categorical_features:
    print("\nThe unique labels in {} is {}\n".format(f, filter_df[f].unique()))
    print("The values in {} is \n{}\n".format(f, filter_df[f].value_counts()))

The unique labels in gender is ['M' 'F']

The values in gender is

M 110
F 53
Name: gender, dtype: int64

The unique labels in marital_status is ['MARRIED' 'UNMARRIED']

The values in marital_status is

UNMARRIED 85
MARRIED 78
Name: marital_status, dtype: int64

The unique labels in key_complaints__code is ['other- heart' 'CAD-DVD' 'CAD-TVD' 'RHD' 'CAD-SVD' 'other- respiratory'
'ACHD' 'other-tertalogy' 'other-nervous' 'PM-VSD' 'OS-ASD' 'CAD-VSD'
'other-general']

The values in key_complaints__code is

other- heart 42
CAD-DVD 26
CAD-TVD 22
RHD 19
ACHD 16
OS-ASD 13
other-tertalogy 11
other- respiratory 5
PM-VSD 4
CAD-SVD 2
other-nervous 1
other-general 1
CAD-VSD 1
Name: key_complaints__code, dtype: int64

The unique labels in past_medical_history_code is ['None' 'Diabetes2' 'hypertension1' 'hypertension3' 'hypertension2'

'Diabetes1' 'other']

The values in past_medical_history_code is

None 105
hypertension1 16
other 14
Diabetes1 9
Diabetes2 9
hypertension2 7
hypertension3 3
Name: past_medical_history_code, dtype: int64

The unique labels in mode_of_arrival is ['AMBULANCE' 'WALKED IN' 'TRANSFERRED']

The values in mode_of_arrival is

WALKED IN 138
AMBULANCE 23
TRANSFERRED 2
Name: mode_of_arrival, dtype: int64

The unique labels in state_at_the_time_of_arrival is ['ALERT']

The values in state_at_the_time_of_arrival is

ALERT 163
Name: state_at_the_time_of_arrival, dtype: int64

The unique labels in type_of_admsn is ['EMERGENCY' 'ELECTIVE']

The values in type_of_admsn is

ELECTIVE 138
EMERGENCY 25
Name: type_of_admsn, dtype: int64

The unique labels in implant_used is ['Y' 'N']

The values in implant_used is

N 126
Y 37
Name: implant_used, dtype: int64

Clubbing some of the sparse feature labels together (for example, hypertension1, hypertension2 and hypertension3 all become hypertension)

In [18]: filter_df['past_medical_history_code']=np.where(
(filter_df['past_medical_history_code'] =='hypertension1') |
(filter_df['past_medical_history_code'] =='hypertension2') |
(filter_df['past_medical_history_code'] =='hypertension3'),
'hypertension', filter_df['past_medical_history_code'])

filter_df['past_medical_history_code']=np.where(
(filter_df['past_medical_history_code'] =='Diabetes1') |
(filter_df['past_medical_history_code'] =='Diabetes2'),
'diabetes', filter_df['past_medical_history_code'])

filter_df['key_complaints__code']=np.where(
(filter_df['key_complaints__code'] =='other- respiratory') |
(filter_df['key_complaints__code'] =='PM-VSD') |
(filter_df['key_complaints__code'] =='CAD-SVD') |
(filter_df['key_complaints__code'] =='CAD-VSD') |
(filter_df['key_complaints__code'] =='other-nervous') |
(filter_df['key_complaints__code'] =='other-general'),
'others', filter_df['key_complaints__code'])

#filter_df.past_medical_history_code.value_counts()
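The same clubbing can also be expressed with a label-to-group dictionary and Series.replace, which some readers may find easier to extend than chained np.where calls. This is an alternative sketch, not the notebook's own code; the Series below simply lists the seven labels printed by In [17].

```python
import pandas as pd

# The seven labels reported for past_medical_history_code
history = pd.Series(['None', 'Diabetes2', 'hypertension1', 'hypertension3',
                     'hypertension2', 'Diabetes1', 'other'])

# Map each detailed label to its broader group; unmapped labels stay as they are
label_map = {'hypertension1': 'hypertension',
             'hypertension2': 'hypertension',
             'hypertension3': 'hypertension',
             'Diabetes1': 'diabetes',
             'Diabetes2': 'diabetes'}

clubbed = history.replace(label_map)
print(clubbed.value_counts())
```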

We will use groupby function of pandas to summarize numerical features by each categorical feature.

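In [19]: def group_by (categorical_features):
    # numeric_only=True skips the remaining text columns, which newer pandas
    # versions would otherwise refuse to aggregate
    std = filter_df.groupby(categorical_features).std(numeric_only=True)
    mean = filter_df.groupby(categorical_features).mean(numeric_only=True)
    return std, mean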

Call the above function to summarize the numeric features by gender (the same call works for marital_status)

In [20]: s,m =group_by('gender')

s
m

Out[20]:
age body_weight body_height hr_pulse bp__high bp_low rr hb urea creatinine total_cost_to_hospital total_amount_b

gender

F 23.770187 18.770461 39.486444 19.231781 21.427181 15.639002 2.957242 3.479256 21.327497 0.387486 103066.628024

M 27.320604 23.770195 36.038124 19.539624 23.704886 15.520658 4.119321 2.916763 16.152084 0.479554 149136.095813

Out[20]:
age body_weight body_height hr_pulse bp__high bp_low rr hb urea creatinine total_cost_to_hospital total_amount

gender

F 24.45283 31.301887 120.886792 92.150943 108.377358 67.867925 23.283019 13.169811 29.056604 0.571698 185361.458113

M 35.05300 43.527273 139.736364 90.318182 116.363636 73.300000 23.200000 13.227273 28.190909 0.789091 226907.024273
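A more compact way to get both summaries at once is to pass a list of aggregations to agg(). The sketch below uses a made-up four-row frame, so the numbers are illustrative only.

```python
import pandas as pd

# Toy data; values are illustrative, not rows of the DAD file
toy_df = pd.DataFrame({'gender': ['M', 'M', 'F', 'F'],
                       'age': [58.0, 60.0, 20.0, 30.0],
                       'hr_pulse': [118, 72, 90, 100]})

# One call returns mean and std side by side under each numeric column
summary = toy_df.groupby('gender').agg(['mean', 'std'])
print(summary)
```

The result has two-level column labels, so a single cell is read as summary.loc['M', ('age', 'mean')].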

Calculating BMI

In [21]: filter_df['bmi'] = filter_df.body_weight/(np.power((filter_df.body_height/100),2))
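The BMI expression divides weight in kilograms by the square of height in metres; the division by 100 converts the height from cm to m. As a worked check, the sketch below applies the same formula to the fifth patient shown in Out[9] (58 kg, 175 cm). The function name bmi is only for illustration.

```python
def bmi(weight_kg, height_cm):
    """Body mass index: weight in kg divided by squared height in metres."""
    height_m = height_cm / 100
    return weight_kg / height_m ** 2


# Fifth row of Out[9]: body_weight 58, body_height 175
print(round(bmi(58, 175), 2))   # 58 / 1.75**2 = 18.94
```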

3. Visualizing the Data

Plots can be drawn using the plotting functions of the

1. pandas library (http://pandas.pydata.org/pandas-docs/stable/visualization.html),

2. matplotlib library (https://matplotlib.org/), or
3. seaborn library (https://seaborn.pydata.org/), which is based on matplotlib and provides an interface for drawing attractive statistical graphics.

3a. Visualizing the Data using pandas

In [22]: def hist_plot(data, group_by, xlabel, ylabel):
    pd.crosstab(data, group_by).plot(kind='hist')
    plt.xlabel(xlabel, size = 14)
    plt.ylabel(ylabel, size = 14)
    plt.title('Plot', size = 18)
    plt.grid(True)
    x1, x2, y1, y2 = plt.axis()
    plt.axis((0, x2, y1, y2))
    plt.show()
    #plt.subplot(1, 2)

In [23]: numerical_features_set = ['age','rr']

categorical_features_set = ['gender','marital_status']

for c in categorical_features_set:
    for n in numerical_features_set:
        hist_plot(filter_df[n], filter_df[c], n, c)

Model Approach 2: Without dummy variable coding

In [24]: import statsmodels.formula.api as smf

To print the names of everything a library exposes, including its model classes, use dir()

In [25]: #dir(smf)

In [26]: X_features = [x for x in filter_df if x not in ['body_weight','body_height',

'creatinine','state_at_the_time_of_arrival',
'total_amount_billed_to_the_patient','concession',
'actual_receivable_amount','total_length_of_stay',
'length_of_stay___icu','length_of_stay__ward']]

In [27]: new_df = filter_df.filter(X_features, axis =1)

new_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 163 entries, 0 to 162
Data columns (total 17 columns):
age 163 non-null float64
gender 163 non-null object
marital_status 163 non-null object
key_complaints__code 163 non-null object
hr_pulse 163 non-null int64
bp__high 163 non-null int64
bp_low 163 non-null int64
rr 163 non-null int64
past_medical_history_code 163 non-null object
hb 163 non-null int64
urea 163 non-null int64
mode_of_arrival 163 non-null object
type_of_admsn 163 non-null object
total_cost_to_hospital 163 non-null float64
implant_used 163 non-null object
cost_of_implant 163 non-null int64
bmi 163 non-null float64
dtypes: float64(3), int64(7), object(7)
memory usage: 22.9+ KB

In [28]: from sklearn.model_selection import train_test_split

train_df, test_df = train_test_split( new_df, test_size = 0.2, random_state = 42)
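To build some intuition for what the split does, here is a hand-rolled shuffle split written with NumPy. It is an illustration only, not scikit-learn's implementation, and it will not select the same rows as train_test_split even with the same seed. Rounding the test share up is an assumption made so that 163 rows give 33 test rows and 130 training rows, the latter matching the number of observations reported later in the regression summary.

```python
import math

import numpy as np
import pandas as pd


def simple_train_test_split(df, test_size=0.2, seed=42):
    """Hand-rolled shuffle split; an illustration, not scikit-learn's code."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(len(df))          # random order of row positions
    n_test = math.ceil(test_size * len(df))      # round the test share up
    test_idx, train_idx = shuffled[:n_test], shuffled[n_test:]
    return df.iloc[train_idx], df.iloc[test_idx]


toy_df = pd.DataFrame({'x': range(163)})         # same row count as the DAD data
train_part, test_part = simple_train_test_split(toy_df)
print(len(train_part), len(test_part))
```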

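Write the formula with the set of variables to be used in model building. The formula takes the form Y ~ X1 + X2 + ..., and wrapping a variable in C() tells statsmodels to treat it as categorical and dummy-code it automatically.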

In [29]: pass_formula = ('total_cost_to_hospital ~ '
    'C(gender) + '
    'C(marital_status) + '
    'C(key_complaints__code) + '
    'C(past_medical_history_code) + '
    'C(mode_of_arrival) + '
    'C(type_of_admsn) + '
    'C(implant_used) + '
    'age + hr_pulse + bp__high + bp_low + '
    'rr + hb + urea + cost_of_implant + bmi')
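The categorical terms in the formula are expanded into 0/1 indicator columns, with one level of each feature held back as the reference. A roughly equivalent manual coding can be sketched with pandas get_dummies and drop_first=True; the toy frame below is made up, and the comparison with the formula's coding is an approximation for intuition rather than a statement about statsmodels internals.

```python
import pandas as pd

# Toy frame with two of the categorical features; values are illustrative
toy_df = pd.DataFrame({'gender': ['M', 'F', 'M'],
                       'implant_used': ['Y', 'N', 'N']})

# drop_first=True drops the first level (alphabetically) of each feature,
# leaving it as the reference category
dummies = pd.get_dummies(toy_df, drop_first=True)
print(list(dummies.columns))
```

The surviving columns, gender_M and implant_used_Y, correspond to the coefficient names C(gender)[T.M] and C(implant_used)[T.Y] that appear in the regression summary.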

In [30]: regression_model = smf.ols(formula=pass_formula, data=train_df).fit()

regression_model.summary()

Out[30]:
OLS Regression Results

Dep. Variable: total_cost_to_hospital R-squared: 0.567

Model: OLS Adj. R-squared: 0.463

Method: Least Squares F-statistic: 5.447

Date: Tue, 21 Jul 2020 Prob (F-statistic): 3.14e-10

Time: 11:38:56 Log-Likelihood: -1666.9

No. Observations: 130 AIC: 3386.

Df Residuals: 104 BIC: 3460.

Df Model: 25

Covariance Type: nonrobust

coef std err t P>|t| [0.025 0.975]

Intercept -2.238e+05 1.66e+05 -1.345 0.182 -5.54e+05 1.06e+05

C(gender)[T.M] 1.296e+04 2.16e+04 0.599 0.550 -2.99e+04 5.58e+04

C(marital_status)[T.UNMARRIED] 6.486e+04 4.41e+04 1.470 0.145 -2.27e+04 1.52e+05

C(key_complaints__code)[T.CAD-DVD] 1.433e+05 4.78e+04 2.999 0.003 4.85e+04 2.38e+05

C(key_complaints__code)[T.CAD-TVD] 1.238e+05 4.93e+04 2.508 0.014 2.59e+04 2.22e+05

C(key_complaints__code)[T.OS-ASD] 4.156e+04 4.4e+04 0.946 0.347 -4.56e+04 1.29e+05

C(key_complaints__code)[T.RHD] -5.186e+04 5e+04 -1.037 0.302 -1.51e+05 4.73e+04

C(key_complaints__code)[T.other- heart] 4.871e+04 3.52e+04 1.384 0.169 -2.11e+04 1.19e+05

C(key_complaints__code)[T.other-tertalogy] 5.896e+04 4.68e+04 1.259 0.211 -3.39e+04 1.52e+05

C(key_complaints__code)[T.others] 7333.3118 4.48e+04 0.164 0.870 -8.15e+04 9.61e+04

C(past_medical_history_code)[T.diabetes] 3.587e+04 4.13e+04 0.869 0.387 -4.6e+04 1.18e+05

C(past_medical_history_code)[T.hypertension] -3.264e+04 3.3e+04 -0.989 0.325 -9.81e+04 3.28e+04

C(past_medical_history_code)[T.other] -3.735e+04 3.56e+04 -1.048 0.297 -1.08e+05 3.33e+04

C(mode_of_arrival)[T.TRANSFERRED] 1.257e+05 1.57e+05 0.799 0.426 -1.86e+05 4.38e+05

C(mode_of_arrival)[T.WALKED IN] 1.14e+05 1.1e+05 1.032 0.304 -1.05e+05 3.33e+05

C(type_of_admsn)[T.EMERGENCY] 1.394e+05 1.08e+05 1.287 0.201 -7.55e+04 3.54e+05

C(implant_used)[T.Y] 1.316e+05 5.1e+04 2.580 0.011 3.05e+04 2.33e+05

age 2013.2379 973.038 2.069 0.041 83.668 3942.808

hr_pulse 937.4968 567.707 1.651 0.102 -188.288 2063.281

bp__high 319.9341 741.341 0.432 0.667 -1150.174 1790.042

bp_low -1010.1327 1001.742 -1.008 0.316 -2996.625 976.359

rr 3040.9463 2717.108 1.119 0.266 -2347.181 8429.074

hb -636.4407 3220.833 -0.198 0.844 -7023.474 5750.592

urea 393.4563 653.097 0.602 0.548 -901.660 1688.572

cost_of_implant 1.2582 0.978 1.286 0.201 -0.681 3.198

bmi -91.9391 249.000 -0.369 0.713 -585.716 401.838

Omnibus: 82.903 Durbin-Watson: 1.987

Prob(Omnibus): 0.000 Jarque-Bera (JB): 626.472

Skew: 2.096 Prob(JB): 9.19e-137

Kurtosis: 12.904 Cond. No. 5.83e+05

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 5.83e+05. This might indicate that there are
strong multicollinearity or other numerical problems.
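Two of the figures in the summary can be re-derived by hand, which helps when interpreting the table. The adjusted R-squared penalizes R-squared for the number of predictors, and each t value is the coefficient divided by its standard error. The sketch below plugs in the rounded numbers printed above (R-squared 0.567, 130 observations, 25 model degrees of freedom, and the age row), so small rounding differences are possible; the function name is only for illustration.

```python
def adjusted_r_squared(r_squared, n_obs, n_predictors):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    return 1 - (1 - r_squared) * (n_obs - 1) / (n_obs - n_predictors - 1)


# Values read from the summary: R-squared, No. Observations, Df Model
adj_r2 = adjusted_r_squared(0.567, n_obs=130, n_predictors=25)

# t statistic for age = coef / std err
t_age = 2013.2379 / 973.038

print(round(adj_r2, 3), round(t_age, 3))
```

Both values agree with the summary (Adj. R-squared 0.463 and t = 2.069 for age).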

Find the significant variables

In [31]: def get_significant_vars (modelobject):
    var_p_vals_df = pd.DataFrame(modelobject.pvalues)
    var_p_vals_df['vars'] = var_p_vals_df.index
    var_p_vals_df.columns = ['pvals', 'vars']
    return list(var_p_vals_df[var_p_vals_df.pvals <= 0.05]['vars'])

In [32]: significant_vars = get_significant_vars(regression_model)

significant_vars

Out[32]: ['C(key_complaints__code)[T.CAD-DVD]',
'C(key_complaints__code)[T.CAD-TVD]',
'C(implant_used)[T.Y]',
'age']
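Because the p-values of a fitted model come back as a pandas Series indexed by term name, the same selection can be written as a one-line filter. The sketch below builds such a Series by hand from a few rows of the summary table so that it runs without the fitted model; with the real model, regression_model.pvalues would take its place.

```python
import pandas as pd

# A few p-values copied from the P>|t| column of the summary
pvalues = pd.Series({
    'Intercept': 0.182,
    'C(gender)[T.M]': 0.550,
    'C(key_complaints__code)[T.CAD-DVD]': 0.003,
    'C(key_complaints__code)[T.CAD-TVD]': 0.014,
    'C(implant_used)[T.Y]': 0.011,
    'age': 0.041,
    'hr_pulse': 0.102,
})

# Keep the terms whose p-value is at or below the 5% level
significant = pvalues[pvalues <= 0.05].index.tolist()
print(significant)
```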

Model Evaluation

1. The prediction on train data.

There are two ways to predict the outcome on the train set:

Use the predict function of the model object

Use the get_prediction function of the model object

For a model where dummy variable coding was done explicitly, we need to add the constant term to the test set. For a model where dummy variable coding is carried out automatically by the formula, there is no need to add the constant term to the test set.

Here is the output for the model without explicit dummy variable coding

In [33]: predict_train_df = regression_model.predict((train_df))

predict_train_df.head()

predict_train_df = regression_model.get_prediction(train_df)
predict_train_df.predicted_mean[0:5]

Out[33]: 84 135892.598214
2 333510.462816
94 183522.853799
45 216244.232684
42 268984.100132
dtype: float64

Out[33]: array([135892.59821423, 333510.46281619, 183522.85379855, 216244.23268421,

268984.10013156])

2. Model Evaluation - heteroscedasticity

In [34]: pred_val = regression_model.fittedvalues.copy()

true_val = train_df['total_cost_to_hospital'].values.copy()
residual = true_val - pred_val

In [35]: plt.scatter(residual, pred_val)

Out[35]: <matplotlib.collections.PathCollection at 0x7fec3e704ef0>
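A scatter of residuals against fitted values is read by eye: a fan shape suggests that the error variance changes with the fitted value. As a rough numeric companion, the sketch below correlates the absolute residuals with the fitted values on synthetic data whose noise grows with x. This is an informal check swapped in for illustration, not a formal test such as Breusch-Pagan, and the data are simulated rather than taken from the DAD model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data whose noise standard deviation grows with x (heteroscedastic)
x = rng.uniform(1, 10, size=500)
y = 2.0 * x + rng.normal(0, 1, size=500) * x

# Ordinary least squares fit of a straight line
slope, intercept = np.polyfit(x, y, deg=1)
fitted = slope * x + intercept
resid = y - fitted

# If the spread of the residuals grows with the fitted value, this is clearly positive
spread_corr = np.corrcoef(fitted, np.abs(resid))[0, 1]
print(round(spread_corr, 2))
```

A value near zero would be consistent with constant variance; a clearly positive value, as here, points to variance that increases with the prediction.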

3. Model Evaluation - Test for Normality

In [36]: import statsmodels.api as sm

normality_plot = sm.qqplot(residual,line = 'r')
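The Q-Q plot has a numeric counterpart in the regression summary: the Jarque-Bera statistic, which combines the skew and kurtosis of the residuals. For normally distributed residuals the skew is 0 and the kurtosis is 3, so the statistic is close to 0. The sketch below recomputes it from the rounded values in the summary (130 observations, skew 2.096, kurtosis 12.904); the function name is only for illustration.

```python
def jarque_bera(n_obs, skew, kurtosis):
    """JB = n / 6 * (S^2 + (K - 3)^2 / 4)."""
    return n_obs / 6 * (skew ** 2 + (kurtosis - 3) ** 2 / 4)


# Values read from the OLS summary
jb = jarque_bera(130, 2.096, 12.904)
print(round(jb, 1))
```

The result is about 626.5, in line with the reported 626.472 (the small gap comes from the rounded inputs), and the near-zero Prob(JB) suggests the residuals are far from normal.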

4. The prediction on test data.

The prediction can be carried out by defining functions as well. Below is one such instance wherein a function is defined and is used for prediction

In [37]: def get_predictions(test_actual, model, test_data):
    y_pred_df = pd.DataFrame({'actual': test_actual,
                              'predicted': model.get_prediction(test_data).predicted_mean})
    return y_pred_df

In [38]: predict_test_df = get_predictions( test_df.total_cost_to_hospital, regression_model, test_df)

predict_test_df.head()

Out[38]:
actual predicted

135 73218.0 174963.121210

115 144134.0 264900.398145

131 109117.0 96844.282918

55 220519.0 255575.535725

95 140545.0 126739.185201
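To report performance on the test set, the actual and predicted columns can be summarized with error metrics such as RMSE and MAPE. The sketch below defines both with NumPy and applies them to the five rows displayed above; these are only the first five test rows, so the resulting numbers are not the performance on the full test set. The function names are hypothetical helpers, not part of the notebook.

```python
import numpy as np

# First five rows of predict_test_df as displayed in Out[38]
actual = np.array([73218.0, 144134.0, 109117.0, 220519.0, 140545.0])
predicted = np.array([174963.121210, 264900.398145, 96844.282918,
                      255575.535725, 126739.185201])


def rmse(y_true, y_pred):
    """Root mean squared error, in the units of the target."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent."""
    return float(np.mean(np.abs((y_true - y_pred) / y_true)) * 100)


print(rmse(actual, predicted), mape(actual, predicted))
```

On the full test set the same functions can be called with predict_test_df['actual'].values and predict_test_df['predicted'].values.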
