
Sardar Patel Institute of Technology, Mumbai

Department of Electronics and Telecommunication Engineering


B.E. Sem-VII- PE-IV (2024-2025)
IT 24 - AI in Healthcare

Experiment 1: Regression in Healthcare Dataset

Name: Mohammed Shanouf Valijan Ansari (2021300004) (Batch - M) Date: 14-08-2024

Objective:

● Write a program for regression analysis on a healthcare dataset.

● To demonstrate the working principle of regression techniques on a medical dataset by building a model that classifies/predicts the outcome for a new sample.
Outcomes:

● Explore a medical dataset suitable for a linear/logistic regression problem

● Explore patterns in the dataset and apply a suitable algorithm

System Requirements:
Linux OS with Python and libraries, or R, or Windows with MATLAB
Theory:

Regression is a statistical method used to model the relationship between a dependent variable (output) and one or more independent variables (inputs). Mathematically, in the case of simple linear regression, this relationship is expressed as y = β₀ + β₁x + ε, where y is the dependent variable, x is the independent variable, β₀ and β₁ are the coefficients (intercept and slope, respectively), and ε represents the error term, accounting for the deviation of observed values from the predicted ones. The goal is to determine the values of β₀ and β₁ that minimize the sum of squared errors (SSE) between the observed and predicted values of y. In multiple regression, the equation extends to y = β₀ + β₁x₁ + β₂x₂ + ⋯ + βₙxₙ + ε, involving multiple independent variables. The coefficients are typically estimated using methods like Ordinary Least Squares (OLS), and the model's accuracy is often assessed by metrics such as R-squared, which measures the proportion of variance in the dependent variable explained by the independent variables.
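As a quick worked illustration of OLS, the following minimal sketch (using made-up numbers, not the experiment's datasets) computes the slope and intercept of a simple linear regression in closed form, β₁ = Σ(xᵢ − x̄)(yᵢ − ȳ) / Σ(xᵢ − x̄)² and β₀ = ȳ − β₁x̄, along with SSE and R-squared:

import numpy as np

# Hypothetical sample: patient age (x) vs. length of stay in days (y)
x = np.array([25, 35, 45, 55, 65], dtype=float)
y = np.array([2.0, 2.8, 3.9, 4.1, 5.2])

x_bar, y_bar = x.mean(), y.mean()

# Closed-form OLS estimates for simple linear regression
beta1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
beta0 = y_bar - beta1 * x_bar

y_hat = beta0 + beta1 * x
sse = np.sum((y - y_hat) ** 2)            # sum of squared errors
r2 = 1 - sse / np.sum((y - y_bar) ** 2)   # proportion of variance explained

print(f"beta0={beta0:.3f}, beta1={beta1:.3f}, SSE={sse:.3f}, R2={r2:.3f}")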

Regression analysis has several types, each serving different purposes based on the nature of the
data and the relationship between variables. The main types include:
1. Linear Regression: This is the simplest form, where the relationship between the
dependent and independent variables is modeled as a straight line. It is used when the
data shows a linear trend.
2. Multiple Linear Regression: An extension of linear regression, this involves multiple
independent variables to predict the dependent variable. It is significant for understanding
how several factors simultaneously affect an outcome.
3. Polynomial Regression: This type models the relationship as an nth-degree polynomial,
allowing for curved relationships between variables. It is useful when the data exhibits
nonlinear trends.
4. Logistic Regression: Although named "regression," it is used for binary classification
problems, modeling the probability that a given input belongs to a specific class. It is
significant in fields like medical diagnostics and social sciences.
5. Ridge and Lasso Regression: These are regularization techniques applied to linear
regression to prevent overfitting by adding a penalty to the magnitude of coefficients.
Ridge regression penalizes the sum of squared coefficients, while Lasso penalizes the sum of absolute coefficients, which also enables feature selection (see the sketch after this list).
6. Quantile Regression: Instead of modeling the mean of the dependent variable, quantile
regression estimates the median or other quantiles. It is significant in cases where the
relationship between variables varies across different points of the distribution.

Each type of regression is significant in its own way, allowing analysts and researchers to choose
the most appropriate model for their specific data characteristics and research objectives.
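To make the regularization idea from point 5 concrete, here is a minimal sketch on synthetic data (not the datasets used below) comparing ordinary least squares, Ridge, and Lasso; the alpha values are arbitrary choices for illustration:

import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
# Only the first two features matter; the remaining three are pure noise
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=100)

for name, model in [('OLS', LinearRegression()),
                    ('Ridge', Ridge(alpha=10.0)),
                    ('Lasso', Lasso(alpha=0.1))]:
    model.fit(X, y)
    print(name, np.round(model.coef_, 3))

# Expected pattern: Ridge shrinks all coefficients toward zero, while
# Lasso sets the noise features' coefficients exactly to zero,
# effectively performing feature selection.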

Image Depicting Linear Regression on One Input Variable

Datasets:
For Linear Regression:
Patient Records of a Particular Hospital
(https://huggingface.co/datasets/Nicolybgs/healthcare_data)
For Logistic Regression:
Diabetes Dataset
(https://www.kaggle.com/datasets/mathchi/diabetes-data-set)
Algorithm:

Step 1: Create a sample dataset with multiple independent variables and one dependent variable (Y).
Step 2: Split the data into training and testing sets using the train_test_split function.
Step 3: Create a regression model and fit it to the training data.
Step 4: Make predictions on the test set.
Step 5: Evaluate the model using metrics like Mean Absolute Error, Mean Squared Error, and Root Mean Squared Error.
Step 6: Finally, print the coefficients and intercept of the regression equation.
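A minimal sketch of these six steps on a small synthetic dataset (the actual experiment code on the real datasets follows in the next section):

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Step 1: synthetic dataset with two independent variables and one target Y
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 2))
Y = 4 * X[:, 0] + 2 * X[:, 1] + rng.normal(scale=0.3, size=200)

# Step 2: split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, random_state=42)

# Step 3: create the regression model and fit it to the training data
model = LinearRegression().fit(X_train, y_train)

# Step 4: predict on the test set
y_pred = model.predict(X_test)

# Step 5: evaluate with MAE, MSE, and RMSE
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
print(f"MAE={mae:.3f}, MSE={mse:.3f}, RMSE={np.sqrt(mse):.3f}")

# Step 6: print the coefficients and intercept of the regression equation
print('Coefficients:', model.coef_, 'Intercept:', model.intercept_)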

Code:
For Linear Regression:
(Colab Notebook - LinearRegression.ipynb)
Task:
To predict the number of days an admitted patient will stay in a particular hospital depending on the patient’s disease severity, hospital department, attending doctor, ward, etc.

Imports-
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

df = pd.read_csv("hf://datasets/Nicolybgs/healthcare_data/healthcare_data.csv")

Preprocessing-
Getting all columns’ information and their respective unique values
for column in df.columns:
    print(column, ' ---> ', df[column].unique())

Dropping irrelevant columns
df = df.drop(columns=['patientid'])

Assigning numeric values to the ‘Age’ column
def range_to_midpoint(age_range):
    # Convert an age range string such as '21-30' to its midpoint (25.5)
    start, end = age_range.split('-')
    return (int(start) + int(end)) / 2

df['Age'] = df['Age'].apply(range_to_midpoint)

Encoding all the categorical columns using the one-hot encoding scheme, keeping the columns with numeric values intact
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

categorical_columns = ['Department', 'gender', 'Type of Admission',
                       'Severity of Illness', 'Insurance', 'Ward_Facility_Code',
                       'doctor_name', 'health_conditions']
numeric_columns = ['Available Extra Rooms in Hospital', 'staff_available',
                   'Age', 'Visitors with Patient', 'Admission_Deposit', 'Stay (in days)']

# Note: on scikit-learn >= 1.2 use OneHotEncoder(sparse_output=False) instead
categorical_transformer = OneHotEncoder(sparse=False)
preprocessor = ColumnTransformer(
    transformers=[
        ('cat', categorical_transformer, categorical_columns)
    ],
    remainder='passthrough'
)

df = preprocessor.fit_transform(df)

df = pd.DataFrame(df, columns=(
    list(preprocessor.named_transformers_['cat'].get_feature_names_out(categorical_columns)) +
    numeric_columns
))

Analysis-
Finding out the variables that are highly correlated with the output variable
corr_matrix = df.corr()

plt.figure(figsize=(12, 12))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', vmin=-1, vmax=1, fmt='.2f')
plt.title('Correlation Matrix Heatmap')
plt.show()
Dropping the attributes that have almost no correlation with the concerned variable
keep_columns = ['Age', 'Stay (in days)',
                'Department_TB & Chest disease', 'Department_anesthesia',
                'Department_gynecology', 'Department_radiotherapy',
                'Department_surgery',
                'gender_Female', 'gender_Male', 'gender_Other',
                'Ward_Facility_Code_A', 'Ward_Facility_Code_B',
                'Ward_Facility_Code_C', 'Ward_Facility_Code_D',
                'Ward_Facility_Code_E', 'Ward_Facility_Code_F',
                'doctor_name_Dr Isaac', 'doctor_name_Dr John',
                'doctor_name_Dr Mark', 'doctor_name_Dr Nathan',
                'doctor_name_Dr Olivia', 'doctor_name_Dr Sam',
                'doctor_name_Dr Sarah', 'doctor_name_Dr Simon',
                'doctor_name_Dr Sophia']
df_imp = df.drop(columns=list(set(df.columns) - set(keep_columns)))

Training and Testing: Model A (Considering all attributes)

X = df.drop('Stay (in days)', axis=1)
y = df['Stay (in days)']

print("Model A (Considering All Attributes)")

ratios = [(0.3, '7:3'), (0.2, '8:2')]

for test_size, ratio in ratios:
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=35)

    model = LinearRegression()
    model.fit(X_train, y_train)

    y_pred = model.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    mae = mean_absolute_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)
    rmse = np.sqrt(mse)

    print(f"Train-Test Ratio {ratio}:")
    print(f"Mean Squared Error (MSE): {mse}")
    print(f"Mean Absolute Error (MAE): {mae}")
    print(f"R2 Score (R2): {r2}")
    print(f"Root Mean Squared Error (RMSE): {rmse}")
    print(f"Coefficients: {model.coef_}")
    print(f"Intercept: {model.intercept_}")
    print('-' * 40)

Training and Testing: Model B (Considering attributes showing strong correlation with output)

X = df_imp.drop('Stay (in days)', axis=1)
y = df_imp['Stay (in days)']

print("Model B (Considering Attributes Showing Strong Correlation to Output)")

ratios = [(0.3, '7:3'), (0.2, '8:2')]

for test_size, ratio in ratios:
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=35)

    model = LinearRegression()
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)

    mse = mean_squared_error(y_test, y_pred)
    mae = mean_absolute_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)
    rmse = np.sqrt(mse)

    print(f"Train-Test Ratio {ratio}:")
    print(f"Mean Squared Error (MSE): {mse}")
    print(f"Mean Absolute Error (MAE): {mae}")
    print(f"R2 Score (R2): {r2}")
    print(f"Root Mean Squared Error (RMSE): {rmse}")
    print(f"Coefficients: {model.coef_}")
    print(f"Intercept: {model.intercept_}")
    print('-' * 40)

For Logistic Regression:
(Colab Notebook - LogisticRegression.ipynb)
Task:
To predict whether a particular person is diabetic or not depending on his/her glucose level, skin thickness, age, insulin level, blood pressure, BMI, etc.

Imports-
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, classification_report
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import MinMaxScaler

df = pd.read_csv('/content/drive/MyDrive/Datasets/diabetes.csv')

Preprocessing-
Normalizing the data
scaler_minmax = MinMaxScaler()

df = pd.DataFrame(scaler_minmax.fit_transform(df), columns=df.columns)

Analysis-
Finding the extent of correlations of independent variables with the dependent binary variable
corr_matrix = df.corr()

plt.figure(figsize=(10, 8))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', vmin=-1, vmax=1, fmt='.2f')
plt.title('Correlation Matrix Heatmap')
plt.show()
Considering attributes that are strongly correlated with the outcome
df_imp = df[['Pregnancies', 'Glucose', 'BMI', 'Age', 'Outcome']]

Training and Testing: Model A (Considering all attributes)-

X = df.drop('Outcome', axis=1)
y = df['Outcome']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

model = LogisticRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1 Score:", f1)
print("\nConfusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))
print(f"Coefficients: {model.coef_}")
print(f"Intercept: {model.intercept_}")

Training and Testing: Model B (Considering attributes showing strong correlation with the outcome)-

X = df_imp.drop('Outcome', axis=1)
y = df_imp['Outcome']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

model = LogisticRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1 Score:", f1)
print("\nConfusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))
print(f"Coefficients: {model.coef_}")
print(f"Intercept: {model.intercept_}")

Output:
For Linear Regression:
Model A on two split ratios - 7:3, 8:2
Model B on two split ratios - 7:3, 8:2
For Logistic Regression:

Model A on 7:3 split ratio

Model A on 8:2 split ratio


Model B on 7:3 split ratio

Conclusion:

By performing this experiment, I was able to understand how regression analysis can be carried out on healthcare datasets to predict certain information, whether categorical or continuous. Following are some observations regarding the models trained during this experiment:
● In the case of Linear Regression, the two models did not show a significant change in their performance metrics when the train-test split ratio was altered. However, the model does improve when the independent variables under consideration are those that show strong correlation with the output variable (the number of days an admitted patient will stay in the hospital).
● In the case of Logistic Regression, the first model, trained on all the attributes, showed an accuracy of 77.5% on a 7:3 split ratio, while the accuracy was around 80.5% on an 8:2 split ratio. Although the analysis shows that only a few variables have a comparatively strong correlation with the outcome, training a model on only those variables reduces the accuracy to 73.6%, implying that the variables with low individual correlation still drive the outcome collectively.
