Experiment 1
Objective:
● Explore a medical dataset suitable for a linear/logistic regression problem
● Explore the patterns in the dataset and apply a suitable algorithm
System Requirements:
Linux OS with Python and the required libraries, or R, or Windows with MATLAB
Theory:
Regression analysis has several types, each serving different purposes based on the nature of the
data and the relationship between variables. The main types include:
1. Linear Regression: This is the simplest form, where the relationship between the
dependent and independent variables is modeled as a straight line. It is used when the
data shows a linear trend.
2. Multiple Linear Regression: An extension of linear regression, this involves multiple
independent variables to predict the dependent variable. It is significant for understanding
how several factors simultaneously affect an outcome.
3. Polynomial Regression: This type models the relationship as an nth-degree polynomial,
allowing for curved relationships between variables. It is useful when the data exhibits
nonlinear trends.
4. Logistic Regression: Although named "regression," it is used for binary classification
problems, modeling the probability that a given input belongs to a specific class. It is
significant in fields like medical diagnostics and social sciences.
5. Ridge and Lasso Regression: These are regularization techniques applied to linear
regression to prevent overfitting by adding a penalty to the magnitude of coefficients.
Ridge regression penalizes the sum of squared coefficients, while Lasso penalizes the
sum of absolute coefficients, also allowing for feature selection (a short sketch
contrasting the two follows at the end of this section).
6. Quantile Regression: Instead of modeling the mean of the dependent variable, quantile
regression estimates the median or other quantiles. It is significant in cases where the
relationship between variables varies across different points of the distribution.
Each type of regression is significant in its own way, allowing analysts and researchers to choose
the most appropriate model for their specific data characteristics and research objectives.
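To make the contrast in point 5 concrete, here is a minimal sketch on synthetic data (the feature count, alpha values, and coefficients below are illustrative assumptions, not part of the experiment):
import numpy as np
from sklearn.linear_model import Ridge, Lasso

# Synthetic data: 5 features, but only the first two actually drive y
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = 3.0 * X[:, 0] + 1.5 * X[:, 1] + rng.normal(scale=0.1, size=100)

ridge = Ridge(alpha=1.0).fit(X, y)  # penalty on the sum of squared coefficients
lasso = Lasso(alpha=0.1).fit(X, y)  # penalty on the sum of absolute coefficients

# Ridge shrinks every coefficient toward zero; Lasso tends to drive the
# irrelevant ones exactly to zero, which is why it doubles as feature selection.
print("Ridge coefficients:", ridge.coef_.round(3))
print("Lasso coefficients:", lasso.coef_.round(3))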
Datasets:
For Linear Regression:
Patient Records of a Particular Hospital
(https://huggingface.co/datasets/Nicolybgs/healthcare_data)
For Logistic Regression:
Diabetes Dataset
(https://www.kaggle.com/datasets/mathchi/diabetes-data-set)
Algorithm:
Step 1: Create a sample dataset with multiple independent variables and one dependent
variable (Y).
Step 2: Split the data into training and testing sets using the train_test_split function.
Step 3: Create a regression model and fit it to the training data.
Step 4: Make predictions on the test set.
Step 5: Evaluate the model using metrics such as Mean Absolute Error, Mean Squared Error,
and Root Mean Squared Error.
Step 6: Finally, print the coefficients and intercept of the regression equation.
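The steps above can be sketched end to end on a small synthetic dataset (all names and values here are illustrative, not the experiment's actual data):
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Step 1: sample dataset with three independent variables and one dependent variable Y
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
Y = 4.0 * X[:, 0] - 2.0 * X[:, 1] + 0.5 * X[:, 2] + rng.normal(scale=0.1, size=200)

# Step 2: train/test split (7:3 here, matching one of the ratios used below)
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, random_state=42)

# Step 3: create and fit the regression model
model = LinearRegression().fit(X_train, y_train)

# Step 4: predict on the test set
y_pred = model.predict(X_test)

# Step 5: evaluate with MAE, MSE, and RMSE
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
print(f"MAE: {mae:.3f}, MSE: {mse:.3f}, RMSE: {np.sqrt(mse):.3f}")

# Step 6: print the coefficients and intercept of the regression equation
print(f"Coefficients: {model.coef_}")
print(f"Intercept: {model.intercept_}")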
Code:
For Linear Regression:
(Colab Notebook - LinearRegression.ipynb)
Task:
To predict the number of days an admitted patient will stay in a particular hospital based on
the severity of the patient's illness, the hospital department, the attending doctor, the ward, etc.
Imports-
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
df = pd.read_csv("hf://datasets/Nicolybgs/healthcare_data/healthcare_data.csv")
Preprocessing-
Getting all columns’ information and their respective unique values
for column in df.columns:
    print(column, ' ---> ', df[column].unique())
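The Age column in this dataset stores ranges as strings (e.g. '21-30'), so it must be converted to a numeric value before regression. The helper below is a minimal sketch assuming the 'low-high' string format:
def range_to_midpoint(age_range):
    # '21-30' -> 25.5; assumes every entry is a 'low-high' string
    low, high = age_range.split('-')
    return (int(low) + int(high)) / 2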
df['Age'] = df['Age'].apply(range_to_midpoint)
Encoding all the categorical columns using the one-hot encoding scheme, keeping the
columns with numeric values intact
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Assumed definition: split column names by dtype, so string columns get encoded
categorical_columns = df.select_dtypes(include='object').columns.tolist()
numeric_columns = df.select_dtypes(exclude='object').columns.tolist()

categorical_transformer = OneHotEncoder(sparse_output=False)  # use sparse=False on scikit-learn < 1.2
preprocessor = ColumnTransformer(
    transformers=[
        ('cat', categorical_transformer, categorical_columns)
    ],
    remainder='passthrough'
)
df = preprocessor.fit_transform(df)
df = pd.DataFrame(df, columns=(
    list(preprocessor.named_transformers_['cat'].get_feature_names_out(categorical_columns)) +
    numeric_columns
))
Analysis-
Finding out the variables that are highly correlated with the output variable
corr_matrix = df.corr()
plt.figure(figsize=(12, 12))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', vmin=-1, vmax=1, fmt='.2f')
plt.title('Correlation Matrix Heatmap')
plt.show()
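Beyond eyeballing the heatmap, the correlations with the target can also be ranked directly (a small convenience snippet, reusing the same corr_matrix):
corr_with_target = corr_matrix['Stay (in days)'].drop('Stay (in days)')
print(corr_with_target.abs().sort_values(ascending=False))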
Dropping the attributes that have almost no correlation with the concerned variable
df_imp = df.drop(columns=list(set(df.columns) - set([
    'Age', 'Stay (in days)',
    'Department_TB & Chest disease', 'Department_anesthesia',
    'Department_gynecology', 'Department_radiotherapy', 'Department_surgery',
    'gender_Female', 'gender_Male', 'gender_Other',
    'Ward_Facility_Code_A', 'Ward_Facility_Code_B', 'Ward_Facility_Code_C',
    'Ward_Facility_Code_D', 'Ward_Facility_Code_E', 'Ward_Facility_Code_F',
    'doctor_name_Dr Isaac', 'doctor_name_Dr John', 'doctor_name_Dr Mark',
    'doctor_name_Dr Nathan', 'doctor_name_Dr Olivia', 'doctor_name_Dr Sam',
    'doctor_name_Dr Sarah', 'doctor_name_Dr Simon', 'doctor_name_Dr Sophia',
])))
Training and Testing: Model A (Considering all attributes)
X = df.drop('Stay (in days)', axis=1)
y = df['Stay (in days)']
# test_size=0.3 for the 7:3 split; use 0.2 for the 8:2 split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
rmse = np.sqrt(mse)
print(f"MSE: {mse}, RMSE: {rmse}, R2: {r2}")
print(f"Coefficients: {model.coef_}")
print(f"Intercept: {model.intercept_}")
print('-' * 40)
Training and Testing: Model B (Considering attributes showing strong correlation with output)
X = df_imp.drop('Stay (in days)', axis=1)
y = df_imp['Stay (in days)']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
rmse = np.sqrt(mse)
print(f"MSE: {mse}, RMSE: {rmse}, R2: {r2}")
print(f"Coefficients: {model.coef_}")
print(f"Intercept: {model.intercept_}")
print('-' * 40)
For Logistic Regression:
Imports-
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
df = pd.read_csv('/content/drive/MyDrive/Datasets/diabetes.csv')
Preprocessing-
Normalizing the data
scaler_minmax = MinMaxScaler()
df = pd.DataFrame(scaler_minmax.fit_transform(df), columns=df.columns)
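For reference, min-max scaling maps each value x of a feature to (x - min) / (max - min), so every feature ends up in the [0, 1] range.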
Analysis-
Finding the extent of correlations of independent variables with the dependent binary variable
corr_matrix = df.corr()
plt.figure(figsize=(10, 8))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', vmin=-1, vmax=1, fmt='.2f')
plt.title('Correlation Matrix Heatmap')
plt.show()
Considering attributes that are strongly correlated with the outcome
df_imp = df[['Pregnancies', 'Glucose', 'BMI', 'Age', 'Outcome']]
Training and Testing: Model A (Considering all attributes)-
X = df.drop('Outcome', axis=1)
y = df['Outcome']
# test_size=0.3 for the 7:3 split; use 0.2 for the 8:2 split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
model = LogisticRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1:", f1)
print(f"Coefficients: {model.coef_}")
print(f"Intercept: {model.intercept_}")
Training and Testing: Model B (Considering attributes showing strong correlation with the outcome)-
X = df_imp.drop('Outcome', axis=1)
y = df_imp['Outcome']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
model = LogisticRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1:", f1)
print(f"Coefficients: {model.coef_}")
print(f"Intercept: {model.intercept_}")
Output:
For Linear Regression:
Model A on two split ratios - 7:3, 8:2
Model B on two split ratios - 7:3, 8:2
For Logistic Regression:
Conclusion:
By performing this experiment, I was able to understand how regression analysis can be carried
out on healthcare datasets to predict both categorical and continuous outcomes.
The following are some observations regarding the models trained during this experiment-
● In the case of Linear Regression, the two models did not show a significant change in
their performance metrics when the train-test split ratio was altered. However, the model
does improve when the independent variables under consideration are those that show
positive correlation with the output variable (the number of days an admitted patient will
stay in the hospital).
● In the case of Logistic Regression, the first model, trained on all the attributes, showed an
accuracy of 77.5% on a 7:3 split ratio, while the accuracy was around 80.5% on an 8:2
split ratio. Although the analysis shows that only a few variables have a comparatively
strong correlation with the outcome, training a model on only those variables reduces the
accuracy to 73.6%, implying that the variables with low individual correlation still drive
the outcome collectively.