Hcin620 m6 Lab6 Hanifahmutesi-Finalproject
In this project we are tasked with predicting the stages of Chronic Kidney Disease (CKD) based on Glomerular Filtration Rate (GFR). Information
about the dataset is available at this link: http://archive.ics.uci.edu/ml//datasets/Chronic_Kidney_Disease
In order to succeed in this final you will be required to write most of the Python code. Please review all the previous labs before you begin, and
read the instructions carefully.
Rather than using a question/answer format, we have commented the code cells with the notation #TODO. This is a placeholder technique (a To
Do list, so to speak) commonly used in machine learning. Complete each #TODO task requested of you.
Good Luck!
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use('ggplot')
#ANSWER KEY TODO 2. Add a print command to acknowledge completion of the import.
print('Import complete')
Import complete
Read the CSV file into a dataframe, and use the name of the dataframe to print the first and last 5 rows.
df = pd.read_csv('data-lab-6-ckd-courseproject.csv')
data = df.copy()
#ANSWER KEY TODO 4. Use the name of the dataframe to print the first and last five rows.
data
Age  Blood Pressure  Specific Gravity  Albumin  Sugar  Red Blood Cells  Pus Cell  Pus Cell clumps  Bacteria  Blood Glucose Random  Blood Urea  Serum Creatinine  Sodium  P
0 48.0 70.0 1.005 4.0 0.0 normal abnormal present notpresent 117.0 56.0 3.8 111.0
1 53.0 90.0 1.020 2.0 0.0 abnormal abnormal present notpresent 70.0 107.0 7.2 114.0
2 63.0 70.0 1.010 3.0 0.0 abnormal abnormal present notpresent 380.0 60.0 2.7 131.0
3 68.0 80.0 1.010 3.0 2.0 normal abnormal present present 157.0 90.0 4.1 130.0
4 61.0 80.0 1.015 2.0 0.0 abnormal abnormal notpresent notpresent 173.0 148.0 3.9 135.0
... ... ... ... ... ... ... ... ... ... ... ... ... ...
153 55.0 80.0 1.020 0.0 0.0 normal normal notpresent notpresent 140.0 49.0 0.5 150.0
154 42.0 70.0 1.025 0.0 0.0 normal normal notpresent notpresent 75.0 31.0 1.2 141.0
155 12.0 80.0 1.020 0.0 0.0 normal normal notpresent notpresent 100.0 26.0 0.6 137.0
156 17.0 60.0 1.025 0.0 0.0 normal normal notpresent notpresent 114.0 50.0 1.0 135.0
157 58.0 80.0 1.025 0.0 0.0 normal normal notpresent notpresent 131.0 18.0 1.1 141.0
data.isnull().sum()
Age 0
Blood Pressure 0
Specific Gravity 0
Albumin 0
Sugar 0
Red Blood Cells 0
Pus Cell 0
Pus Cell clumps 0
Bacteria 0
Blood Glucose Random 0
Blood Urea 0
Serum Creatinine 0
Sodium 0
Potassium 0
Hemoglobin 0
Packed Cell Volume 0
White Blood Cell Count 0
Red Blood Cell Count 0
Hypertension 0
Diabetes Mellitus 0
Coronary Artery Disease 0
Appetite 0
Pedal Edema 0
Anemia 0
Class 0
dtype: int64
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
data.drop_duplicates()
Age  Blood Pressure  Specific Gravity  Albumin  Sugar  Red Blood Cells  Pus Cell  Pus Cell clumps  Bacteria  Blood Glucose Random  Blood Urea  Serum Creatinine  Sodium  P
0 48.0 70.0 1.005 4.0 0.0 normal abnormal present notpresent 117.0 56.0 3.8 111.0
1 53.0 90.0 1.020 2.0 0.0 abnormal abnormal present notpresent 70.0 107.0 7.2 114.0
2 63.0 70.0 1.010 3.0 0.0 abnormal abnormal present notpresent 380.0 60.0 2.7 131.0
3 68.0 80.0 1.010 3.0 2.0 normal abnormal present present 157.0 90.0 4.1 130.0
4 61.0 80.0 1.015 2.0 0.0 abnormal abnormal notpresent notpresent 173.0 148.0 3.9 135.0
... ... ... ... ... ... ... ... ... ... ... ... ... ...
153 55.0 80.0 1.020 0.0 0.0 normal normal notpresent notpresent 140.0 49.0 0.5 150.0
154 42.0 70.0 1.025 0.0 0.0 normal normal notpresent notpresent 75.0 31.0 1.2 141.0
155 12.0 80.0 1.020 0.0 0.0 normal normal notpresent notpresent 100.0 26.0 0.6 137.0
156 17.0 60.0 1.025 0.0 0.0 normal normal notpresent notpresent 114.0 50.0 1.0 135.0
157 58.0 80.0 1.025 0.0 0.0 normal normal notpresent notpresent 131.0 18.0 1.1 141.0
url = 'https://drive.google.com/file/d/19oQaKN4NQiGa6Tq9P18Zk3ImbKgzIw9m/view?usp=sharing'
path = 'https://drive.google.com/uc?export=download&id='+url.split('/')[-2]
df = pd.read_csv(path)
data = df.copy()
print('Data connection complete')
Next we're going to build the targets, which are the stages of CKD, in a column we will call "CKD Stages". There are various equations for calculating
GFR, but here we will stick with a simplified form. Please read about GFR at the following link.
Source: https://www.niddk.nih.gov/health-information/professionals/clinical-tools-patient-management/kidney-disease/laboratory-evaluation/glomerular-filtration-rate/estimating
# Used a formula given by NIDDK which was simpler and made it even simpler by
# removing the effect of race and gender
# GFR (mL/min/1.73 m^2) = 175 × (Scr)^-1.154 × (Age)^-0.203
data = data[removed_outliers]
2 53
1 45
3 39
Name: CKD Stages, dtype: int64
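The simplified formula in the comment above can be sketched as follows. This is a minimal illustration, not the notebook's own cell: the helper names `simplified_gfr` and `ckd_stage` and the three-row `demo` frame are hypothetical, and the stage cut-offs (90, 60, 30, 15 mL/min/1.73 m^2) are the commonly used five-stage boundaries, assumed here because the notebook's own binning is not shown.

```python
import numpy as np
import pandas as pd

def simplified_gfr(scr, age):
    """Simplified GFR: 175 * Scr^-1.154 * Age^-0.203 (race and gender terms dropped)."""
    return 175.0 * np.power(scr, -1.154) * np.power(age, -0.203)

def ckd_stage(gfr):
    """Map GFR to a stage number. The bin edges are an assumption, not taken from the notebook."""
    bins = [-np.inf, 15, 30, 60, 90, np.inf]
    labels = [5, 4, 3, 2, 1]  # lower GFR means a higher stage number
    return pd.cut(gfr, bins=bins, labels=labels, right=False).astype(int)

# Hypothetical demo rows using the column names from the dataset.
demo = pd.DataFrame({
    'Age': [48.0, 53.0, 42.0],
    'Serum Creatinine': [3.8, 7.2, 1.2],
})
demo['GFR'] = simplified_gfr(demo['Serum Creatinine'], demo['Age'])
demo['CKD Stages'] = ckd_stage(demo['GFR'])
print(demo)
print(demo['CKD Stages'].value_counts())
```

Calling `value_counts()` on the new column gives the same kind of per-stage summary as the output shown above.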
Histogram
There are various ways to encode categorical data. You learned about some of them during labs. Visit this link and read it thoroughly. Find an
appropriate encoding scheme and transform the categorical attributes of your dataset.
# TO DO: Isolate the categorical features and encode them using a proper encoding method.
# TO DO: Isolate the numerical features and scale them using a proper scaling method.
numerical_ix = X.select_dtypes(include=['float64']).columns
categorical_ix = X.select_dtypes(include=['object']).columns
X = transform.fit_transform(X)
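One way the `transform` object used above could be built is with scikit-learn's `ColumnTransformer`. The sketch below is an assumption about the missing definition, not the notebook's actual code: the choice of `OneHotEncoder` and `StandardScaler`, the name `X_encoded`, and the four-row demo frame are all illustrative.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical miniature feature frame with two numerical and two categorical columns.
X = pd.DataFrame({
    'Age': [48.0, 53.0, 63.0, 55.0],
    'Blood Urea': [56.0, 107.0, 60.0, 49.0],
    'Pus Cell': pd.Series(['abnormal', 'abnormal', 'abnormal', 'normal'], dtype=object),
    'Bacteria': pd.Series(['notpresent', 'notpresent', 'present', 'notpresent'], dtype=object),
})

numerical_ix = X.select_dtypes(include=['float64']).columns
categorical_ix = X.select_dtypes(include=['object']).columns

# Assumed encoders: one-hot for categories, standardization for numbers.
transform = ColumnTransformer(
    transformers=[
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_ix),
        ('num', StandardScaler(), numerical_ix),
    ],
    sparse_threshold=0,  # always return a dense array
)

X_encoded = transform.fit_transform(X)
print(X_encoded.shape)
```

Each categorical column here has two levels, so the encoded matrix has four one-hot columns followed by the two scaled numerical columns.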
Logistic Regression
# Use Logistic Regression to predict the stage of the kidney function
# and print the accuracies of your model on both train and test data
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
log_reg = LogisticRegression()
log_reg.fit(X_train, y_train)
y_pred = log_reg.predict(X_test)
print(f'Accuracy on test set: {accuracy_score(y_test, y_pred):.3f}')
print(f'Accuracy on train set: {accuracy_score(y_train, log_reg.predict(X_train)):.3f}')
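The cell above relies on `X_train`, `X_test`, `y_train`, and `y_test`, whose creation is not shown. A minimal end-to-end sketch follows, assuming a stratified 75/25 split with `train_test_split`. The synthetic data from `make_classification` is only a stand-in for the encoded features and the "CKD Stages" target, and the split ratio, `random_state`, and `max_iter` values are illustrative choices.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the encoded feature matrix and a three-class target.
X, y = make_classification(n_samples=137, n_features=10, n_informative=4,
                           n_classes=3, random_state=0)

# Stratify so each stage keeps roughly the same share in train and test.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

log_reg = LogisticRegression(max_iter=1000)
log_reg.fit(X_train, y_train)

test_acc = accuracy_score(y_test, log_reg.predict(X_test))
train_acc = accuracy_score(y_train, log_reg.predict(X_train))
print(f'Accuracy on test set: {test_acc:.3f}')
print(f'Accuracy on train set: {train_acc:.3f}')
```

A train accuracy far above the test accuracy would suggest overfitting, which is why the task asks for both numbers.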
Confusion Matrix
# TODO: Plot a Confusion Matrix based on your findings from the previous step
df = pd.DataFrame(data_, columns=['y_true','y_pred'])
confusion_matrix = pd.crosstab(df['y_true'], df['y_pred'], rownames=['ACTUAL'], colnames=['PREDICTED'])
sns.heatmap(confusion_matrix, annot=True)
plt.show()
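The cell above uses a `data_` object that is not defined in the visible code. A minimal sketch of the same `pd.crosstab` approach is shown here, assuming `data_` is a dictionary of true and predicted labels. The label lists are made up for illustration, and the result is stored in `cm` so that the name does not clash with scikit-learn's `confusion_matrix` function.

```python
import pandas as pd

# Hypothetical true and predicted stages, standing in for y_test and y_pred.
y_true = [1, 2, 3, 2, 1, 3, 2]
y_hat = [1, 2, 2, 2, 1, 3, 3]

data_ = {'y_true': y_true, 'y_pred': y_hat}
results = pd.DataFrame(data_, columns=['y_true', 'y_pred'])

# Rows are actual stages, columns are predicted stages.
cm = pd.crosstab(results['y_true'], results['y_pred'],
                 rownames=['ACTUAL'], colnames=['PREDICTED'])
print(cm)
```

The diagonal holds the correct predictions, and passing `cm` to `sns.heatmap(cm, annot=True)` would draw it as in the cell above.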
K-nearest Neighbors
# TO DO: Use K-nearest Neighbors to predict the stage of the kidney function
# and print the accuracies of your model on both train and test data
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
print(f' Accuracy on test set: {accuracy_score(y_test, y_pred):.3f}')
# TO DO: Find the optimal number of neighbors and print the accuracies
# on both test and train data for that number of neighbors.
accuracies = []
for N in range(1,20):
    knn = KNeighborsClassifier(n_neighbors=N)
    knn.fit(X_train, y_train)
    y_pred = knn.predict(X_test)
    acc = accuracy_score(y_test, y_pred)
    accuracies.append(acc)
Best k: 4
Best Accuracy from kNN: 0.886
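The step that turns the list of accuracies into a "Best k" line is not shown in the loop above. One possible sketch follows, assuming the best k is simply the one with the highest test accuracy. The synthetic data and the names `k_values`, `test_accuracies`, `train_accuracies`, and `best_k` are illustrative, so the printed numbers will not match the notebook's.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the encoded features and the three-class target.
X, y = make_classification(n_samples=137, n_features=10, n_informative=4,
                           n_classes=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

k_values = list(range(1, 20))  # k = 1 .. 19, as in the loop above
test_accuracies = []
train_accuracies = []
for k in k_values:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    test_accuracies.append(accuracy_score(y_test, knn.predict(X_test)))
    train_accuracies.append(accuracy_score(y_train, knn.predict(X_train)))

# np.argmax returns the first index of the maximum, so ties go to the smaller k.
best_index = int(np.argmax(test_accuracies))
best_k = k_values[best_index]
print(f'Best k: {best_k}')
print(f'Best Accuracy from kNN: {test_accuracies[best_index]:.3f}')
print(f'Train accuracy at best k: {train_accuracies[best_index]:.3f}')
```

Choosing k on the test set tends to give an optimistic score. Cross-validation on the training data, for example with scikit-learn's `cross_val_score`, is a common alternative.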
If your code runs cleanly all the way through, then print it as a PDF and submit it to Blackboard Module 6 for grading.