HCIN 620 Lab 6 Course Project

In this project we are tasked with predicting the stages of Chronic Kidney Disease based on Glomerular Filtration Rate (GFR). Information about
the dataset is available at http://archive.ics.uci.edu/ml//datasets/Chronic_Kidney_Disease

In order to succeed in this final you will be required to write most of the Python code. Please review all the previous labs before you begin, and
read the instructions carefully.

Rather than using a question/answer format, we have commented the code cells with the notation #TODO. This is a placeholder technique (a To
Do list, so to speak) commonly used in machine learning. Complete each #TODO task requested of you.

Good Luck!

Notebook by Reza Afra, Ph.D. and Barbara Berkovich, Ph.D., M.A.

Last update December 28, 2020

Step 1: Environment Setup

You have learned that in order to set up an environment for your project, you first need to import the libraries you need.

#ANSWER KEY TODO 1. Import all the libraries

#Extra libraries are okay; only the minimum needed to complete
#this project without errors is required.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use('ggplot')

from sklearn.preprocessing import OneHotEncoder, StandardScaler

from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error, accuracy_score, confusion_matrix
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.compose import ColumnTransformer

# Suppress pesky warnings

import warnings
warnings.filterwarnings("ignore")

#ANSWER KEY TODO 2. Add a print command to acknowledge completion of the import.

print('Import complete')

Import complete

Step 2: Data Cleaning

Upload the data file called data-lab-6-ckd-courseproject.csv.

Read the CSV file into a data frame, and use the name of the data frame to print the first and last 5 rows.

#ANSWER KEY TODO 3. Load the data into a pandas dataframe

df = pd.read_csv('data-lab-6-ckd-courseproject.csv')
data = df.copy()

#ANSWER KEY TODO 4. Use the name of the dataframe to print the first and last five rows.

data

     Age   Blood Pressure  Specific Gravity  Albumin  Sugar  Red Blood Cells  Pus Cell  Pus Cell clumps  Bacteria    Blood Glucose Random  Blood Urea  Serum Creatinine  Sodium  ...
0    48.0  70.0            1.005             4.0      0.0    normal           abnormal  present          notpresent  117.0                 56.0        3.8               111.0   ...
1    53.0  90.0            1.020             2.0      0.0    abnormal         abnormal  present          notpresent  70.0                  107.0       7.2               114.0   ...
2    63.0  70.0            1.010             3.0      0.0    abnormal         abnormal  present          notpresent  380.0                 60.0        2.7               131.0   ...
3    68.0  80.0            1.010             3.0      2.0    normal           abnormal  present          present     157.0                 90.0        4.1               130.0   ...
4    61.0  80.0            1.015             2.0      0.0    abnormal         abnormal  notpresent       notpresent  173.0                 148.0       3.9               135.0   ...
...  ...   ...             ...               ...      ...    ...              ...       ...              ...         ...                   ...         ...               ...     ...
153  55.0  80.0            1.020             0.0      0.0    normal           normal    notpresent       notpresent  140.0                 49.0        0.5               150.0   ...
154  42.0  70.0            1.025             0.0      0.0    normal           normal    notpresent       notpresent  75.0                  31.0        1.2               141.0   ...
155  12.0  80.0            1.020             0.0      0.0    normal           normal    notpresent       notpresent  100.0                 26.0        0.6               137.0   ...
156  17.0  60.0            1.025             0.0      0.0    normal           normal    notpresent       notpresent  114.0                 50.0        1.0               135.0   ...
157  58.0  80.0            1.025             0.0      0.0    normal           normal    notpresent       notpresent  131.0                 18.0        1.1               141.0   ...

158 rows × 25 columns

data.isnull().sum()

Age 0
Blood Pressure 0
Specific Gravity 0
Albumin 0
Sugar 0
Red Blood Cells 0
Pus Cell 0
Pus Cell clumps 0
Bacteria 0
Blood Glucose Random 0
Blood Urea 0
Serum Creatinine 0
Sodium 0
Potassium 0
Hemoglobin 0
Packed Cell Volume 0
White Blood Cell Count 0
Red Blood Cell Count 0
Hypertension 0
Diabetes Mellitus 0
Coronary Artery Disease 0
Appetite 0
Pedal Edema 0
Anemia 0
Class 0
dtype: int64

le = LabelEncoder()
# drop_duplicates() returns a new frame, so assign the result to keep it
data = data.drop_duplicates()
data

(Same preview as above: no rows were dropped.)

158 rows × 25 columns

The data is located at 'https://drive.google.com/file/d/19oQaKN4NQiGa6Tq9P18Zk3ImbKgzIw9m/view?usp=sharing'. In the following cell,
connect to the data and print a statement that indicates you've successfully completed this step.

# Load the data into a pandas dataframe

url = 'https://drive.google.com/file/d/19oQaKN4NQiGa6Tq9P18Zk3ImbKgzIw9m/view?usp=sharing'
path = 'https://drive.google.com/uc?export=download&id=' + url.split('/')[-2]
df = pd.read_csv(path)
data = df.copy()
print('Data connection complete')

Data connection complete
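
The loading cell above relies on the file ID being the second-to-last path segment of the Drive sharing link. A minimal sketch of just that string step, with no network access; the helper name `drive_download_url` is hypothetical and not part of the lab:

```python
def drive_download_url(share_url: str) -> str:
    """Build a direct-download URL from a Drive 'file/d/<id>/view' sharing link."""
    # The sharing link ends with .../file/d/<id>/view?usp=sharing, so the
    # file ID is the second-to-last '/'-separated segment.
    file_id = share_url.split('/')[-2]
    return 'https://drive.google.com/uc?export=download&id=' + file_id


share_url = 'https://drive.google.com/file/d/19oQaKN4NQiGa6Tq9P18Zk3ImbKgzIw9m/view?usp=sharing'
print(drive_download_url(share_url))
```

If the sharing link had a different shape (for example, no trailing `/view` segment), the index `-2` would pick the wrong segment, so this only holds for links formatted like the one above.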

Add more data cleaning...

Step 3: Exploratory Data Analysis (EDA) and Preprocessing

Next we're going to build the targets, which are the stages of CKD, in a column we will call "CKD Stages". There are various equations for calculating
GFR, but here we will stick with a simplified form. Please read about GFR at the following link. Source:
https://www.niddk.nih.gov/health-information/professionals/clinical-tools-patient-management/kidney-disease/laboratory-evaluation/glomerular-filtration-rate/estimating

# Uses a formula given by NIDDK, simplified further by
# removing the effect of race and gender:
# GFR (mL/min/1.73 m^2) = 175 × (Scr)^-1.154 × (Age)^-0.203

# DO NOT change the code below

def calc_gfr(Scr, Age):
    return (175 * (Scr) ** -1.154) * (Age ** -0.203)

# I tried 6 stages but the dataset is too small, so classes 4 and 5
# had no instances. I reduced the number of classes to 3.
def calc_ckd_stage(gfr):
    bins = [0, 45, 90, 250]
    labels = [3, 2, 1]
    ret = pd.cut(gfr, bins=bins, labels=labels)
    return ret
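
To see what the formula and the bins do for individual patients, here is a plain-Python sketch that works on single numbers rather than pandas columns. The helper `ckd_stage_scalar` is a hypothetical stand-in for `calc_ckd_stage`; it assumes the default right-inclusive intervals of `pd.cut`, that is (0, 45], (45, 90] and (90, 250]. The inputs are taken from rows 0, 157 and 153 of the preview table.

```python
def calc_gfr(scr: float, age: float) -> float:
    """Simplified GFR estimate: 175 * Scr^-1.154 * Age^-0.203."""
    return 175 * scr ** -1.154 * age ** -0.203


def ckd_stage_scalar(gfr: float):
    """Hypothetical scalar version of the binning: lower GFR means a higher stage."""
    if 0 < gfr <= 45:
        return 3
    if 45 < gfr <= 90:
        return 2
    if 90 < gfr <= 250:
        return 1
    return None  # outside all bins (pd.cut would give NaN here)


# (Serum Creatinine, Age) pairs copied from the preview table
patients = {'row 0': (3.8, 48.0), 'row 157': (1.1, 58.0), 'row 153': (0.5, 55.0)}
stages = {}
for name, (scr, age) in patients.items():
    gfr = calc_gfr(scr, age)
    stages[name] = ckd_stage_scalar(gfr)
    print(f'{name}: GFR = {gfr:.1f}, stage = {stages[name]}')
```

Because serum creatinine enters with a negative exponent, a high creatinine value drives the estimate down and the stage up, which is why row 0 lands in the most severe class.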

# DO NOT change the code below

data["GFR"] = calc_gfr(data["Serum Creatinine"], data["Age"])
gfr = data['GFR']
removed_outliers = gfr.between(gfr.quantile(.00), gfr.quantile(.95))

data = data[removed_outliers]
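
The mask above keeps every row whose GFR lies between the 0th and 95th percentile, so only the largest values are dropped. A small sketch on made-up numbers (the series below is illustrative, not taken from the dataset):

```python
import pandas as pd

# Nine ordinary values plus one extreme value
gfr = pd.Series([10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0, 80.0, 90.0, 1000.0])

# between() is inclusive on both ends by default, so the minimum is kept
keep = gfr.between(gfr.quantile(0.00), gfr.quantile(0.95))
filtered = gfr[keep]

print(f'kept {len(filtered)} of {len(gfr)} values; max is now {filtered.max()}')
```

Note that the variable called `removed_outliers` in the lab is really a keep-mask: it is True for the rows that stay.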

# DO NOT change the code below

data["CKD Stages"] =calc_ckd_stage(data["GFR"])

# DO NOT change the code below

data["CKD Stages"].value_counts()

2 53
1 45
3 39
Name: CKD Stages, dtype: int64

Histogram

# TODO: Print a histogram of the values of "Serum Creatinine"

data["Serum Creatinine"].hist();

Scatterplot

# TODO: Print a scatterplot of the values of "GFR" vs "Serum Creatinine"

# Pass x and y as keywords; recent seaborn releases do not accept them positionally
sns.scatterplot(x=data["Serum Creatinine"], y=data["GFR"]);

Isolate features from target

# TO DO: Isolate features and targets

X = data.drop(['Class', 'CKD Stages'], axis=1)

y = data["CKD Stages"]

There are various ways to encode categorical data. You learned about some of them during labs. Visit this link and read it thoroughly. Find an
appropriate encoding scheme and transform the categorical attributes of your dataset.

# TO DO: Isolate the categorical features and encode them using a proper encoding method.
# TO DO: Isolate the numerical features and scale them using a proper scaling method.

numerical_ix = X.select_dtypes(include=['float64']).columns
categorical_ix = X.select_dtypes(include=['object']).columns

#define the data preparation for the columns

t = [('cat', OneHotEncoder(), categorical_ix), ('num', StandardScaler(), numerical_ix)]
transform = ColumnTransformer(transformers=t)

X = transform.fit_transform(X)
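
To make the effect of the transformer concrete, here is a sketch on a four-row toy frame. The toy values are illustrative, the columns are selected with `include='number'` / `exclude='number'` rather than by `float64` / `object` (an assumption made to keep the example independent of the pandas version), and `sparse_threshold=0` is set only so the result is always a dense array that is easy to inspect.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

toy = pd.DataFrame({
    'Age': [48.0, 53.0, 63.0, 68.0],
    'Pus Cell': ['abnormal', 'abnormal', 'normal', 'normal'],
})

numerical_ix = toy.select_dtypes(include='number').columns
categorical_ix = toy.select_dtypes(exclude='number').columns

transform = ColumnTransformer(
    transformers=[
        ('cat', OneHotEncoder(), categorical_ix),   # one 0/1 column per category
        ('num', StandardScaler(), numerical_ix),    # zero mean, unit variance
    ],
    sparse_threshold=0,  # always return a dense array
)

encoded = transform.fit_transform(toy)
print(encoded.shape)   # two one-hot columns followed by one scaled column
print(encoded)
```

The output columns follow the order of the `transformers` list, so the one-hot columns come first and the scaled numerical columns last, regardless of the column order in the original frame.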

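#TO DO: Print a heatmap of correlation between features

plt.figure(figsize=(8, 8))

# Correlate only the numerical columns so the tick labels match the matrix
corr = data[numerical_ix].corr()
sns.heatmap(corr, cbar=True, annot=False, yticklabels=numerical_ix, xticklabels=numerical_ix);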

Split the Data

# TODO: Split the data to train and test sets

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=data['CKD Stages'], random_state=308)
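
The `stratify` argument asks for the class proportions of the target to be preserved in both splits. A sketch on synthetic labels (the arrays are made up; the default test size of 25% is assumed, as in the call above):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 40 synthetic samples with class proportions 50% / 30% / 20%
X = np.arange(80).reshape(40, 2)
y = np.array([1] * 20 + [2] * 12 + [3] * 8)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=308
)

# Count how many samples of each class ended up in each split
train_counts = np.bincount(y_train)[1:]
test_counts = np.bincount(y_test)[1:]
print('train:', train_counts, 'test:', test_counts)
```

With a dataset as small as the one in this lab, stratifying matters: without it, a random split could leave one stage nearly absent from the test set.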

Step 4: Build the Models and Evaluate

Logistic Regression

# Use Logistic Regression to predict the stage of the kidney function
# and print the accuracies of your model on both train and test data
log_reg = LogisticRegression()
log_reg.fit(X_train, y_train)
y_pred = log_reg.predict(X_test)
print(f' Accuracy on test set: {accuracy_score(y_test, y_pred):.3f}')
print(f' \nAccuracy on train set: {accuracy_score(y_train, log_reg.predict(X_train)):.3f}')

Accuracy on test set: 0.971

Accuracy on train set: 0.990

Confusion Matrix

# TODO: Plot a Confusion Matrix based on your findings from the previous step

# Use new names so the imported confusion_matrix function and the
# original df are not overwritten
results = pd.DataFrame({'y_true': y_test, 'y_pred': y_pred})
cm = pd.crosstab(results['y_true'], results['y_pred'], rownames=['ACTUAL'], colnames=['PREDICTED'])

sns.heatmap(cm, annot=True)
plt.show()
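
The cross-tabulation used for the confusion matrix can be checked by hand on a few labels. In this sketch the labels are made up; it shows that the diagonal of the table counts the correct predictions, so diagonal divided by total equals the accuracy.

```python
import numpy as np
import pandas as pd

# Made-up true and predicted stages for six patients
y_true = pd.Series([1, 1, 2, 2, 3, 3], name='ACTUAL')
y_pred = pd.Series([1, 2, 2, 2, 3, 1], name='PREDICTED')

# Rows are actual classes, columns are predicted classes
cm = pd.crosstab(y_true, y_pred)
print(cm)

# Correct predictions sit on the diagonal
accuracy_from_cm = np.trace(cm.values) / cm.values.sum()
accuracy_direct = (y_true == y_pred).mean()
print(accuracy_from_cm, accuracy_direct)
```

The trace shortcut only works when every class appears among both the actual and the predicted labels, so that the table is square with matching row and column order.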

K-nearest Neighbors

# TO DO: Use K-nearest Neighbors to predict the stage of the kidney function
# and print the accuracies of your model on both train and test data
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)

y_pred = knn.predict(X_test)
print(f' Accuracy on test set: {accuracy_score(y_test, y_pred):.3f}')

Accuracy on test set: 0.857

# TO DO: Find the optimal number of neighbors and print the accuracies
# on both test and train data for that number of neighbors.
accuracies = []
for N in range(1, 20):
    knn = KNeighborsClassifier(n_neighbors=N)
    knn.fit(X_train, y_train)
    y_pred = knn.predict(X_test)
    acc = accuracy_score(y_test, y_pred)
    accuracies.append(acc)

accuracies = np.array(accuracies) # convert to numpy array

# Pass x and y as keywords; recent seaborn releases do not accept them positionally
sns.lineplot(x=np.arange(1, 20), y=accuracies);
# Find the best k
best_k = 1 + np.argmax(accuracies) # add one b/c arrays are 0-indexed
best_accuracy = np.max(accuracies)
print(f"Best k: {best_k} \nBest Accuracy from kNN: {best_accuracy:.3f}")

Best k: 4
Best Accuracy from kNN: 0.886
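
The best-k lookup depends on two details: the offset between array positions and k values, and how `np.argmax` handles ties. A sketch with made-up accuracies (not results from this dataset) where two values of k tie for the best score:

```python
import numpy as np

ks = np.arange(1, 8)  # candidate k values 1..7
# Made-up accuracies; k=4 and k=6 tie for the best score
accuracies = np.array([0.80, 0.83, 0.86, 0.89, 0.86, 0.89, 0.83])

best_index = np.argmax(accuracies)  # position of the first maximum
best_k = ks[best_index]             # same as 1 + best_index when ks starts at 1
best_accuracy = accuracies[best_index]

print(f'Best k: {best_k}, accuracy: {best_accuracy:.3f}')
```

Since `np.argmax` returns the first maximum, ties are resolved in favour of the smaller k. Indexing into `ks` instead of adding one keeps the lookup correct if the search range is later changed to start at a value other than 1.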

If your code runs cleanly all the way through, print it as a PDF and submit it to Blackboard Module 6 for grading.
