0% found this document useful (0 votes)
6 views22 pages

Logistic _Regresssion

The document is a Jupyter Notebook focused on Logistic Regression for binary classification problems, detailing terminology, common measures, and implementation using Python libraries. It includes explanations of key concepts such as confusion matrix, accuracy, precision, recall, and ROC curve, along with code snippets for data processing and model training. The notebook demonstrates the use of logistic regression to predict admission based on exam scores, showcasing the model's performance metrics and predictions.

Uploaded by

oulla898
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
6 views22 pages

Logistic _Regresssion

The document is a Jupyter Notebook focused on Logistic Regression for binary classification problems, detailing terminology, common measures, and implementation using Python libraries. It includes explanations of key concepts such as confusion matrix, accuracy, precision, recall, and ROC curve, along with code snippets for data processing and model training. The notebook demonstrates the use of logistic regression to predict admission based on exam scores, showcasing the model's performance metrics and predictions.

Uploaded by

oulla898
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 22

7/2/22, 11:32 PM Logistic - Jupyter Notebook

Logistic Regression
Used to predict classes for binary classification problems

localhost:8888/notebooks/ML/ECCE5001/Logistic.ipynb 1/22
7/2/22, 11:32 PM Logistic - Jupyter Notebook

localhost:8888/notebooks/ML/ECCE5001/Logistic.ipynb 2/22
7/2/22, 11:32 PM Logistic - Jupyter Notebook

localhost:8888/notebooks/ML/ECCE5001/Logistic.ipynb 3/22
7/2/22, 11:32 PM Logistic - Jupyter Notebook

localhost:8888/notebooks/ML/ECCE5001/Logistic.ipynb 4/22
7/2/22, 11:32 PM Logistic - Jupyter Notebook

localhost:8888/notebooks/ML/ECCE5001/Logistic.ipynb 5/22
7/2/22, 11:32 PM Logistic - Jupyter Notebook

localhost:8888/notebooks/ML/ECCE5001/Logistic.ipynb 6/22
7/2/22, 11:32 PM Logistic - Jupyter Notebook

Terminology
• Type of classification outputs:
– True positive (m11): Example of class 1 predicted as class 1.
– False positive (m01): Example of class 0 predicted as class 1. Type 1 error.
– True negative (m00): Example of class 0 predicted as class 0.
– False negative (m10): Example of class 1 predicted as class 0. Type II error.
• Total number of instances: m = m00 + m01 + m10 + m11

Error rate: (m01 + m10) / m


– If the classes are imbalanced (e.g. 10% from class 1, 90% from class 0), one can
achieve low error (e.g. 10%) by classifying everything as coming from class 0!

Confusion matrix

localhost:8888/notebooks/ML/ECCE5001/Logistic.ipynb 7/22
7/2/22, 11:32 PM Logistic - Jupyter Notebook

Common measures
• Accuracy = (TP+ TN) / (TP + FP + FN + TN)
• Precision = True positives / Total number of declared positives
= TP / (TP+ FP)
• Recall = True positives / Total number of actual positives
= TP / (TP + FN)
• Sensitivity is the same as recall.
• Specificity = True negatives / Total number of actual negatives
= TN / (FP + TN)
• False positive rate = FP / (FP + TN)
• F1 measure= 2(Precision*Recall/(Precision+Recall))

localhost:8888/notebooks/ML/ECCE5001/Logistic.ipynb 8/22
7/2/22, 11:32 PM Logistic - Jupyter Notebook

Receiver-operator characteristic (ROC) curve

localhost:8888/notebooks/ML/ECCE5001/Logistic.ipynb 9/22
7/2/22, 11:32 PM Logistic - Jupyter Notebook

K Fold Cross Validation

localhost:8888/notebooks/ML/ECCE5001/Logistic.ipynb 10/22
7/2/22, 11:32 PM Logistic - Jupyter Notebook

localhost:8888/notebooks/ML/ECCE5001/Logistic.ipynb 11/22
7/2/22, 11:32 PM Logistic - Jupyter Notebook

In [4]:  1 import numpy as np


2 import pandas as pd
3 import matplotlib.pyplot as plt
4 import scipy.optimize as opt
5 ​
6 path = 'data1.txt'
7 data = pd.read_csv(path, header=None, names=['Exam 1', 'Exam 2', 'Admitte
8 print('data = ')
9 print(data.head(10) )
10 print()
11 print('data.describe = ')
12 print(data.describe())
13 positive = data[data['Admitted'].isin([1])]
14 negative = data[data['Admitted'].isin([0])]
15 fig, ax = plt.subplots(figsize=(5,5))
16 ax.scatter(positive['Exam 1'], positive['Exam 2'], s=50, c='b', marker='o
17 label='Admitted')
18 ax.scatter(negative['Exam 1'], negative['Exam 2'], s=50, c='r', marker='x
19 ax.legend()
20 ax.set_xlabel('Exam 1 Score')
21 ax.set_ylabel('Exam 2 Score')
22 data['Admitted'].value_counts()

data =
Exam 1 Exam 2 Admitted
0 34.623660 78.024693 0
1 30.286711 43.894998 0
2 35.847409 72.902198 0
3 60.182599 86.308552 1
4 79.032736 75.344376 1
5 45.083277 56.316372 0
6 61.106665 96.511426 1
7 75.024746 46.554014 1
8 76.098787 87.420570 1
9 84.432820 43.533393 1

data.describe =
Exam 1 Exam 2 Admitted
count 100.000000 100.000000 100.000000
mean 65.644274 66.221998 0.600000
std 19.458222 18.582783 0.492366
min 30.058822 30.603263 0.000000
25% 50.919511 48.179205 0.000000
50% 67.032988 67.682381 1.000000
75% 80.212529 79.360605 1.000000
max 99.827858 98.869436 1.000000

Out[4]: 1 60
0 40
Name: Admitted, dtype: int64

localhost:8888/notebooks/ML/ECCE5001/Logistic.ipynb 12/22
7/2/22, 11:32 PM Logistic - Jupyter Notebook

In [5]:  1 def sigmoid(z):


2 return 1 / (1 + np.exp(-z))
3 ​
4 def cost(theta, X, y):
5 theta = np.matrix(theta)
6 X = np.matrix(X)
7 y = np.matrix(y)
8 first = np.multiply(-y, np.log(sigmoid(X * theta.T)))
9 second = np.multiply((1 - y), np.log(1 - sigmoid(X * theta.T)))
10 return np.sum(first - second) / (len(X))
11 def gradient(theta, X, y):
12 theta = np.matrix(theta)
13 X = np.matrix(X)
14 y = np.matrix(y)
15 parameters = int(theta.ravel().shape[1])
16 grad = np.zeros(parameters)
17 error = sigmoid(X * theta.T) - y
18 for i in range(parameters):
19 term = np.multiply(error, X[:,i])
20 grad[i] = np.sum(term) / len(X)
21 return grad
22 def predict(theta, X):
23 probability = sigmoid(X * theta.T)
24 return [1 if x >= 0.5 else 0 for x in probability]
25 ​

localhost:8888/notebooks/ML/ECCE5001/Logistic.ipynb 13/22
7/2/22, 11:32 PM Logistic - Jupyter Notebook

In [6]:  1 nums = np.arange(-10, 10, step=1)


2 fig, ax = plt.subplots(figsize=(5,5))
3 ax.plot(nums, sigmoid(nums), 'r')
4 # add a ones column - this makes the matrix multiplication work out easie
5 data.insert(0, 'Ones', 1)
6 # set X (training data) and y (target variable)
7 cols = data.shape[1]
8 X = data.iloc[:,0:cols-1]
9 y = data.iloc[:,cols-1:cols]
10 # convert to numpy arrays and initalize the parameter array theta
11 X = np.array(X.values)
12 y = np.array(y.values)
13 theta = np.zeros(3)
14 print()
15 print('X.shape = ' , X.shape)
16 print('theta.shape = ' , theta.shape)
17 print('y.shape = ' , y.shape)
18 thiscost = cost(theta, X, y)
19 print()
20 print('cost = ' , thiscost)
21 ​
22 result = opt.fmin_tnc(func=cost, x0=theta, fprime=gradient, args=(X, y))
23 costafteroptimize = cost(result[0], X, y)
24 print()
25 print('cost after optimize = ' , costafteroptimize)
26 print()
27 ​
28 theta_min = np.matrix(result[0])
29 predictions = predict(theta_min, X)
30 correct = [1 if ((a == 1 and b == 1) or (a == 0 and b == 0)) else 0 for (
31 zip(predictions, y)]
32 accuracy = (sum(map(int, correct)) % len(correct))
33 print ('accuracy = {0}%'.format(accuracy))

X.shape = (100, 3)
theta.shape = (3,)
y.shape = (100, 1)

cost = 0.6931471805599453

cost after optimize = 0.20349770158947486

accuracy = 89%

localhost:8888/notebooks/ML/ECCE5001/Logistic.ipynb 14/22
7/2/22, 11:32 PM Logistic - Jupyter Notebook

localhost:8888/notebooks/ML/ECCE5001/Logistic.ipynb 15/22
7/2/22, 11:32 PM Logistic - Jupyter Notebook

In [7]:  1 from sklearn.model_selection import train_test_split


2 from sklearn.linear_model import LogisticRegression
3 from sklearn.metrics import confusion_matrix
4 import seaborn as sns
5 import matplotlib.pyplot as plt
6 from sklearn.metrics import accuracy_score
7 from sklearn.metrics import f1_score
8 from sklearn.metrics import recall_score
9 from sklearn.metrics import precision_score
10 from sklearn.metrics import precision_recall_fscore_support
11 from sklearn.metrics import precision_recall_curve
12 from sklearn.metrics import classification_report
13 from sklearn.metrics import roc_curve
14 from sklearn.metrics import auc
15 from sklearn.metrics import roc_auc_score
16 from sklearn.metrics import zero_one_loss
17 from sklearn.metrics import plot_roc_curve
18 path='data1.txt'
19 data = pd.read_csv(path, header=None, names=['Exam 1', 'Exam 2', 'Admitte
20 ositive = data[data['Admitted'].isin([1])]
21 negative = data[data['Admitted'].isin([0])]
22 fig, ax = plt.subplots(figsize=(5,5))
23 ax.scatter(positive['Exam 1'], positive['Exam 2'], s=50, c='b', marker='o
24 label='Admitted')
25 ax.scatter(negative['Exam 1'], negative['Exam 2'], s=50, c='r', marker='x
26 ax.legend()
27 ax.set_xlabel('Exam 1 Score')
28 ax.set_ylabel('Exam 2 Score')
29 cols = data.shape[1]
30 X = data.iloc[:,0:cols-1]
31 y = data.iloc[:,cols-1:cols]
32 X = np.array(X.values)
33 y = np.array(y.values)
34 X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
35 ​

localhost:8888/notebooks/ML/ECCE5001/Logistic.ipynb 16/22
7/2/22, 11:32 PM Logistic - Jupyter Notebook

In [8]:  1 LogisticRegressionModel = LogisticRegression(solver='liblinear')


2 LogisticRegressionModel.fit(X_train, y_train)
3 LogisticRegression()
4 #Calculating Details
5 print('LogisticRegressionModel Train Score is : ' , LogisticRegressionMod
6 print('LogisticRegressionModel Test Score is : ' , LogisticRegressionMode
7 print('LogisticRegressionModel Classes are : ' , LogisticRegressionModel
8 print('LogisticRegressionModel No. of iteratios is : ' , LogisticRegressi
9 print('----------------------------------------------------')
10 ​

LogisticRegressionModel Train Score is : 0.8625


LogisticRegressionModel Test Score is : 0.9
LogisticRegressionModel Classes are : [0 1]
LogisticRegressionModel No. of iteratios is : [13]
----------------------------------------------------

C:\Users\ralhm\anaconda3\lib\site-packages\sklearn\utils\validation.py:63:
DataConversionWarning: A column-vector y was passed when a 1d array was exp
ected. Please change the shape of y to (n_samples, ), for example using rav
el().
return f(*args, **kwargs)

In [9]:  1 #Calculating Prediction


2 y_pred = LogisticRegressionModel.predict(X_test)
3 y_pred_prob = LogisticRegressionModel.predict_proba(X_test)
4 print('Predicted Value for LogisticRegressionModel is : ' , y_pred[:10])
5 print('Prediction Probabilities Value for LogisticRegressionModel is : '
6 ​
7 #----------------------------------------------------
8 ​

Predicted Value for LogisticRegressionModel is : [1 0 1 1 0 0 1 1 1 1]


Prediction Probabilities Value for LogisticRegressionModel is : [[0.163841
75 0.83615825]
[0.60151281 0.39848719]
[0.28609122 0.71390878]
[0.20357648 0.79642352]
[0.59404439 0.40595561]
[0.50512972 0.49487028]
[0.3937627 0.6062373 ]
[0.35861105 0.64138895]
[0.25021584 0.74978416]
[0.19457158 0.80542842]]

localhost:8888/notebooks/ML/ECCE5001/Logistic.ipynb 17/22
7/2/22, 11:32 PM Logistic - Jupyter Notebook

In [11]:  1 #Calculating Confusion Matrix


2 CM = confusion_matrix(y_test, y_pred)
3 print('Confusion Matrix is : \n', CM)
4 ​
5 # drawing confusion matrix
6 sns.heatmap(CM, center = True)
7 plt.show()
8 plot_roc_curve(LogisticRegressionModel,X_test, y_test)
9 plt.show()
10 print(y_test)
11 ​
12 #----------------------------------------------------
13 ​
14 ​

Confusion Matrix is :
[[ 4 2]
[ 0 14]]

localhost:8888/notebooks/ML/ECCE5001/Logistic.ipynb 18/22
7/2/22, 11:32 PM Logistic - Jupyter Notebook

[[1]
[0]
[0]
[1]
[0]
[0]
[1]
[1]
[1]
[1]
[0]
[1]
[1]
[0]
[1]
[1]
[1]
[1]
[1]
[1]]

localhost:8888/notebooks/ML/ECCE5001/Logistic.ipynb 19/22
7/2/22, 11:32 PM Logistic - Jupyter Notebook

In [47]:  1 #Calculating Accuracy Score : ((TP + TN) / float(TP + TN + FP + FN))


2 AccScore = accuracy_score(y_test, y_pred, normalize=True)
3 print('Accuracy Score is : ', AccScore)
4 ​
5 #----------------------------------------------------
6 #Calculating F1 Score : 2 * (precision * recall) / (precision + recall)
7 # f1_score(y_true, y_pred, labels=None, pos_label=1, average=’binary’, sa
8 ​
9 F1Score = f1_score(y_test, y_pred, average='micro') #it can be : binary,m
10 print('F1 Score is : ', F1Score)
11 ​
12 #----------------------------------------------------
13 #Calculating Recall Score : (Sensitivity) (TP / float(TP + FN)) 1 / 1+2
14 # recall_score(y_true, y_pred, labels=None, pos_label=1, average=’binary
15 ​
16 RecallScore = recall_score(y_test, y_pred, average='micro') #it can be :
17 print('Recall Score is : ', RecallScore)
18 ​
19 #----------------------------------------------------
20 #Calculating Precision Score : (Specificity) #(TP / float(TP + FP))
21 # precision_score(y_true, y_pred, labels=None, pos_label=1, average=’bina
22 ​
23 PrecisionScore = precision_score(y_test, y_pred, average='micro') #it can
24 print('Precision Score is : ', PrecisionScore)
25 ​
26 #----------------------------------------------------
27 #Calculating Precision recall Score :
28 #metrics.precision_recall_fscore_support(y_true, y_pred, beta=1.0, labels
29 # None, warn_for = ('precision’,’r
30 ​
31 PrecisionRecallScore = precision_recall_fscore_support(y_test, y_pred, av
32 print('Precision Recall Score is : ', PrecisionRecallScore)
33 ​
34 #----------------------------------------------------
35 #Calculating Precision recall Curve :
36 # precision_recall_curve(y_true, probas_pred, pos_label=None, sample_weig
37 ​
38 PrecisionValue, RecallValue, ThresholdsValue = precision_recall_curve(y_t
39 print('Precision Value is : ', PrecisionValue)
40 print('Recall Value is : ', RecallValue)
41 print('Thresholds Value is : ', ThresholdsValue)
42 ​
43 #----------------------------------------------------
44 #Calculating classification Report :
45 #classification_report(y_true, y_pred, labels=None, target_names=None,sam
46 ​
47 ClassificationReport = classification_report(y_test,y_pred)
48 print('Classification Report is : ', ClassificationReport )
49 ​
50 #----------------------------------------------------
51 #Calculating Area Under the Curve :
52 ​
53 fprValue2, tprValue2, thresholdsValue2 = roc_curve(y_test,y_pred)
54 AUCValue = auc(fprValue2, tprValue2)
55 print('AUC Value : ', AUCValue)
56 ​

localhost:8888/notebooks/ML/ECCE5001/Logistic.ipynb 20/22
7/2/22, 11:32 PM Logistic - Jupyter Notebook

57 #----------------------------------------------------
58 #Calculating Receiver Operating Characteristic :
59 #roc_curve(y_true, y_score, pos_label=None, sample_weight=None,drop_inter
60 ​
61 fprValue, tprValue, thresholdsValue = roc_curve(y_test,y_pred)
62 print('fpr Value : ', fprValue)
63 print('tpr Value : ', tprValue)
64 print('thresholds Value : ', thresholdsValue)
65 ​
66 #----------------------------------------------------
67 #Calculating ROC AUC Score:
68 #roc_auc_score(y_true, y_score, average=’macro’, sample_weight=None,max_f
69 ​
70 ROCAUCScore = roc_auc_score(y_test,y_pred, average='micro') #it can be :
71 print('ROCAUC Score : ', ROCAUCScore)
72 ​

Accuracy Score is : 0.9


F1 Score is : 0.9
Recall Score is : 0.9
Precision Score is : 0.9
Precision Recall Score is : (0.9, 0.9, 0.9, None)
Precision Value is : [0.875 1. ]
Recall Value is : [1. 0.]
Thresholds Value is : [1]
Classification Report is : precision recall f1-score s
upport

0 1.00 0.67 0.80 6


1 0.88 1.00 0.93 14

accuracy 0.90 20
macro avg 0.94 0.83 0.87 20
weighted avg 0.91 0.90 0.89 20

AUC Value : 0.8333333333333334


fpr Value : [0. 0.33333333 1. ]
tpr Value : [0. 1. 1.]
thresholds Value : [2 1 0]
ROCAUC Score : 0.8333333333333334

localhost:8888/notebooks/ML/ECCE5001/Logistic.ipynb 21/22
7/2/22, 11:32 PM Logistic - Jupyter Notebook

In [67]:  1 #Multiple classes


2 from sklearn.datasets import load_iris
3 from sklearn.linear_model import LogisticRegression
4 X, y = load_iris(return_X_y=True)
5 clf = LogisticRegression(random_state=10, solver='lbfgs' , max_iter= 1000
6 #clf = LogisticRegression(random_state=10, solver='liblinear')
7 #clf = LogisticRegression(random_state=10, solver='saga')
8 print(X.shape)
9 print(X[:2, :])
10 clf.fit(X, y)
11 clf.predict(X[:2, :])
12 print(clf.predict_proba(X[:2, :]))
13 ​
14 score = clf.score(X, y)
15 ​
16 print('score = ' , score)
17 print('No of iterations = ' , clf.n_iter_)
18 print('Classes = ' , clf.classes_)
19 y_pred = clf.predict(X)
20 CM = confusion_matrix(y, y_pred)
21 print(CM)

(150, 4)
[[5.1 3.5 1.4 0.2]
[4.9 3. 1.4 0.2]]
[[9.69810844e-01 3.01885609e-02 5.94808016e-07]
[9.55989484e-01 4.40095854e-02 9.30437140e-07]]
score = 0.9666666666666667
No of iterations = [85]
Classes = [0 1 2]
[[50 0 0]
[ 0 47 3]
[ 0 2 48]]

In [ ]:  1 ​

localhost:8888/notebooks/ML/ECCE5001/Logistic.ipynb 22/22

You might also like