LogisticRegressionMLModel - Jupyter Notebook

This notebook prepares the Titanic passenger data for logistic regression: it loads and inspects the data, one-hot encodes the categorical Embarked feature, maps Sex to numeric values, imputes missing ages, then trains and evaluates a logistic regression model before and after constructing a family-size feature.



In [3]: import pandas as pd

Collect and Understand the data


In [4]: # Load the Titanic dataset into the titanic variable
titanic = pd.read_csv("titanic_train.csv")

In [7]: titanic.shape

Out[7]: (891, 12)

In [8]: titanic.head()

Out[8]:
   PassengerId  Survived  Pclass                                               Name     Sex   Age  SibSp  Parch            Ticket     Fare Cabin Embarked
0            1         0       3                            Braund, Mr. Owen Harris    male  22.0      1      0         A/5 21171   7.2500   NaN        S
1            2         1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1      0          PC 17599  71.2833   C85        C
2            3         1       3                             Heikkinen, Miss. Laina  female  26.0      0      0  STON/O2. 3101282   7.9250   NaN        S
3            4         1       1       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1      0            113803  53.1000  C123        S
4            5         0       3                           Allen, Mr. William Henry    male  35.0      0      0            373450   8.0500   NaN        S


In [9]: titanic.describe()

Out[9]:
PassengerId Survived Pclass Age SibSp Parch Fare

count 891.000000 891.000000 891.000000 714.000000 891.000000 891.000000 891.000000

mean 446.000000 0.383838 2.308642 29.699118 0.523008 0.381594 32.204208

std 257.353842 0.486592 0.836071 14.526497 1.102743 0.806057 49.693429

min 1.000000 0.000000 1.000000 0.420000 0.000000 0.000000 0.000000

25% 223.500000 0.000000 2.000000 20.125000 0.000000 0.000000 7.910400

50% 446.000000 0.000000 3.000000 28.000000 0.000000 0.000000 14.454200

75% 668.500000 1.000000 3.000000 38.000000 1.000000 0.000000 31.000000

max 891.000000 1.000000 3.000000 80.000000 8.000000 6.000000 512.329200

In [10]: titanic.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 PassengerId 891 non-null int64
1 Survived 891 non-null int64
2 Pclass 891 non-null int64
3 Name 891 non-null object
4 Sex 891 non-null object
5 Age 714 non-null float64
6 SibSp 891 non-null int64
7 Parch 891 non-null int64
8 Ticket 891 non-null object
9 Fare 891 non-null float64
10 Cabin 204 non-null object
11 Embarked 889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


In [11]: import matplotlib.pyplot as plt

In [12]: # Plot the gender distribution
gender_counts = titanic['Sex'].value_counts()
gender_counts.plot(kind='bar', color=['blue', 'pink'])  # blue for male, pink for female
plt.title('Gender Distribution')
plt.xlabel('Gender')
plt.ylabel('Count')
plt.xticks(rotation=0)  # keep the x-axis labels horizontal
plt.show()


Processing of Data
In [14]: # Categorical variables need to be transformed into numeric variables
# Using One Hot Encoding
# There are three ports: C = Cherbourg, Q = Queenstown, S = Southampton

In [15]: ports = pd.get_dummies(titanic.Embarked, prefix='Embarked')

ports.head()

Out[15]:
Embarked_C Embarked_Q Embarked_S

0 0 0 1

1 1 0 0

2 0 0 1

3 0 0 1

4 0 0 1

In [17]: ports.shape

Out[17]: (891, 3)
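
As an aside: the three dummy columns are mutually exclusive, so any one of them is fully determined by the other two. A minimal variant (not used below, where all three columns are kept and joined) that drops the redundant column and avoids perfectly correlated features:

# drop_first=True keeps only Embarked_Q and Embarked_S;
# Embarked_C is implied when both are 0
ports = pd.get_dummies(titanic.Embarked, prefix='Embarked', drop_first=True)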

In [18]: titanic = titanic.join(ports)


# then drop the original column
titanic.drop(['Embarked'], axis=1, inplace=True)

In [20]: titanic.shape

Out[20]: (891, 14)


In [22]: # Transform the gender feature into a numeric one
titanic.Sex = titanic.Sex.map({'male': 0, 'female': 1})

In [24]: titanic.head()

Out[24]:
   PassengerId  Survived  Pclass                                               Name  Sex   Age  SibSp  Parch            Ticket     Fare Cabin  Embarked_C  Embarked_Q  Embarked_S
0            1         0       3                            Braund, Mr. Owen Harris    0  22.0      1      0         A/5 21171   7.2500   NaN           0           0           1
1            2         1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...    1  38.0      1      0          PC 17599  71.2833   C85           1           0           0
2            3         1       3                             Heikkinen, Miss. Laina    1  26.0      0      0  STON/O2. 3101282   7.9250   NaN           0           0           1
3            4         1       1       Futrelle, Mrs. Jacques Heath (Lily May Peel)    1  35.0      1      0            113803  53.1000  C123           0           0           1
4            5         0       3                           Allen, Mr. William Henry    0  35.0      0      0            373450   8.0500   NaN           0           0           1

Extracting the target variable


In [25]: y = titanic.Survived.copy()

In [27]: y.shape

Out[27]: (891,)


In [28]: X = titanic.drop(['Survived'], axis=1) # then drop the target column from the features

In [30]: X.shape

Out[30]: (891, 13)

Dropping the useless columns


In [42]: X.drop(['Name'], axis=1, inplace=True)
X.drop(['PassengerId'], axis=1, inplace=True)

In [43]: X.drop(['Ticket'], axis=1, inplace=True)

In [44]: X.drop(['Cabin'], axis=1, inplace=True)

In [45]: X.shape

Out[45]: (891, 9)
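
The four drops above can equivalently be written as one call; a minimal sketch:

# Drop all four identifier-like columns in a single call
X = X.drop(columns=['Name', 'PassengerId', 'Ticket', 'Cabin'])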


In [46]: X.head()

Out[46]:
Pclass Sex Age SibSp Parch Fare Embarked_C Embarked_Q Embarked_S

0 3 0 22.0 1 0 7.2500 0 0 1

1 1 1 38.0 1 0 71.2833 1 0 0

2 3 1 26.0 0 0 7.9250 0 0 1

3 1 1 35.0 1 0 53.1000 0 0 1

4 3 0 35.0 0 0 8.0500 0 0 1

In [47]: X.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Pclass 891 non-null int64
1 Sex 891 non-null int64
2 Age 714 non-null float64
3 SibSp 891 non-null int64
4 Parch 891 non-null int64
5 Fare 891 non-null float64
6 Embarked_C 891 non-null uint8
7 Embarked_Q 891 non-null uint8
8 Embarked_S 891 non-null uint8
dtypes: float64(2), int64(4), uint8(3)
memory usage: 44.5 KB

Preprocessing NULL values


In [48]: X.isnull().values.any()

Out[48]: True
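
To locate the nulls column by column, a quick check (from the info() output above, only Age should be missing values: 891 - 714 = 177):

# Count missing values per column
print(X.isnull().sum())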

In [49]: null_count = X['Age'].isnull().sum()

# Calculate the total number of rows
total_rows = len(X)

# Calculate the percentage of null values
percentage_null = (null_count / total_rows) * 100

print(f"Percentage of null values in 'age' column: {percentage_null:.2f}%")

Percentage of null values in 'age' column: 19.87%

In [51]: # Because about 20% of Age values are missing, dropping those rows would lose too much data
X.Age.fillna(X.Age.mean(), inplace=True) # replace NaN with the average age
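
A hedged alternative: the describe() output above shows the Age distribution is right-skewed (mean 29.70 vs. median 28.00), so the median is often preferred; a minimal sketch using scikit-learn's SimpleImputer:

from sklearn.impute import SimpleImputer

# Median imputation is less sensitive to skew than the mean;
# ideally fit on the training split only to avoid leaking
# validation statistics
imputer = SimpleImputer(strategy='median')
X[['Age']] = imputer.fit_transform(X[['Age']])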

In [52]: X.isnull().values.any()

Out[52]: False

Split the dataset into training and validation sets


In [53]: from sklearn.model_selection import train_test_split
# 80% goes into the training set, 20% into the validation set
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=7)
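
Since only about 38% of passengers survived (see the describe() output), a stratified split keeps that class ratio consistent across both sets; a minimal variant:

# stratify=y preserves the class balance in both splits
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.2, random_state=7, stratify=y)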

Applying Logistic Regression



In [54]: from sklearn.linear_model import LogisticRegression


model = LogisticRegression()

In [55]: model.fit(X_train, y_train)

C:\ProgramData\anaconda3\lib\site-packages\sklearn\linear_model\_logistic.py:458: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

Out[55]: LogisticRegression()
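
The warning above is worth addressing. A minimal sketch of the two fixes it suggests, scaling the features (Fare and Age are on much larger scales than the 0/1 dummies) and raising max_iter; the notebook itself does not apply this, so the scores below come from the unscaled model:

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Standardize features so lbfgs converges, and raise the iteration cap
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)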

Evaluating the model


In [56]: model.score(X_train, y_train)

Out[56]: 0.8089887640449438

In [57]: model.score(X_valid, y_valid)

Out[57]: 0.7541899441340782
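
Accuracy alone hides which kinds of errors the model makes; a minimal sketch of a fuller evaluation on the validation set:

from sklearn.metrics import confusion_matrix, classification_report

# Rows of the confusion matrix are true classes, columns are predictions
y_pred = model.predict(X_valid)
print(confusion_matrix(y_valid, y_pred))
print(classification_report(y_valid, y_pred))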


Applying feature construction to increase accuracy


In [58]: y = titanic.Survived.copy()

In [59]: y.shape

Out[59]: (891,)

In [60]: X = titanic.drop(['Survived'], axis=1) # then drop the target column from the features

In [61]: X.shape

Out[61]: (891, 13)

In [62]: X.drop(['Name'], axis=1, inplace=True)
X.drop(['PassengerId'], axis=1, inplace=True)

In [63]: X.drop(['Ticket'], axis=1, inplace=True)

In [64]: X.drop(['Cabin'], axis=1, inplace=True)

In [65]: X.shape

Out[65]: (891, 9)


In [67]: # Because about 20% of Age values are missing, dropping those rows would lose too much data
X.Age.fillna(X.Age.mean(), inplace=True) # replace NaN with the average age

In [68]: X.isnull().values.any()

Out[68]: False

In [69]: X.head()

Out[69]:
Pclass Sex Age SibSp Parch Fare Embarked_C Embarked_Q Embarked_S

0 3 0 22.0 1 0 7.2500 0 0 1

1 1 1 38.0 1 0 71.2833 1 0 0

2 3 1 26.0 0 0 7.9250 0 0 1

3 1 1 35.0 1 0 53.1000 0 0 1

4 3 0 35.0 0 0 8.0500 0 0 1

In [70]: X['Family_size'] = X['SibSp'] + X['Parch'] + 1  # siblings/spouses + parents/children + the passenger

In [71]: X.head()

Out[71]:
Pclass Sex Age SibSp Parch Fare Embarked_C Embarked_Q Embarked_S Family_size

0 3 0 22.0 1 0 7.2500 0 0 1 2

1 1 1 38.0 1 0 71.2833 1 0 0 2

2 3 1 26.0 0 0 7.9250 0 0 1 1

3 1 1 35.0 1 0 53.1000 0 0 1 2

4 3 0 35.0 0 0 8.0500 0 0 1 1


In [72]: def myfunc(num):
    if num == 1:
        return 0  # alone
    elif 1 < num <= 4:
        return 1  # small family
    else:
        return 2  # large family

In [73]: X['Family_type'] = X['Family_size'].apply(myfunc)
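
An equivalent, vectorized way to bin the sizes is pd.cut; a minimal sketch (same 0 = alone, 1 = small, 2 = large encoding):

import numpy as np

# Bins (0, 1], (1, 4], (4, inf) reproduce myfunc's three categories
X['Family_type'] = pd.cut(X['Family_size'],
                          bins=[0, 1, 4, np.inf],
                          labels=[0, 1, 2]).astype(int)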

In [74]: X.head()

Out[74]:
Pclass Sex Age SibSp Parch Fare Embarked_C Embarked_Q Embarked_S Family_size Family_type

0 3 0 22.0 1 0 7.2500 0 0 1 2 1

1 1 1 38.0 1 0 71.2833 1 0 0 2 1

2 3 1 26.0 0 0 7.9250 0 0 1 1 0

3 1 1 35.0 1 0 53.1000 0 0 1 2 1

4 3 0 35.0 0 0 8.0500 0 0 1 1 0

In [75]: X.drop(columns=['SibSp','Parch','Family_size'],inplace=True)


In [76]: X.head()

Out[76]:
Pclass Sex Age Fare Embarked_C Embarked_Q Embarked_S Family_type

0 3 0 22.0 7.2500 0 0 1 1

1 1 1 38.0 71.2833 1 0 0 1

2 3 1 26.0 7.9250 0 0 1 0

3 1 1 35.0 53.1000 0 0 1 1

4 3 0 35.0 8.0500 0 0 1 0

In [77]: from sklearn.model_selection import train_test_split

# 80% goes into the training set, 20% into the validation set
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=7)

In [78]: model = LogisticRegression()


In [79]: model.fit(X_train, y_train)

C:\ProgramData\anaconda3\lib\site-packages\sklearn\linear_model\_logistic.py:458: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

Out[79]: LogisticRegression()

In [80]: model.score(X_train, y_train)

Out[80]: 0.8132022471910112

In [81]: model.score(X_valid, y_valid)

Out[81]: 0.7541899441340782
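
The single 80/20 split gives a noisy estimate; a minimal sketch of 5-fold cross-validation for a more stable one (max_iter raised to sidestep the convergence warning seen earlier):

from sklearn.model_selection import cross_val_score

# Average accuracy across 5 folds, plus its spread
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(f"{scores.mean():.4f} +/- {scores.std():.4f}")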


