LogisticRegressionMLModel - Jupyter Notebook
In [7]: titanic.shape
In [8]: titanic.head()
Out[8]:
   PassengerId  Survived  Pclass  Name                     Sex     Age   SibSp  Parch  Ticket            Fare    Cabin  Embarked
0  1            0         3       Braund, Mr. Owen Harris  male    22.0  1      0      A/5 21171         7.2500  NaN    S
2  3            1         3       Heikkinen, Miss. Laina   female  26.0  0      0      STON/O2. 3101282  7.9250  NaN    S
localhost:8888/notebooks/100DaysMLCourse/LogisticRegressionMLModel.ipynb#Evaluate-the-model 1/14
4/26/24, 10:51 PM LogisticRegressionMLModel - Jupyter Notebook
In [9]: titanic.describe()
Out[9]:
PassengerId Survived Pclass Age SibSp Parch Fare
In [10]: titanic.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 PassengerId 891 non-null int64
1 Survived 891 non-null int64
2 Pclass 891 non-null int64
3 Name 891 non-null object
4 Sex 891 non-null object
5 Age 714 non-null float64
6 SibSp 891 non-null int64
7 Parch 891 non-null int64
8 Ticket 891 non-null object
9 Fare 891 non-null float64
10 Cabin 204 non-null object
11 Embarked 889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
Processing of Data
In [14]: # Categorical variables need to be transformed into numeric variables
# Using One Hot Encoding
# There are three ports: C = Cherbourg, Q = Queenstown, S = Southampton
Out[15]:
Embarked_C Embarked_Q Embarked_S
0 0 0 1
1 1 0 0
2 0 0 1
3 0 0 1
4 0 0 1
In [17]: ports.shape
Out[17]: (891, 3)
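The cell that built `ports` is not shown in this export. A minimal sketch of one way to get the same three indicator columns, assuming `pd.get_dummies` with an `Embarked` prefix was used (the small `demo` frame is a stand-in, not the Titanic data):

```python
import pandas as pd

# Tiny stand-in for the Titanic frame; only the Embarked column matters here.
demo = pd.DataFrame({"Embarked": ["S", "C", "S", "S", "Q"]})

# One indicator column per port: Embarked_C, Embarked_Q, Embarked_S.
# dtype=int keeps the 0/1 display regardless of the pandas version.
ports = pd.get_dummies(demo["Embarked"], prefix="Embarked", dtype=int)

# Attach the indicator columns and drop the original text column.
demo = pd.concat([demo, ports], axis=1).drop(columns=["Embarked"])
print(demo)
```

With the default settings, a row whose `Embarked` value is missing gets 0 in all three columns, which is why the encoded frame can keep all 891 rows.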
In [20]: titanic.shape
In [24]: titanic.head()
Out[24]:
   PassengerId  Survived  Pclass  Name                                               Sex  Age   SibSp  Parch  Ticket            Fare     Cabin  Embarked_C  Embarked_Q  Embarked_S
0  1            0         3       Braund, Mr. Owen Harris                            0    22.0  1      0      A/5 21171         7.2500   NaN    0           0           1
1  2            1         1       Cumings, Mrs. John Bradley (Florence Briggs Th...  1    38.0  1      0      PC 17599          71.2833  C85    1           0           0
2  3            1         3       Heikkinen, Miss. Laina                             1    26.0  0      0      STON/O2. 3101282  7.9250   NaN    0           0           1
3  4            1         1       Futrelle, Mrs. Jacques Heath (Lily May Peel)       1    35.0  1      0      113803            53.1000  C123   0           0           1
4  5            0         3       Allen, Mr. William Henry                           0    35.0  0      0      373450            8.0500   NaN    0           0           1
In [27]: y.shape
Out[27]: (891,)
In [30]: X.shape
In [45]: X.shape
Out[45]: (891, 9)
In [46]: X.head()
Out[46]:
Pclass Sex Age SibSp Parch Fare Embarked_C Embarked_Q Embarked_S
0 3 0 22.0 1 0 7.2500 0 0 1
1 1 1 38.0 1 0 71.2833 1 0 0
2 3 1 26.0 0 0 7.9250 0 0 1
3 1 1 35.0 1 0 53.1000 0 0 1
4 3 0 35.0 0 0 8.0500 0 0 1
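The cells that encoded `Sex` and split the frame into `X` and `y` are not shown. The tables suggest male is encoded as 0 and female as 1, and that identifier-like columns were dropped. A sketch under those assumptions, using three rows copied from the outputs above (the `demo` frame and the exact list of dropped columns are illustrative):

```python
import pandas as pd

# Three rows taken from the head() output, with Sex still stored as text.
demo = pd.DataFrame({
    "PassengerId": [1, 3, 5],
    "Survived": [0, 1, 0],
    "Pclass": [3, 3, 3],
    "Name": ["Braund, Mr. Owen Harris", "Heikkinen, Miss. Laina",
             "Allen, Mr. William Henry"],
    "Sex": ["male", "female", "male"],
    "Age": [22.0, 26.0, 35.0],
    "SibSp": [1, 0, 0],
    "Parch": [0, 0, 0],
    "Fare": [7.25, 7.925, 8.05],
})

# Encode Sex the way the tables show it: male -> 0, female -> 1.
demo["Sex"] = demo["Sex"].map({"male": 0, "female": 1})

# Target vector and feature matrix; identifier-like columns are left out.
y = demo["Survived"]
X = demo.drop(columns=["PassengerId", "Survived", "Name"])
print(X)
```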
In [47]: X.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Pclass 891 non-null int64
1 Sex 891 non-null int64
2 Age 714 non-null float64
3 SibSp 891 non-null int64
4 Parch 891 non-null int64
5 Fare 891 non-null float64
6 Embarked_C 891 non-null uint8
7 Embarked_Q 891 non-null uint8
8 Embarked_S 891 non-null uint8
dtypes: float64(2), int64(4), uint8(3)
memory usage: 44.5 KB
In [48]: X.isnull().values.any()
Out[48]: True
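`isnull().values.any()` only says that something is missing. A short sketch of how the missing values could be counted per column before deciding how to handle them (the `demo` frame and the placement of its NaN values are made up for illustration):

```python
import numpy as np
import pandas as pd

# Made-up frame: two of five Age values are missing, Fare is complete.
demo = pd.DataFrame({
    "Age": [22.0, 38.0, np.nan, 35.0, np.nan],
    "Fare": [7.25, 71.2833, 7.925, 53.1, 8.05],
})

missing_count = demo.isnull().sum()    # number of NaN values per column
missing_share = demo.isnull().mean()   # fraction of NaN values per column
print(missing_count)
print(missing_share)

# Same idea applied to the info() counts above: 714 of 891 Age values present.
age_missing_share = (891 - 714) / 891
print(round(age_missing_share, 3))
```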
In [51]: # About 20% of the Age values are missing (714 of 891 are non-null), so dropping those rows would discard too much data
X["Age"] = X["Age"].fillna(X["Age"].mean()) # replace NaN with the mean age
In [52]: X.isnull().values.any()
Out[52]: False
Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
Out[55]: LogisticRegression()
Out[56]: 0.8089887640449438
Out[57]: 0.7541899441340782
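The cells that split the data, fit the model, and produced these two scores are not shown. The values equal 576/712 and 135/179, which is consistent with a training score and a test score from an 80/20 split of 891 rows, although the original split parameters are not confirmed. The sketch below shows one plausible shape for that step and also applies both suggestions from the warning (scale the data, raise `max_iter`). The data is synthetic, and `X_demo`, `y_demo`, and `random_state=42` are assumptions:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for X and y: 891 rows, 9 numeric features on very
# different scales. This is NOT the Titanic data.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(891, 9)) * np.array([1, 1, 15, 1, 1, 50, 1, 1, 1])
y_demo = (X_demo[:, 1] + X_demo[:, 5] / 50 + rng.normal(size=891) > 0).astype(int)

# Hold out 20% of the rows; random_state is an arbitrary choice.
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Standardize the features, then fit; max_iter is raised above the default of 100.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

print("train accuracy:", model.score(X_train, y_train))
print("test accuracy:", model.score(X_test, y_test))
```

Putting the scaler inside the pipeline means it is fitted on the training rows only, so the test rows do not leak into the preprocessing.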
In [59]: y.shape
Out[59]: (891,)
In [61]: X.shape
In [65]: X.shape
Out[65]: (891, 9)
In [67]: # About 20% of the Age values are missing (714 of 891 are non-null), so dropping those rows would discard too much data
X["Age"] = X["Age"].fillna(X["Age"].mean()) # replace NaN with the mean age
In [68]: X.isnull().values.any()
Out[68]: False
In [69]: X.head()
Out[69]:
Pclass Sex Age SibSp Parch Fare Embarked_C Embarked_Q Embarked_S
0 3 0 22.0 1 0 7.2500 0 0 1
1 1 1 38.0 1 0 71.2833 1 0 0
2 3 1 26.0 0 0 7.9250 0 0 1
3 1 1 35.0 1 0 53.1000 0 0 1
4 3 0 35.0 0 0 8.0500 0 0 1
In [71]: X.head()
Out[71]:
Pclass Sex Age SibSp Parch Fare Embarked_C Embarked_Q Embarked_S Family_size
0 3 0 22.0 1 0 7.2500 0 0 1 2
1 1 1 38.0 1 0 71.2833 1 0 0 2
2 3 1 26.0 0 0 7.9250 0 0 1 1
3 1 1 35.0 1 0 53.1000 0 0 1 2
4 3 0 35.0 0 0 8.0500 0 0 1 1
In [74]: X.head()
Out[74]:
Pclass Sex Age SibSp Parch Fare Embarked_C Embarked_Q Embarked_S Family_size Family_type
0 3 0 22.0 1 0 7.2500 0 0 1 2 1
1 1 1 38.0 1 0 71.2833 1 0 0 2 1
2 3 1 26.0 0 0 7.9250 0 0 1 1 0
3 1 1 35.0 1 0 53.1000 0 0 1 2 1
4 3 0 35.0 0 0 8.0500 0 0 1 1 0
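The cells that created `Family_size` and `Family_type` are not shown. The five rows above are consistent with `Family_size = SibSp + Parch + 1`, with a size of 1 mapped to type 0 and a size of 2 mapped to type 1. How larger families were bucketed cannot be read from the output, so the thresholds in `family_type` below are hypothetical:

```python
import pandas as pd

# SibSp and Parch for the five rows shown above.
demo = pd.DataFrame({"SibSp": [1, 1, 0, 1, 0], "Parch": [0, 0, 0, 0, 0]})

# Siblings/spouses + parents/children + the passenger.
demo["Family_size"] = demo["SibSp"] + demo["Parch"] + 1


def family_type(size: int) -> int:
    """Hypothetical bucketing: 0 = alone, 1 = small (2-4), 2 = large (5+)."""
    if size == 1:
        return 0
    if size <= 4:
        return 1
    return 2


demo["Family_type"] = demo["Family_size"].apply(family_type)
print(demo)
```

Once `Family_type` exists, `SibSp`, `Parch`, and `Family_size` carry largely the same information, which is a reasonable motivation for dropping them in the next cell.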
In [75]: X.drop(columns=['SibSp','Parch','Family_size'],inplace=True)
In [76]: X.head()
Out[76]:
Pclass Sex Age Fare Embarked_C Embarked_Q Embarked_S Family_type
0 3 0 22.0 7.2500 0 0 1 1
1 1 1 38.0 71.2833 1 0 0 1
2 3 1 26.0 7.9250 0 0 1 0
3 1 1 35.0 53.1000 0 0 1 1
4 3 0 35.0 8.0500 0 0 1 0
Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
Out[79]: LogisticRegression()
Out[80]: 0.8132022471910112
Out[81]: 0.7541899441340782