Decision_Tree-Random_Forest - Jupyter Notebook
About the data set (Employee data)
The dataset contains information about employees. The aim is to find which employees
might undergo attrition.
Attribute information:
BusinessTravel: How much travel is involved in the job for the employee: Non-Travel,
Travel_Frequently, Travel_Rarely
YearsAtCompany: Number of years the employee has worked with the company
Table of Contents
1. Decision tree
2. Random forest
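The notebook's first cell, In [1], did not make it into this export. A minimal sketch of the imports (and data load) that the later cells rely on; the aliases and file name are assumptions, not taken from the notebook:

import random

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib.colors import ListedColormap

from sklearn import metrics, tree
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (accuracy_score, classification_report,
                             confusion_matrix, roc_auc_score, roc_curve)
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# File name is an assumption; the actual source of df_employee is not shown.
df_employee = pd.read_csv('employee_data.csv')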
In [2]: plt.rcParams['figure.figsize'] = [16, 8]
Out[3]:
                          0   1   2   3
Age                      33  32  40  42
DistanceFromHome          3   4   9   7
EnvironmentSatisfaction   1   4   4   2
JobInvolvement            3   1   3   4
JobLevel                  1   3   1   2
JobSatisfaction           1   4   1   2
NumCompaniesWorked        1   1   3   6
PercentSalaryHike        11  11  14  16
PerformanceRating         3   3   3   3
YearsAtCompany           10  14   5   1
In [4]: df_employee.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1580 entries, 0 to 1579
Data columns (total 18 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Age 1580 non-null int64
1 Attrition 1580 non-null object
2 BusinessTravel 1580 non-null object
3 Department 1580 non-null object
4 DistanceFromHome 1580 non-null int64
5 EducationField 1580 non-null object
6 EnvironmentSatisfaction 1580 non-null int64
7 Gender 1580 non-null object
8 JobInvolvement 1580 non-null int64
9 JobLevel 1580 non-null int64
10 JobRole 1580 non-null object
11 JobSatisfaction 1580 non-null int64
12 MonthlyIncome 1580 non-null int64
13 NumCompaniesWorked 1580 non-null int64
14 OverTime 1580 non-null object
15 PercentSalaryHike 1580 non-null int64
16 PerformanceRating 1580 non-null int64
17 YearsAtCompany 1580 non-null int64
dtypes: int64(11), object(7)
memory usage: 222.3+ KB
1. Decision tree
Detect the outliers in the dataset and remove them using the IQR method, if any
are present.
In [5]: df_employee.boxplot()
        plt.xticks(rotation='vertical', fontsize=12)
        plt.show()
In [6]: q1 = df_employee.quantile(0.25)
        q3 = df_employee.quantile(0.75)
        iqr = q3 - q1
        # keep only rows with no value outside [q1 - 1.5*IQR, q3 + 1.5*IQR]
        df_employee = df_employee[~((df_employee < (q1 - 1.5*iqr)) |
                                    (df_employee > (q3 + 1.5*iqr))).any(axis=1)]
        df_employee = df_employee.reset_index(drop=True)
        df_employee.shape
Build a full model to predict if an employee will leave the company. Find the
three features that impact the model's predictions the most.
In [7]: df_target = df_employee['Attrition']
        df_feature = df_employee.drop('Attrition', axis=1)
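Cell In [8] is also missing from the export. Because the prediction frames later on (Out[20], Out[21]) hold 0/1 labels while Attrition is an object column, the cell presumably binarised the target; a sketch, assuming 'Yes'/'No' values:

# Assumed content of the missing In [8]: encode the target as 1/0
df_target = df_target.map({'Yes': 1, 'No': 0})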
localhost:8888/notebooks/Decision_Tree-Random_Forest.ipynb 5/12
11/24/24, 2:31 PM Decision_Tree-Random_Forest - Jupyter Notebook
In [9]: df_num = df_feature.select_dtypes(include=[np.number])
        df_cat = df_feature.select_dtypes(include=['object'])  # np.object is deprecated
        dummy_var = pd.get_dummies(data=df_cat, drop_first=True)
        X = pd.concat([df_num, dummy_var], axis=1)
        X.head()
Out[9]:
   Age  DistanceFromHome  EnvironmentSatisfaction  JobInvolvement  JobLevel  ...
0   33                 3                        1               3         1  ...
1   32                 4                        4               1         3  ...
2   40                 9                        4               3         1  ...
3   42                 7                        2               4         2  ...
4   43                27                        3               3         3  ...
In [10]: # test_size and random_state values were truncated in the export; assumed here
         X_train, X_test, y_train, y_test = train_test_split(X, df_target,
                                                             test_size=0.3, random_state=10)
         dt_full = DecisionTreeClassifier(random_state=50).fit(X_train, y_train)
         y_pred_full = dt_full.predict(X_test)
         imp_features = pd.DataFrame({'Features': X_train.columns,
                                      'Importance': dt_full.feature_importances_})
         imp_features.sort_values(by='Importance', ascending=False)
Out[10]:
Features Importance
6 MonthlyIncome 0.147614
29 OverTime_Yes 0.099595
1 DistanceFromHome 0.096689
10 YearsAtCompany 0.094220
8 PercentSalaryHike 0.094219
0 Age 0.089836
2 EnvironmentSatisfaction 0.059804
5 JobSatisfaction 0.054751
7 NumCompaniesWorked 0.050974
3 JobInvolvement 0.040995
20 Gender_Male 0.040213
18 EducationField_Other 0.016882
4 JobLevel 0.015257
17 EducationField_Medical 0.009292
14 Department_Sales 0.008895
11 BusinessTravel_Travel_Frequently 0.007876
12 BusinessTravel_Travel_Rarely 0.003221
16 EducationField_Marketing 0.003092
23 JobRole_Manager 0.000000
9 PerformanceRating 0.000000
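Per the table above, the three features that impact the prediction the most are MonthlyIncome, OverTime_Yes and DistanceFromHome. They can also be pulled programmatically from imp_features:

# Top three features by importance
imp_features.sort_values(by='Importance', ascending=False).head(3)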
Find the area under the receiver operating characteristic (ROC) curve for the
full model built in the previous question.
In [23]: fpr, tpr, thresholds = roc_curve(y_test, y_pred_full)
         plt.plot([0, 1], [0, 1], 'r--')
         plt.plot(fpr, tpr)
         # plt.text needs a string for s=; the original passed a tuple
         plt.text(x=0.02, y=0.8,
                  s='AUC Score: ' + str(round(metrics.roc_auc_score(y_test, y_pred_full), 4)))
         plt.grid(True)
         plt.show()
In [25]: cm = confusion_matrix(y_test, y_pred_full)
         # index/column labels and colormap colours were truncated in the export; assumed
         conf_matrix = pd.DataFrame(data=cm,
                                    columns=['Predict Attr:No', 'Predict Attr:Yes'],
                                    index=['Actual Attr:No', 'Actual Attr:Yes'])
         sns.heatmap(conf_matrix, annot=True, fmt='d',
                     cmap=ListedColormap(['lightskyblue', 'lightgreen']))
         plt.show()
In [26]: tn = cm[0][0]
         tp = cm[1][1]
         fp = cm[0][1]
         fn = cm[1][0]
         total = tn + tp + fp + fn
         correct_classify = ((tn + tp) / total) * 100   # accuracy as a percentage
         correct_classify
Out[26]: 89.48545861297539
In [ ]: mis_classify = ((fn + fp) / total) * 100   # misclassification rate (%)
        mis_classify
In [ ]: specificity = tn / (tn + fp)   # true-negative rate
        specificity
In [ ]: sensitivity = tp / (tp + fn)   # true-positive rate (recall)
        sensitivity
In [13]: dt_2 = DecisionTreeClassifier(max_leaf_nodes=5, random_state=50).fit(X_train, y_train)
         tree.plot_tree(dt_2, class_names=['No', 'Yes'],
                        feature_names=X_train.columns.tolist())
         plt.show()
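A quick holdout check on the pruned tree can be added here; the variable name y_pred_pruned is ours, not the notebook's:

# Holdout accuracy of the 5-leaf pruned tree
y_pred_pruned = dt_2.predict(X_test)
print('Pruned-tree accuracy:', accuracy_score(y_test, y_pred_pruned))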
Build 5 decision trees, each with 20 random features, and predict attrition on
the test set with each model.
In [14]: columns = list(X_train.columns)
         # random.sample draws 20 distinct features; the original used random.choices,
         # which samples with replacement and can repeat columns
         sample_features = random.sample(columns, k=20)
         dt_model_1 = DecisionTreeClassifier(random_state=20).fit(X_train[sample_features], y_train)
         y_pred_1 = dt_model_1.predict(X_test[sample_features])
In [16]: sample_features = random.sample(columns, k=20)
         dt_model_2 = DecisionTreeClassifier(random_state=20).fit(X_train[sample_features], y_train)
         y_pred_2 = dt_model_2.predict(X_test[sample_features])
In [17]: sample_features = random.sample(columns, k=20)
         dt_model_3 = DecisionTreeClassifier(random_state=20).fit(X_train[sample_features], y_train)
         y_pred_3 = dt_model_3.predict(X_test[sample_features])
In [18]: sample_features = random.sample(columns, k=20)
         dt_model_4 = DecisionTreeClassifier(random_state=20).fit(X_train[sample_features], y_train)
         y_pred_4 = dt_model_4.predict(X_test[sample_features])
In [19]: sample_features = random.sample(columns, k=20)
         dt_model_5 = DecisionTreeClassifier(random_state=20).fit(X_train[sample_features], y_train)
         y_pred_5 = dt_model_5.predict(X_test[sample_features])
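The five near-identical cells above could also be collapsed into a loop; a sketch (the predictions dict is our construction, not the notebook's):

# One feature subset and one tree per iteration, mirroring cells In [14]-[19]
predictions = {}
for i in range(1, 6):
    feats = random.sample(columns, k=20)
    model = DecisionTreeClassifier(random_state=20).fit(X_train[feats], y_train)
    predictions[f'y_pred_{i}'] = model.predict(X_test[feats])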
In [20]: model_predictions_df = pd.DataFrame({'y_pred_1': y_pred_1, 'y_pred_2': y_pred_2,
                                              'y_pred_3': y_pred_3, 'y_pred_4': y_pred_4,
                                              'y_pred_5': y_pred_5})
         model_predictions_df.head()
Out[20]:
y_pred_1 y_pred_2 y_pred_3 y_pred_4 y_pred_5
0 0 0 0 0 0
1 0 0 0 1 0
2 1 1 1 1 1
3 0 0 1 0 0
4 1 0 0 0 1
localhost:8888/notebooks/Decision_Tree-Random_Forest.ipynb 10/12
11/24/24, 2:31 PM Decision_Tree-Random_Forest - Jupyter Notebook
In [21]: votes = []
         for i in range(model_predictions_df.shape[0]):
             # majority vote: the most frequent prediction across the five models
             votes.append(model_predictions_df.iloc[i].value_counts().index[0])
         model_predictions_df['Voted_Results'] = votes
         model_predictions_df.head()
Out[21]:
y_pred_1 y_pred_2 y_pred_3 y_pred_4 y_pred_5 Voted_Results
0 0 0 0 0 0 0
1 0 0 0 1 0 0
2 1 1 1 1 1 1
3 0 0 1 0 0 0
4 1 0 0 0 1 0
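The row-wise voting loop is equivalent to taking the mode across the five prediction columns; a vectorised alternative:

# Vectorised majority vote over the five prediction columns
pred_cols = ['y_pred_1', 'y_pred_2', 'y_pred_3', 'y_pred_4', 'y_pred_5']
model_predictions_df['Voted_Results'] = model_predictions_df[pred_cols].mode(axis=1)[0]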
Consider the values of "Voted_Results" as our new predictions, store them in a
variable "new_y_pred", and find the accuracy and the ROC-AUC score using
new_y_pred.
In [22]: new_y_pred = model_predictions_df['Voted_Results']
         print("Accuracy:", accuracy_score(y_test, new_y_pred))
         print("ROC-AUC Score:", roc_auc_score(y_test, new_y_pred))
Accuracy: 0.9485458612975392
ROC-AUC Score: 0.9548693830608723
2. Random Forest
Build a full random forest model to predict whether an employee will leave the
company, and generate a classification report.
In [24]: rf_model = RandomForestClassifier(n_estimators=10, random_state=50).fit(X_train, y_train)
         y_pred = rf_model.predict(X_test)
         print(classification_report(y_test, y_pred))
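Hard labels give the ROC curve a single operating point; the forest's averaged class probabilities trace a fuller curve. A sketch for comparison with the hand-built ensemble above:

# Probability of the positive class, averaged over the forest's trees
y_prob_rf = rf_model.predict_proba(X_test)[:, 1]
print('RF ROC-AUC (probabilities):', roc_auc_score(y_test, y_prob_rf))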