Decision Tree and Random Forest
About the data set (Employee data)
The dataset contains information about employees. The aim is to find which employees
might undergo attrition.
Attribute information:

Age: Age of the employee

BusinessTravel: How much travel is involved in the job for the employee: No Travel, Travel Frequently, Travel Rarely

Department: Department of the employee: Human Resources, Research & Development, Sales

DistanceFromHome: Number of miles of daily commute for the employee

EducationField: Employee education field: Human Resources, Life Sciences, Marketing, Medical Sciences, Technical, Others

EnvironmentSatisfaction: Satisfaction of employee with office environment

Gender: Employee gender

JobInvolvement: Job involvement rating

JobLevel: Job level for employee designation
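
JobRole: Employee's job role, e.g. Research Scientist, Sales Executive, Laboratory Technician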

JobSatisfaction: Employee job satisfaction rating

MonthlyIncome: Employee monthly salary
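
NumCompaniesWorked: Number of companies the employee has worked for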

OverTime: Whether the employee works overtime: Yes or No

PercentSalaryHike: Percent increase in salary

PerformanceRating: Overall employee performance rating

YearsAtCompany: Number of years the employee has worked with the company

Attrition: Employee leaving the company: Yes or No

Table of Contents
1. Decision tree
2. Random forest

Import the required libraries


In [1]: import pandas as pd
        import numpy as np
        import matplotlib.pyplot as plt
        from matplotlib.colors import ListedColormap
        import seaborn as sns
        from warnings import filterwarnings
        filterwarnings('ignore')
        pd.options.display.max_columns = None
        pd.options.display.max_rows = None
        pd.options.display.float_format = '{:.6f}'.format
        from sklearn.model_selection import train_test_split
        from sklearn.preprocessing import StandardScaler
        from sklearn.utils import resample
        from sklearn.utils import shuffle
        from sklearn import metrics
        from sklearn.metrics import classification_report
        from sklearn.tree import DecisionTreeClassifier
        from sklearn.ensemble import RandomForestClassifier
        from sklearn import tree
        from sklearn.model_selection import GridSearchCV
        from sklearn.metrics import accuracy_score
        from sklearn.metrics import roc_curve
        from sklearn.metrics import roc_auc_score
        from sklearn.metrics import confusion_matrix
        from sklearn.model_selection import cross_val_score
        import pydotplus
        from IPython.display import Image
        import random

In [2]: plt.rcParams['figure.figsize'] = [16, 8]

Load the CSV file


In [3]: df_employee = pd.read_csv('emp_attrition.csv')
        df_employee.head().transpose()

Out[3]:
                                         0                       1                      2                       3
Age                                     33                      32                     40                      42
Attrition                              Yes                     Yes                    Yes                      No
BusinessTravel           Travel_Frequently           Travel_Rarely          Travel_Rarely           Travel_Rarely
Department                           Sales  Research & Development Research & Development  Research & Development
DistanceFromHome                         3                       4                      9                       7
EducationField               Life Sciences                 Medical          Life Sciences                 Medical
EnvironmentSatisfaction                  1                       4                      4                       2
Gender                                Male                    Male                   Male                  Female
JobInvolvement                           3                       1                      3                       4
JobLevel                                 1                       3                      1                       2
JobRole                 Research Scientist         Sales Executive  Laboratory Technician      Research Scientist
JobSatisfaction                          1                       4                      1                       2
MonthlyIncome                         3348                   10400                   2018                    2372
NumCompaniesWorked                       1                       1                      3                       6
OverTime                               Yes                      No                     No                     Yes
PercentSalaryHike                       11                      11                     14                      16
PerformanceRating                        3                       3                      3                       3
YearsAtCompany                          10                      14                      5                       1

(The fifth column of head() is cut off in the source.)

In [4]: df_employee.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1580 entries, 0 to 1579
Data columns (total 18 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Age 1580 non-null int64
1 Attrition 1580 non-null object
2 BusinessTravel 1580 non-null object
3 Department 1580 non-null object
4 DistanceFromHome 1580 non-null int64
5 EducationField 1580 non-null object
6 EnvironmentSatisfaction 1580 non-null int64
7 Gender 1580 non-null object
8 JobInvolvement 1580 non-null int64
9 JobLevel 1580 non-null int64
10 JobRole 1580 non-null object
11 JobSatisfaction 1580 non-null int64
12 MonthlyIncome 1580 non-null int64
13 NumCompaniesWorked 1580 non-null int64
14 OverTime 1580 non-null object
15 PercentSalaryHike 1580 non-null int64
16 PerformanceRating 1580 non-null int64
17 YearsAtCompany 1580 non-null int64
dtypes: int64(11), object(7)
memory usage: 222.3+ KB
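
The data has 1,580 rows and 18 columns, with no missing values in any column.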

Let's begin with some hands-on practice exercises

1. Decision tree

Detect the outliers in the dataset. Remove the outliers using the IQR method, if present.


In [5]: df_employee.boxplot()
        plt.xticks(rotation='vertical', fontsize=12)
        plt.show()

In [6]: q1 = df_employee.quantile(0.25)
        q3 = df_employee.quantile(0.75)
        iqr = q3 - q1
        # drop any row with a value outside [q1 - 1.5*iqr, q3 + 1.5*iqr]
        # (the filter line is truncated in the source; the upper bound is reconstructed)
        df_employee = df_employee[~((df_employee < (q1 - 1.5 * iqr)) |
                                    (df_employee > (q3 + 1.5 * iqr))).any(axis=1)]
        df_employee = df_employee.reset_index(drop=True)
        df_employee.shape

Out[6]: (1487, 18)
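
The filter removed 1580 - 1487 = 93 rows that contained at least one outlier.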

Build a full model to predict if an employee will leave the company. Find the three features that impact the model prediction the most.

In [7]: df_target = df_employee['Attrition']
        df_feature = df_employee.drop('Attrition', axis=1)

In [8]: # encode the target: Yes -> 1, No -> 0
        for i in range(len(df_target)):
            if df_target[i] == 'Yes':
                df_target[i] = 1
            else:
                df_target[i] = 0
        df_target = df_target.astype('int')
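
The element-wise loop above works (its chained-assignment warnings are silenced by filterwarnings), but a vectorised pandas alternative is shorter and avoids the warning entirely; a minimal sketch:

In [ ]: df_target = df_employee['Attrition'].map({'Yes': 1, 'No': 0})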


In [9]: df_num = df_feature.select_dtypes(include=[np.number])
        df_cat = df_feature.select_dtypes(include=['object'])  # np.object is deprecated in newer NumPy
        dummy_var = pd.get_dummies(data=df_cat, drop_first=True)
        X = pd.concat([df_num, dummy_var], axis=1)
        X.head()

Out[9]:
   Age  DistanceFromHome  EnvironmentSatisfaction  JobInvolvement  JobLevel  ...
0   33                 3                        1               3         1
1   32                 4                        4               1         3
2   40                 9                        4               3         1
3   42                 7                        2               4         2
4   43                27                        3               3         3

(The remaining columns of X.head() are cut off in the source.)


In [10]: # the split arguments are truncated in the source; test_size=0.3 matches the
         # 447-row test set seen later, and the seed is assumed
         X_train, X_test, y_train, y_test = train_test_split(X, df_target,
                                                             test_size=0.3, random_state=50)
         dt_full = DecisionTreeClassifier(random_state=50).fit(X_train, y_train)
         y_pred_full = dt_full.predict(X_test)
         imp_features = pd.DataFrame({'Features': X_train.columns,
                                      'Importance': dt_full.feature_importances_})
         imp_features.sort_values(by='Importance', ascending=False)

Out[10]:
                             Features  Importance
6                       MonthlyIncome    0.147614
29                       OverTime_Yes    0.099595
1                    DistanceFromHome    0.096689
10                     YearsAtCompany    0.094220
8                   PercentSalaryHike    0.094219
0                                 Age    0.089836
2             EnvironmentSatisfaction    0.059804
5                     JobSatisfaction    0.054751
7                  NumCompaniesWorked    0.050974
3                      JobInvolvement    0.040995
20                        Gender_Male    0.040213
19    EducationField_Technical Degree    0.026531
18               EducationField_Other    0.016882
21            JobRole_Human Resources    0.015804
4                            JobLevel    0.015257
26         JobRole_Research Scientist    0.010973
17             EducationField_Medical    0.009292
14                   Department_Sales    0.008895
24     JobRole_Manufacturing Director    0.008432
11   BusinessTravel_Travel_Frequently    0.007876
22      JobRole_Laboratory Technician    0.004835
12       BusinessTravel_Travel_Rarely    0.003221
16           EducationField_Marketing    0.003092
28       JobRole_Sales Representative    0.000000
27            JobRole_Sales Executive    0.000000
13  Department_Research & Development    0.000000
25          JobRole_Research Director    0.000000
23                    JobRole_Manager    0.000000
9                   PerformanceRating    0.000000
15       EducationField_Life Sciences    0.000000
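
The three features that impact the prediction the most are therefore MonthlyIncome, OverTime_Yes, and DistanceFromHome.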

Find the area under the receiver operating characteristic (ROC) curve for the full model built in the previous question.


In [23]: fpr, tpr, thresholds = roc_curve(y_test, y_pred_full)
         plt.plot([0, 1], [0, 1], 'r--')
         plt.plot(fpr, tpr)
         plt.text(x=0.02, y=0.8,
                  s='AUC Score: ' + str(round(metrics.roc_auc_score(y_test, y_pred_full), 4)))
         plt.grid(True)
         plt.show()
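
Note that roc_curve above receives hard 0/1 predictions, which yields only a single operating point on the curve. Passing class probabilities traces the full curve; a minimal sketch using the fitted dt_full from above (for a fully grown tree the leaf probabilities are mostly 0 or 1, so the difference is most visible with pruned trees or ensembles):

In [ ]: y_prob_full = dt_full.predict_proba(X_test)[:, 1]  # probability of class 1 (attrition)
        fpr, tpr, thresholds = roc_curve(y_test, y_prob_full)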

Plot a confusion matrix for the full model built above.

In [25]: cm = confusion_matrix(y_test, y_pred_full)
         # the label lists and the colour list are truncated in the source; the 'Yes'
         # labels, the index labels, and the single colour below are assumed
         conf_matrix = pd.DataFrame(data=cm,
                                    columns=['Predict Attr:No', 'Predict Attr:Yes'],
                                    index=['Actual Attr:No', 'Actual Attr:Yes'])
         sns.heatmap(conf_matrix, annot=True, fmt='d', cmap=ListedColormap(['lightskyblue']))
         plt.show()

Calculate the specificity, sensitivity, and the percentages of misclassified and correctly classified observations. What can you say about the model by looking at the sensitivity and specificity values? Is this a good model?


In [26]: # in sklearn's confusion matrix, rows are actual and columns are predicted classes
         tn = cm[0][0]
         tp = cm[1][1]
         fp = cm[0][1]
         fn = cm[1][0]
         total = tn + tp + fp + fn
         correct_classify = ((tn + tp) / total) * 100
         correct_classify

Out[26]: 89.48545861297539

In [ ]: mis_classify = ((fn + fp) / total) * 100
        mis_classify

In [ ]: specificity = tn / (tn + fp)
        specificity

In [ ]: sensitivity = tp / (tp + fn)
        sensitivity
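
As a guide for reading these values: sensitivity and specificity close to each other (and to the ~89.5% of correctly classified observations) suggest the model treats both classes evenly, while a noticeably lower sensitivity would mean the model misses employees who actually leave, which is the costlier error for an attrition model.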

Build and plot a decision tree with a maximum of 5 terminal nodes.

In [13]: dt_2 = DecisionTreeClassifier(max_leaf_nodes=5, random_state=50).fit(X_train, y_train)
         tree.plot_tree(dt_2, class_names=['No', 'Yes'], feature_names=X_train.columns.tolist())
         plt.show()
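
pydotplus and Image are imported at the top of the notebook but never used; a sketch of how they could render the same tree as an inline image (assumes the Graphviz binaries are installed):

In [ ]: dot_data = tree.export_graphviz(dt_2, out_file=None, filled=True,
                                        feature_names=list(X_train.columns),
                                        class_names=['No', 'Yes'])
        graph = pydotplus.graph_from_dot_data(dot_data)
        Image(graph.create_png())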

Build 5 decision trees, each with 20 random features. Also predict the attrition for the test set with each model.

In [14]: columns = list(X_train.columns)
         # note: random.choices samples WITH replacement, so the 20 features can contain
         # duplicates; random.sample(columns, k=20) would draw 20 distinct features
         sample_features = random.choices(columns, k=20)
         dt_model_1 = DecisionTreeClassifier(random_state=20).fit(X_train[sample_features], y_train)
         y_pred_1 = dt_model_1.predict(X_test[sample_features])

localhost:8888/notebooks/Decision_Tree-Random_Forest.ipynb 9/12
11/24/24, 2:31 PM Decision_Tree-Random_Forest - Jupyter Notebook

In [16]: sample_features = random.choices(columns, k=20)
         dt_model_2 = DecisionTreeClassifier(random_state=20).fit(X_train[sample_features], y_train)
         y_pred_2 = dt_model_2.predict(X_test[sample_features])

In [17]: sample_features = random.choices(columns, k=20)
         dt_model_3 = DecisionTreeClassifier(random_state=20).fit(X_train[sample_features], y_train)
         y_pred_3 = dt_model_3.predict(X_test[sample_features])

In [18]: sample_features = random.choices(columns, k=20)
         dt_model_4 = DecisionTreeClassifier(random_state=20).fit(X_train[sample_features], y_train)
         y_pred_4 = dt_model_4.predict(X_test[sample_features])

In [19]: sample_features = random.choices(columns, k=20)
         dt_model_5 = DecisionTreeClassifier(random_state=20).fit(X_train[sample_features], y_train)
         y_pred_5 = dt_model_5.predict(X_test[sample_features])
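
The five near-identical cells above could be collapsed into a loop; a sketch under the same seed and sampling scheme (the predictions dictionary is illustrative, not from the source):

In [ ]: predictions = {}
        for m in range(1, 6):
            feats = random.choices(columns, k=20)
            model = DecisionTreeClassifier(random_state=20).fit(X_train[feats], y_train)
            predictions[f'y_pred_{m}'] = model.predict(X_test[feats])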

Create a new dataframe "model_predictions_df" by combining the predictions made in the previous question. The dataframe will have 5 columns, one for each of the decision tree models built above.

In [20]: model_predictions_df = pd.DataFrame({'y_pred_1': y_pred_1, 'y_pred_2': y_pred_2,
                                              'y_pred_3': y_pred_3, 'y_pred_4': y_pred_4,
                                              'y_pred_5': y_pred_5})
         model_predictions_df.head()

Out[20]:
   y_pred_1  y_pred_2  y_pred_3  y_pred_4  y_pred_5
0         0         0         0         0         0
1         0         0         0         1         0
2         1         1         1         1         1
3         0         0         1         0         0
4         1         0         0         0         1

Create a new column "Voted_Results" in the dataframe "model_predictions_df" that contains the most frequently occurring value (the mode) of the 5 columns in the dataframe (row-wise).


In [21]: votes = []
         for i in range(model_predictions_df.shape[0]):
             votes.append(model_predictions_df.iloc[i].value_counts().index[0])
         model_predictions_df['Voted_Results'] = votes
         model_predictions_df.head()

Out[21]:
   y_pred_1  y_pred_2  y_pred_3  y_pred_4  y_pred_5  Voted_Results
0         0         0         0         0         0              0
1         0         0         0         1         0              0
2         1         1         1         1         1              1
3         0         0         1         0         0              0
4         1         0         0         0         1              0
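
The row-wise vote can also be taken directly with pandas instead of a loop; a minimal sketch (the column list is spelled out so the vote column itself is not included):

In [ ]: pred_cols = ['y_pred_1', 'y_pred_2', 'y_pred_3', 'y_pred_4', 'y_pred_5']
        model_predictions_df['Voted_Results'] = model_predictions_df[pred_cols].mode(axis=1)[0].astype(int)

With an odd number of binary voters a tie is impossible, so the first mode is always the majority vote.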

Consider the values of "Voted_Results" as our new predictions: store them in a variable "new_y_pred" and find the accuracy and the ROC-AUC score using new_y_pred.

In [22]: new_y_pred = model_predictions_df['Voted_Results']
         print("Accuracy:", accuracy_score(y_test, new_y_pred))
         print("ROC-AUC Score:", roc_auc_score(y_test, new_y_pred))

Accuracy: 0.9485458612975392
ROC-AUC Score: 0.9548693830608723

2. Random Forest

Build a random forest full model to predict whether an employee will leave the company, and generate a classification report.

In [24]: rf_model = RandomForestClassifier(n_estimators=10, random_state=50).fit(X_train, y_train)
         y_pred = rf_model.predict(X_test)
         print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.98      0.92      0.94       259
           1       0.89      0.97      0.93       188

    accuracy                           0.94       447
   macro avg       0.93      0.94      0.94       447
weighted avg       0.94      0.94      0.94       447
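
GridSearchCV is imported at the top of the notebook but never used; a sketch of how it could tune the forest (the parameter grid and scoring choice are illustrative, not from the source):

In [ ]: param_grid = {'n_estimators': [10, 50, 100],
                      'max_depth': [None, 5, 10],
                      'max_features': ['sqrt', 'log2']}
        grid = GridSearchCV(RandomForestClassifier(random_state=50),
                            param_grid, scoring='accuracy', cv=5)
        grid.fit(X_train, y_train)
        print(grid.best_params_, grid.best_score_)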
