IV. DATA PRE-PROCESSING

A) LIBRARIES USED:

1) Libraries for Import: We use the following powerful and useful tools for the analysis and prediction of the attrition rate. The libraries are:

a) NumPy: One of the most significant Python tools for computational mathematics and science.

b) Pandas: A library built for quick and simple data frame processing.

c) Matplotlib: A Python package that produces complex graphs and charts such as bar charts, pie charts, and more.

d) Scikit-Learn: The scikit-learn package provides a variety of supervised and unsupervised machine learning methods. The main goal of these machine learning tools is data modeling.

2) Read Dataset: Read the dataset in .csv format using the pandas function read_csv().

3) Create dataset as Data Frame: Create a data frame from the read dataset object. This data frame is used in the subsequent pre-processing steps.
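The following short sketch illustrates steps 1)-3); the file name "hr_attrition.csv" is an assumed placeholder, not a name given in the paper:

# Steps 1)-3): import the libraries, read the .csv file, and hold the result
# as a pandas DataFrame for the later pre-processing steps.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("hr_attrition.csv")   # step 2: read the dataset (assumed file name)
print(df.shape)                        # number of records and attributes
print(df.dtypes)                       # attribute names and data types (cf. Fig. 1)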
B) DATA PREPROCESSING

Pre-processing means cleaning the data, normalizing the dataset, and applying changes to the data. These steps are performed to get the dataset into a state that enables analysis in the later phases [1]. In data preprocessing the following steps were performed:

1) Investigate Dataset Properties: The goal of data exploration was to understand the relationships between the factors and to examine the issue at hand [1]. This step is useful for spotting common dataset problems such as null values, outliers, and redundancies. Figure 1 depicts the columns of the dataset and their data types.

FIG: 1 ATTRIBUTES AND DATA TYPES

2) Data Preparation: This includes the procedures for exploring, pre-processing, and configuring the data before data modeling. It was carried out in order to familiarize ourselves with the data and learn more about it. It required converting the data into a structure that would make further analysis easier. This is usually the stage in the analytics lifecycle that requires the most effort and iterations [1]. The following are the main processes carried out for data preparation (a code sketch of these steps appears after the list):

a) Feature Reduction: This phase was important in deciding which features in the dataset should be kept and which should be transformed or removed, so as to decide which attributes in the data will be helpful for analysis in the later stages. The decision as to which attribute is important for the attrition forecast and which is not was made here. The following are some examples of the criteria on which attributes were excluded from further analysis:

i) Attributes with only one unique value: The following attributes take the same value for every record. The "Employee count" attribute is "1" for every employee. The "Standard working hours" attribute gives an employee's standard working hours and is "80" for every entry. The "Over 18 yrs of age" attribute confirms whether an employee meets the age requirement (to be over 18) and is "Yes" for every entry. Since all of these attributes have only one unique value, we drop them from the dataset.
ii) Data cleaning: To guarantee better data quality, abnormalities, usually missing values, duplicate data, and outliers, are removed. In our dataset there are no missing values or outliers.

iii) Categorical to Numerical: Since categorical variables are not accepted as input by standard libraries, these values must be transformed into numeric form. As seen in Figure 2, this was accomplished by using the Label Encoder method to convert nominal categorical variables into numerical labels. The range of a numerical label is always 0 to n_classes-1.

Attrition    Attrition (encoded)
Yes          1
No           0
Yes          1

FIG: 2 ATTRITION ATTRIBUTE LABEL ENCODING
iv) Dataset balancing: In the given dataset there are more records with the label "Attrition" set to "0" than records with "Attrition" set to "1", causing an imbalance. Figure 3 is a bar graph that displays the number of records in the dataset for each label.
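A short sketch of preprocessing steps a) and i)-iv) above, assuming the DataFrame df loaded earlier. The Label Encoder usage follows the text; the SMOTE oversampling call is only one possible way to balance the classes (cf. [12]), since the balancing technique is not named here:

from sklearn.preprocessing import LabelEncoder
from imblearn.over_sampling import SMOTE   # assumed balancing method (cf. [12])

# a)/i) Feature reduction: drop attributes that take a single unique value
# (e.g. the employee-count, standard-hours and over-18 attributes).
constant_cols = [c for c in df.columns if df[c].nunique() == 1]
df = df.drop(columns=constant_cols)

# ii) Data cleaning: confirm there are no missing values and drop duplicates.
assert df.isnull().sum().sum() == 0
df = df.drop_duplicates()

# iii) Categorical to numerical: label-encode every object-typed column;
# each encoded column takes values in 0..n_classes-1 (cf. Fig. 2).
for col in df.select_dtypes(include="object").columns:
    df[col] = LabelEncoder().fit_transform(df[col])

# iv) Dataset balancing: inspect the label counts (cf. Fig. 3), then oversample
# the minority class; SMOTE is an assumed choice, not stated in the paper.
df["Attrition"].value_counts().plot(kind="bar")
X = df.drop(columns=["Attrition"])
y = df["Attrition"]
X_bal, y_bal = SMOTE(random_state=42).fit_resample(X, y)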
C) Machine Learning Models for prediction:

After preparing the data, the next step is to use machine learning models for prediction; this involves an iterative process that aims to improve the accuracy of the models. There are several classification models that can be used for this purpose. In this project we train models using the Random Forest algorithm, logistic regression [6], SVM and Decision Tree. Among these algorithms, Random Forest gives the best accuracy compared to the other classification models.
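A minimal sketch of how the four classifiers could be trained and compared with scikit-learn, assuming the balanced features X_bal and labels y_bal from the preprocessing sketch; the hyperparameters and the 80/20 train/test split are assumptions, not values reported in the paper:

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Hold out part of the data so each model is scored on unseen records.
X_train, X_test, y_train, y_test = train_test_split(
    X_bal, y_bal, test_size=0.2, random_state=42)

models = {
    "Random Forest": RandomForestClassifier(random_state=42),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "SVM": SVC(),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
}

# Train each classifier and report its accuracy on the test split.
for name, model in models.items():
    model.fit(X_train, y_train)
    print(name, accuracy_score(y_test, model.predict(X_test)))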
D) Result Analysis:

When evaluating a machine learning model, it is important to use appropriate metrics to measure its performance. Commonly used metrics in machine learning include:

1) Accuracy: This metric measures the proportion of correct predictions made by the model out of the total number of predictions. It is calculated by dividing the number of correct predictions by the total number of predictions made.

2) Confusion Matrix: A matrix representation of the TP, TN, FP and FN values. Using this matrix we can also compute the accuracy score as (TP+TN)/(TP+TN+FP+FN) [10][11].
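A short sketch of both metrics using scikit-learn [10][11], assuming y_test and the fitted models from the training sketch above:

from sklearn.metrics import confusion_matrix, accuracy_score

y_pred = models["Random Forest"].predict(X_test)

# For binary labels 0/1 the confusion matrix unpacks as TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()

# Accuracy from the matrix, matching (TP+TN)/(TP+TN+FP+FN),
# and the same value via the library helper.
print((tp + tn) / (tp + tn + fp + fn))
print(accuracy_score(y_test, y_pred))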
Among all the classification algorithms, we observe that the Random Forest algorithm gives the best accuracy score and also predicts accurately on unseen data. Figure 7 shows how correctly the Random Forest algorithm predicts the target class.

VI. CONCLUSION

After assessing the performance of the four classification models, a significant finding was that when feature reduction is appropriately conducted before prediction, the accuracy of the classification models is consistently better than that of classification with feature selection. In particular, the Random Forest classifier with feature reduction achieved an accuracy score of 85.3%, while the Decision Tree classifier achieved 83%. The Random Forest model gives the best classification of true positive and true negative data. The methods described in this paper for analyzing and categorizing data can form a basis for improving data-driven decision-making processes. These techniques can unlock new insights from data and help organizations improve their operations. Implementing these methods can also contribute to a positive work culture and improve an organization's reputation in its industry.
REFERENCES
[1] Srivastava, Devesh Kumar, and Priyanka Nair. "Employee
attrition analysis using predictive techniques." International
Conference on Information and Communication
Technology for Intelligent Systems. Springer, Cham, 2017.
[2] S. S. Gavankar and S. D. Sawarkar, "Eager decision tree,"
2017 2nd International Conference for Convergence in
Technology (I2CT), Mumbai, 2017, pp. 837-840.
[3] Safavian, S. R., and Landgrebe, D., “A survey of decision tree
    classifier methodology”, IEEE Transactions on Systems,
    Man, and Cybernetics, Vol. 21, No. 3, May-June 1991.
[4] Shmilovici A. (2009) Support Vector Machines. In: Maimon
O., Rokach L. (eds) Data Mining and Knowledge Discovery
Handbook. Springer, Boston.
[5] Setiawan, I., et al. "HR analytics: Employee attrition analysis
    using logistic regression." IOP Conference Series: Materials
    Science and Engineering. Vol. 830. No. 3. IOP Publishing,
    2020.
[6] Schober, Patrick MD, PhD, MMedStat*; Vetter, Thomas R.
MD, MPH†. Logistic Regression in Medical Research.
Anesthesia & Analgesia 132(2):p 365-366, February 2021.
| DOI: 10.1213/ANE.0000000000005247
[7] Jayalekshmi J, Tessy Mathew, “Facial Expression
    Recognition and Emotion Classification System for
    Sentiment Analysis”, 2017 International Conference on
    Networks & Advances in Computational Technologies
    (NetACT), 20-22 July 2017, Trivandrum.
[8] Isabelle Guyon, Andre Elisseeff, “An Introduction to Variable
and Feature Selection”, Journal of Machine Learning
Research 3 (2003) 1157-1182.
[9] Ilan Reinstein, “Random Forest®, Explained”,
    kdnuggets.com, October 2017 [Online]. Available:
    https://www.kdnuggets.com/2017/10/randomforests-explained.html
[10] http://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html
[11] http://scikit-learn.org/stable/auto_examples/model_selection/plot_confusion_matrix.html
[12] Nitesh V. Chawla, Kevin W. Bowyer, Lawrence O. Hall, W.
Philip Kegelmeyer, “SMOTE: Synthetic Minority Over-
sampling Technique”, Journal of Artificial Intelligence
Research 16 (2002), 321 – 357
[13] Pavan Subhash, “IBM HR Analytics Employee Attrition &
    Performance”, www.kaggle.com, 2016 [Online]. Available:
    https://www.kaggle.com/pavansubhasht/ibm-hr-analytics-attrition-dataset
[14] Sperandei S. Understanding logistic regression analysis.
Biochem Med (Zagreb). 2014 Feb 15;24(1):12-8. doi:
10.11613/BM.2014.003. PMID: 24627710; PMCID:
PMC3936971.