Predicting Employee Attrition using Machine Learning Techniques
Abstract: Employee retention is a major issue for businesses, and forecasting attrition can help HR departments put proactive measures in place to lower turnover. Using methods including Random Forest, XGBoost, Decision Tree, Support Vector Classifier (SVC), Logistic Regression, K-Nearest Neighbors (KNN), and Naive Bayes, this project applies machine learning to study the important factors affecting employee departure. Trained on the IBM Analytics dataset of 1,500 records with 35 attributes, the models uncover trends in job satisfaction, workload, career development, and work-life balance. Deployed as an interactive Flask-based web application, the system provides capabilities for data upload, forecasting, and model performance visualization. By offering practical insights, this AI-driven solution helps HR staff identify at-risk employees early, address issues efficiently, and improve workforce stability. By applying predictive analytics to HR management, businesses can lower attrition costs, improve staff engagement, and build a more resilient workplace.
Keywords: Employee Attrition Prediction, Machine Learning, Random Forest, XGBoost, Decision Tree, Support Vector Classifier
(SVC), Logistic Regression, K-Nearest Neighbors (KNN), Naive Bayes, Flask.
How to Cite: N. Bhavana; Chukka Ganesh. (2025). Predicting Employee Attrition using Machine Learning Techniques.
International Journal of Innovative Science and Research Technology,
10(5), 1-10. https://doi.org/10.38124/ijisrt/25may172.
the risks toward loss of workforce stability in the organization.

The system utilizes a range of machine learning methodologies to generate accurate employee turnover predictions based on HR data. It is designed to recognize patterns that may not be immediately visible, enhancing the ability to detect early signs of potential attrition. The prediction process involves training on historical employee records to identify signals indicative of future resignations. By leveraging multiple classification techniques, the system improves its predictive capability, allowing it to accurately identify employees who are most likely to leave. This approach supports proactive decision-making and helps organizations implement effective retention strategies.

Since employee turnover has become an almost expected form of precariousness, the human resources function needs to give it more thorough attention. Employees leave for, or migrate to, other firms seeking better opportunities or terms. With these issues in mind, a new predictive approach to attrition is proposed, built on common-sense assumptions about how a typical employee responds to income and wages. High attrition is a drain on organizations, forcing them to replace a growing number of departing employees. If the model works well, it minimizes the costs that turnover imposes: the higher the turnover, the higher the overall cost, so improving the model yields savings by helping the organization retain its employees.

A user-friendly web application is developed using HTML, CSS, and JavaScript, allowing HR professionals to enter employee data and receive real-time predictions. The platform is designed to offer insights into workforce retention trends, empowering organizations to take data-driven, proactive measures to reduce turnover. The system also enables companies to refine their HR policies, improve employee engagement, and implement targeted retention strategies.

By integrating machine learning into HR analytics, this system provides valuable insights for businesses, supporting early identification of attrition risks and enabling timely interventions. It helps organizations make informed decisions regarding employee retention, ultimately leading to a more stable, productive, and engaged workforce.
III. MODULES AND THEIR IMPLEMENTATION

A. System Operations
Upload Data: HR professionals collect and upload a structured dataset containing various employee-related factors that influence attrition. This dataset includes details such as job roles, work experience, compensation, performance metrics, work-life balance indicators, and employee engagement levels. By gathering comprehensive data, the system can better analyze workforce trends and predict employee attrition accurately.
Data Preprocessing: Once the dataset is uploaded, it undergoes a series of data cleaning and preprocessing procedures. This involves handling missing or corrupted data, encoding categorical variables such as department and job role, normalizing or standardizing numerical features like salary and experience, and applying techniques to balance the dataset to prevent bias (a minimal preprocessing sketch follows this list). These preprocessing steps ensure that the machine learning models perform optimally and provide accurate predictions.
Model Building: The system trains multiple machine learning models to classify employees as likely to stay or leave based on historical HR data. These models are fine-tuned through hyperparameter optimization to improve prediction accuracy and overall reliability. Model selection is guided by evaluating performance metrics such as accuracy, precision, recall, and F1 score, ensuring that the most effective model is chosen for employee turnover prediction.
Model Prediction: The trained models analyze new employee data using the same preprocessing techniques applied to the training data. Based on this analysis, the system predicts whether an employee is at risk of leaving the organization. The model's decision is based on historical trends and key influencing factors identified in the dataset.
Result: The system presents the prediction results for each employee, along with confidence scores to indicate prediction reliability. Additionally, it provides detailed performance metrics, including confusion matrices, accuracy, precision, recall, and F1 score. Visual aids such as bar charts, histograms, and ROC curves are also incorporated to help HR professionals interpret the results more effectively.
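To make the preprocessing step concrete, the following is a minimal sketch of a scikit-learn preprocessing pipeline, assuming the uploaded file is a CSV carrying the IBM dataset's Attrition label; the file name, column handling, and the use of a ColumnTransformer are illustrative choices rather than the project's exact code.

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Load the uploaded HR dataset (the path is illustrative).
df = pd.read_csv("employee_attrition.csv")
X = df.drop(columns=["Attrition"])
y = (df["Attrition"] == "Yes").astype(int)   # 1 = likely to leave, 0 = likely to stay

categorical = X.select_dtypes(include="object").columns
numerical = X.select_dtypes(exclude="object").columns

# Impute missing values, one-hot encode categorical columns such as Department
# and JobRole, and standardize numerical columns such as MonthlyIncome.
preprocess = ColumnTransformer([
    ("cat", Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                      ("encode", OneHotEncoder(handle_unknown="ignore"))]), categorical),
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), numerical),
])
# Class imbalance can be addressed downstream, e.g. with class_weight="balanced".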
B. User Operations
Register: Users, primarily HR professionals, must first register with their credentials to create an account in the system.
Login: Registered users can log in using their credentials to securely access the system and perform data analysis.
Upload Data: Users can upload employee datasets containing relevant information about job satisfaction, salary, experience, and performance. The uploaded data should be in a structured format compatible with the system.
Model Page: This section displays the accuracy of each machine learning model used in the system. Users can compare different models.
Prediction Page: After uploading data, users can navigate to the prediction page, where they can view individual and overall attrition predictions (a minimal sketch of such an endpoint follows this list).
Viewing Results: Once the data is processed, users can view the classification results, including whether an employee is at risk of leaving.
Logout: To ensure data security and privacy, users can log out of the system after completing their tasks, securing their session and personal data.
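As a rough illustration of how the prediction page could be wired up in Flask, here is a minimal sketch; the route, template names, and the saved-model file are hypothetical stand-ins, not the authors' actual implementation.

import joblib
import pandas as pd
from flask import Flask, render_template, request

app = Flask(__name__)
# A previously trained pipeline (preprocessing + classifier), saved with joblib.
model = joblib.load("attrition_model.joblib")

@app.route("/predict", methods=["GET", "POST"])
def predict():
    if request.method == "POST":
        # The uploaded CSV must contain the same feature columns used in training.
        employees = pd.read_csv(request.files["file"])
        employees["AttritionRisk"] = model.predict_proba(employees)[:, 1].round(2)
        return render_template("results.html",
                               rows=employees.to_dict(orient="records"))
    return render_template("upload.html")

if __name__ == "__main__":
    app.run(debug=True)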
IV. MODELING AND ANALYSIS

A. Random Forest
Random Forest is an ensemble algorithm that builds many decision trees on random subsets of the IBM Analytics data, which consists of 35 features (such as job satisfaction and workload). The outcomes are aggregated through voting to produce the overall classification, which reduces the chance of overfitting while increasing accuracy. Through its feature importance scores, Random Forest helps infer the most salient attrition factors, such as work-life balance. Because of its robustness to noisy data and its handling of imbalanced classes, this classifier is well suited to predicting high-risk employees within the organization.
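A minimal sketch of how such a forest might be trained and its feature importances inspected, reusing the preprocessing pipeline sketched in Section III; the split ratio and hyperparameters are illustrative assumptions.

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

rf = Pipeline([
    ("prep", preprocess),   # ColumnTransformer from the preprocessing sketch
    ("clf", RandomForestClassifier(n_estimators=200, class_weight="balanced",
                                   random_state=42)),
])
rf.fit(X_train, y_train)
print("Random Forest accuracy:", rf.score(X_test, y_test))

# Rank the encoded features by importance to surface the main attrition drivers.
names = rf.named_steps["prep"].get_feature_names_out()
scores = rf.named_steps["clf"].feature_importances_
for name, score in sorted(zip(names, scores), key=lambda p: -p[1])[:10]:
    print(f"{name:40s} {score:.3f}")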
B. XGBoost
It is an advanced ensemble algorithm that sequentially builds decision trees, optimizing a loss function with gradient boosting. For attrition prediction, it processes features like career growth and workload, weighting errors to improve accuracy. Its regularization prevents overfitting, and its scalability handles the 1,500-record dataset efficiently. In the Flask app, XGBoost's high predictive power aids early identification of at-risk employees.
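A corresponding sketch using the xgboost Python package inside the same pipeline; the hyperparameter values are assumptions for illustration, not tuned results reported in the paper.

from xgboost import XGBClassifier
from sklearn.pipeline import Pipeline

xgb = Pipeline([
    ("prep", preprocess),
    ("clf", XGBClassifier(
        n_estimators=300, learning_rate=0.1, max_depth=4, subsample=0.8,
        reg_lambda=1.0,                        # L2 regularization against overfitting
        scale_pos_weight=float((y_train == 0).sum()) / (y_train == 1).sum(),
        eval_metric="logloss")),
])
xgb.fit(X_train, y_train)
print("XGBoost accuracy:", xgb.score(X_test, y_test))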
C. Decision Tree
A Decision Tree splits the IBM dataset into branches based on features like job satisfaction or work-life balance, creating a flowchart-like model to predict attrition. Each node represents a decision, and leaves indicate outcomes (stay/leave). Its simplicity and interpretability help HR visualize attrition patterns. In the project, Decision Trees provide clear rules for identifying at-risk employees. However, they are prone to overfitting, especially with noisy data, leading to poor generalization; pruning and limiting tree depth mitigate this.
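A short sketch of a depth-limited tree whose rules can be printed for HR review; the depth and leaf-size limits are illustrative.

from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier, export_text

tree = Pipeline([
    ("prep", preprocess),
    ("clf", DecisionTreeClassifier(max_depth=5, min_samples_leaf=20,
                                   class_weight="balanced", random_state=42)),
])
tree.fit(X_train, y_train)
print("Decision Tree accuracy:", tree.score(X_test, y_test))

# Human-readable rules that HR can inspect (truncated for brevity).
rules = export_text(tree.named_steps["clf"],
                    feature_names=list(tree.named_steps["prep"].get_feature_names_out()))
print(rules[:1000])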
D. Support Vector Classifier (SVC)
It finds the optimal hyperplane to separate employees who stay from those who leave, maximizing the margin between classes. For non-linear patterns in the dataset (e.g., complex interactions between career growth and workload), SVC uses kernels like RBF. In the project, SVC effectively classifies attrition risk but struggles with the dataset's size due to high computational costs. Scaling features is essential for performance.
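A minimal sketch of an RBF-kernel SVC; because the preprocessing pipeline already standardizes numerical features, the scaling requirement is satisfied inside the pipeline. Parameter values are illustrative.

from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

svc = Pipeline([
    ("prep", preprocess),                  # includes StandardScaler for numeric features
    ("clf", SVC(kernel="rbf", C=1.0, gamma="scale",
                class_weight="balanced", probability=True)),
])
svc.fit(X_train, y_train)
print("SVC accuracy:", svc.score(X_test, y_test))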
E. Logistic Regression
It predicts the probability of employee attrition by modeling the relationship between features (e.g., job satisfaction, work-life balance) and a binary outcome (stay/leave) using a logistic function. Its simplicity and interpretability make it ideal for HR to understand feature impacts via coefficients. In the project, it handles the 1,500-record dataset efficiently, providing baseline predictions for the Flask app. However, it assumes linear relationships, which may miss complex patterns. Regularization (e.g., L1, L2) prevents overfitting. Its fast training and deployment make it practical for real-time attrition risk assessment, supporting proactive HR interventions.
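A baseline sketch with L2 regularization (the C value and iteration limit are illustrative); the fitted coefficients indicate how each encoded feature pushes a prediction toward leaving or staying.

from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

logreg = Pipeline([
    ("prep", preprocess),
    ("clf", LogisticRegression(penalty="l2", C=1.0, max_iter=1000,
                               class_weight="balanced")),
])
logreg.fit(X_train, y_train)
print("Logistic Regression accuracy:", logreg.score(X_test, y_test))

# Positive coefficients push toward attrition (class 1), negative toward staying.
coefficients = logreg.named_steps["clf"].coef_[0]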
F. K-Nearest Neighbors (KNN)
It classifies employees as likely to remain or likely to leave by considering the 'k' employees most similar to the one in question with respect to workload, career growth, training, and so forth, using distance metrics such as Euclidean or Manhattan distance. KNN therefore captures local patterns in attrition data, but it is sensitive to feature scaling and noise. Another consideration is the computational cost, which increases with the size of the dataset and therefore affects the performance of the Flask app. Careful choice of 'k' is imperative. KNN is slow and not storage-efficient, which limits its scalability. It assists HR in identifying employees vulnerable to leaving by offering insights through similarity detection.
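A sketch of how 'k' and the distance metric might be chosen by cross-validation; the candidate values and the F1 scoring choice are assumptions for illustration.

from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

knn = Pipeline([
    ("prep", preprocess),                  # scaling matters for distance-based models
    ("clf", KNeighborsClassifier()),
])
grid = GridSearchCV(knn, {
    "clf__n_neighbors": [3, 5, 7, 9, 11],
    "clf__metric": ["euclidean", "manhattan"],
}, cv=5, scoring="f1")
grid.fit(X_train, y_train)
print("Best parameters:", grid.best_params_)
print("Cross-validated F1:", round(grid.best_score_, 3))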
G. Naive Bayes
It predicts attrition by calculating probabilities of staying or leaving based on features, assuming independence between them (e.g., job satisfaction and workload). Using Bayes' theorem, it is computationally efficient and excels with the categorical data in the IBM dataset. In the Flask app, it provides fast predictions, ideal for real-time HR use. However, its independence assumption may oversimplify complex relationships, reducing accuracy. It performs well with imbalanced classes, which are common in attrition data. Its simplicity aids deployment but limits capturing intricate patterns. Naive Bayes supports HR by offering quick, interpretable insights for early intervention.
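A small sketch using Gaussian Naive Bayes on the preprocessed features; because GaussianNB requires a dense matrix, the one-hot-encoded output is densified first (this step, like the model variant, is an illustrative assumption).

from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer

nb = Pipeline([
    ("prep", preprocess),
    # GaussianNB cannot consume sparse matrices, so densify if necessary.
    ("dense", FunctionTransformer(lambda m: m.toarray() if hasattr(m, "toarray") else m)),
    ("clf", GaussianNB()),
])
nb.fit(X_train, y_train)
print("Naive Bayes accuracy:", nb.score(X_test, y_test))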
V. RESULTS AND DISCUSSION

The Employee Attrition Prediction System proves helpful in predicting which employees are likely to leave the company through the use of a variety of machine learning models, ranging from distance-based and margin-based classifiers to tree-based methods. Prediction reliability can be improved further by implementing ensemble learning techniques such as a Voting Classifier or a Stacking Classifier, which combine the merits of the individual models to achieve better performance.
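A sketch of how the ensembles mentioned above could be assembled from the pipelines in Section IV; which base models to combine is an illustrative choice.

from sklearn.ensemble import StackingClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression

# Soft voting averages the predicted probabilities of the base pipelines.
voting = VotingClassifier(
    estimators=[("rf", rf), ("xgb", xgb), ("lr", logreg)], voting="soft")
voting.fit(X_train, y_train)
print("Voting accuracy:", voting.score(X_test, y_test))

# Stacking trains a meta-learner on the base models' out-of-fold predictions.
stacking = StackingClassifier(
    estimators=[("rf", rf), ("xgb", xgb), ("svc", svc)],
    final_estimator=LogisticRegression(max_iter=1000))
stacking.fit(X_train, y_train)
print("Stacking accuracy:", stacking.score(X_test, y_test))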
A study with the IBM HR Analytics Employee Attrition data showed that job satisfaction, work-life balance, compensation, and career development opportunities critically affected employee retention. These results echo earlier research and illustrate the complex, multidimensional nature of turnover.

The system makes it possible to design and deploy a simple, user-friendly web application hosted on Flask, where HR personnel can feed in staff data and receive predictions in real time. It also allows ad hoc identification of employees at risk of leaving and supports targeted interventions. The application additionally provides model performance visualizations, contextualizing the results and enabling data-informed decisions.
A. KNN Classifier
Table 1: Classification Report of KNN
The KNN model's AUC value of 0.87 on the ROC curve indicates a good ability to distinguish between employees prone to leave and those likely to stay. The classification report further shows that the model has 80% overall accuracy. Class 1 (attrition) has a very good recall of 0.91, correctly identifying employees at risk of leaving, though with a low precision of only 0.72, implying significant false positives. Conversely, Class 0 (no attrition) has an exceptionally high precision of 0.91 but a lower recall of 0.72, meaning that it missed some no-attrition instances. With the macro average and weighted average both at 0.80, the classifier performs similarly for both classes.
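The classification reports and AUC values discussed in this section can be produced with a short evaluation sketch like the one below, shown here for the KNN pipeline fitted earlier; the exact numbers depend on the train/test split and hyperparameters, so this is illustrative rather than a reproduction of Table 1.

from sklearn.metrics import classification_report, roc_auc_score

best_knn = grid.best_estimator_             # fitted KNN pipeline from Section IV
y_pred = best_knn.predict(X_test)
y_prob = best_knn.predict_proba(X_test)[:, 1]

print(classification_report(y_test, y_pred, target_names=["Stay (0)", "Leave (1)"]))
print("ROC AUC:", round(roc_auc_score(y_test, y_prob), 2))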
B. SVM Classifier
The AUC of 0.87 tells us that the model works quite well and can distinguish whether an employee is going to leave (class 1) or stay (class 0) with strong discrimination, as interpreted from the ROC curve for SVC (linear kernel). The classification report also states that the model has 79% accuracy; precision for the attrition class 1 is 0.76 with a recall of 0.77, indicating that it does reasonably well at identifying employees at risk of leaving. Class 0, in contrast, performs better, with a precision of 0.82 and fewer false positives when predicting employees who will stay. The two classes have an equal F1-score of 0.79, which indicates a good trade-off between precision and recall and supports the reliability of the model in predicting employee attrition.
C. Decision Tree Classifier
The model has shown good performance under this particular Decision Tree setup: an AUC of 0.99, which stands for a near-perfect ability to tell apart the employees who are likely to stay from those who will likely leave. The classification report credits 95% total accuracy; however, it must be mentioned that the precision values diverge between the classes (0.93 for staying employees and 0.97 for leaving employees), giving full credit to the model's capacity to distinguish between the two classes. The recall values are strong too, at 0.97 for employees staying and 0.93 for employees leaving; hence, it captures the majority of true positives in each class. The F1-score for both classes is also high at 0.95, representing an optimal balance between precision and recall and making this model very robust in predicting employee attrition.
D. Random Forest Classifier
The model gives an AUC of 1.00 on the ROC curve, meaning it performs excellently at discriminating between employees likely to belong to Class 1 (leaving) and those likely to belong to Class 0 (staying). From the classification report, the model's accuracy is an impressive 98%. Both precision and recall for Class 0 were 0.99, while the precision and recall for Class 1 were 0.98 and 0.99, respectively. The F1-scores for both classes are very close to 0.99, showing an excellent balance between recall and precision. High scores across all the metrics highlight the strength of the model in predicting employee attrition, making it a good tool for identifying at-risk employees and reducing turnover.
E. Logistic Regression
Table 5: Classification Report of Logistic Regression
The Logistic Regression model, with an AUC score of 0.83 on the ROC curve, performs moderately, indicating some effectiveness in separating employees who will eventually leave (class 1) from those who will stay (class 0). However, it does not match the performance demonstrated by models with higher AUC values such as Random Forest or Decision Tree. The classification report shows an accuracy of 77%; precision for class 1 (attrition) equals 0.76 and recall is 0.79, meaning the model can identify employees prone to quitting fairly well but has room for improvement regarding false positives. Precision for class 0 (no attrition) is 0.77 and recall is 0.74, indicating slightly better performance at predicting employees likely to remain. The resulting F1-score of 0.77 shows that the model does reasonably well on both classes, so it can be regarded as reasonably good at predicting attrition, though not in the same category as highly reliable models like Random Forest or Decision Tree.
F. Naïve Bayes
The Naive Bayes model shows moderate performance on the ROC curve, with an AUC of about 0.80. This indicates that the model can discriminate fairly well between employees who will leave and those who will stay, although this ability is weaker than that of the Random Forest or Decision Tree models. The classification report shows an overall accuracy of around 70 percent; for class 1 (attrition), precision is 0.67 and recall is 0.81, which can be interpreted as the model being better at identifying employees likely to leave than at keeping false positives low. For class 0 (no attrition), precision is 0.74 against a recall of 0.58, showing that the model has difficulty predicting the employees likely to stay in the organization. For both classes the F1-score is 0.69, telling us that the model has only a moderate balance between precision and recall and could improve at identifying both employees at risk of leaving and those likely to stay.
G. XGBoost
XGBoost appears to be very strong, with an AUC of 0.94 on the ROC curve, demonstrating an excellent ability to differentiate between leaving employees (class 1) and staying employees (class 0). In the classification report, the model presents an accuracy of 86%, with class 0 (no attrition) precision at 0.92 and class 0 recall at 0.83, which means the model is good at predicting employees who will stay, though there is room to correctly identify all of them. For class 1 (attrition), precision is 0.81 and recall is 0.91, suggesting the model does very well at flagging employees at risk of leaving while reasonably maintaining its precision. Both classes show a high F1 score, maintaining balanced performance across the model, which therefore stands as a dependable predictive tool for employee attrition.

VI. CONCLUSION

The approaches demonstrated here explore machine learning's merit in predicting employee attrition using algorithms such as Random Forest, XGBoost, Decision Tree, Support Vector Classifier (SVC), Logistic Regression, K-Nearest Neighbors (KNN), and Naive Bayes. The system measures key parameters such as job satisfaction, performance, tenure, and demographic details to give sturdy predictions of employee turnover. It uses Flask for a web-based approach and offers an interactive interface through which HR
professionals upload data, monitor model performance, and derive insights. Well-informed and timely, these insights give the HR team a forward-thinking approach to issues relating to workloads, job satisfaction, and career growth, which should positively affect retention strategy and hence workforce stability. In this way, organizations empower themselves to make informed, data-driven decisions for more effective human resource management.
REFERENCES