Fake Job Recruitment Detection Using Machine Learning Approach
Abstract — To avoid fraudulent job posts on the internet, an automated tool using machine learning based classification techniques is proposed in this paper. Different classifiers are used for checking fraudulent posts on the web, and the results of those classifiers are compared in order to identify the best employment scam detection model. This helps in detecting fake job posts from among an enormous number of posts. Two major types of classifiers, single classifiers and ensemble classifiers, are considered for fraudulent job post detection. Experimental results indicate that ensemble classifiers are better at detecting scams than single classifiers.

Keywords — Fake Job, Online Recruitment, Machine Learning, Ensemble Approach.

I. INTRODUCTION

Employment scam is one of the serious issues addressed in recent times in the domain of Online Recruitment Fraud (ORF) [1]. Nowadays, many companies prefer to post their vacancies online so that job-seekers can access them easily and in a timely manner. However, this practice can be exploited by fraudulent people who offer employment to job-seekers in exchange for money. Fraudulent job advertisements can also be posted against a reputed company to damage its credibility. The detection of such fraudulent job posts therefore attracts considerable attention, motivating an automated tool that identifies fake jobs and reports them to people so that applications to such jobs can be avoided.

For this purpose, a machine learning approach is applied which employs several classification algorithms for recognizing fake posts. In this case, a classification tool isolates fake job posts from a larger set of job advertisements and alerts the user. To address the problem of identifying scams in job postings, supervised learning algorithms are initially considered as the classification techniques. A classifier maps input variables to target classes by considering the training data. The classifiers addressed in this paper for separating fake job posts from the others are described briefly below. These classifier-based predictions may be broadly categorized into single classifier based prediction and ensemble classifier based prediction.

B. Email Spam Detection-

Unwanted bulk mails, belonging to the category of spam emails, often arrive in user mailboxes. This may lead to an unavoidable storage crisis as well as bandwidth consumption. To eradicate this problem, service providers such as Gmail, Yahoo Mail and Outlook incorporate spam filters using neural networks. While addressing the problem of email spam detection, content based filtering, case based filtering, heuristic based filtering, memory or instance based filtering, and adaptive spam filtering approaches are taken into consideration [7].
A. Single Classifier based Prediction-

Classifiers are trained to predict the unknown test cases. The following classifiers are used for detecting fake job posts.

a) Naive Bayes Classifier-

The Naive Bayes classifier [2] is a supervised classification tool that exploits the concept of Bayes' Theorem [3] of conditional probability. The decisions made by this classifier are quite effective in practice even when its probability estimates are inaccurate. The classifier obtains very promising results in the following scenarios: when the features are completely independent, and when the features are completely functionally dependent. The accuracy of this classifier is not related to feature dependencies as such; rather, it is determined by the amount of information about the class that is lost on account of the independence assumption [2].
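As a point of reference, the decision rule underlying this classifier can be stated compactly. The formulation below is the standard textbook one, not an equation reproduced from [2]:

```latex
% Naive Bayes decision rule: choose the class c that maximizes the posterior,
% with the likelihood factorized under the conditional-independence assumption.
\hat{y} = \operatorname*{arg\,max}_{c} \; P(c) \prod_{i=1}^{n} P(x_i \mid c)
```

Here P(c) is the class prior (fake or legitimate post) and P(x_i | c) is the per-feature likelihood; the product form is precisely where the independence assumption discussed above enters.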
b) Multi-Layer Perceptron Classifier-

A multi-layer perceptron [4] can be used as a supervised classification tool by incorporating optimized training parameters. For a given problem, the number of hidden layers in a multi-layer perceptron and the number of nodes in each layer can differ. The choice of these parameters depends on the training data and the network architecture [4].

c) K-Nearest Neighbor Classifier-

K-nearest neighbor classifiers [5], often known as lazy learners, identify objects based on the closest proximity of training examples in the feature space. The classifier considers the k nearest objects while determining the class. The main challenge of this classification technique lies in choosing an appropriate value of k [5].
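Since the paper does not fix a procedure for selecting k, the sketch below illustrates one common option, a cross-validated grid search with scikit-learn; the synthetic data stands in for the encoded job posts and the names here are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Stand-in data; in the paper's setting X and y would come from the encoded posts.
X, y = make_classification(n_samples=1000, n_features=20, weights=[0.9],
                           random_state=0)

# Pick k by 5-fold cross-validation over a few candidate values.
search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": [1, 3, 5, 7, 9, 11]},
    cv=5,
    scoring="f1",  # F1 keeps the rare fake-post class in view
)
search.fit(X, y)
print("best k:", search.best_params_["n_neighbors"])
```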
d) Decision Tree Classifier-

A Decision Tree (DT) [6] is a classifier that exemplifies the use of a tree-like structure for gaining knowledge on classification. Each target class is denoted as a leaf node of the DT, and the non-leaf nodes serve as decision nodes that indicate certain tests; the outcomes of those tests are identified by the respective branches of the decision node. Classification starts at the root of the tree and proceeds through it until a leaf node is reached; this is how a classification result is obtained from a decision tree [6]. Decision tree learning is an approach that has also been applied to spam filtering, where it can be useful for forecasting the goal based on some criterion by implementing and training such a model [7].
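The root-to-leaf procedure described above can be made concrete with a small sketch; again scikit-learn is assumed, with toy data in place of the job-post features:

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = make_classification(n_samples=500, n_features=5, random_state=0)

# Fit a shallow tree: each internal node performs a test, each leaf holds a class.
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# The printed rules trace the root-to-leaf tests a sample passes through.
print(export_text(tree, feature_names=[f"f{i}" for i in range(5)]))
```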
B. Ensemble Approach based Classifiers-

Ensemble approaches combine the decisions of multiple base classifiers into a single, more robust prediction; in this framework, the Random Forest, AdaBoost and Gradient Boost classifiers are considered.
… removal. This prepares the dataset to be transformed into categorical encoding in order to obtain a feature vector. These feature vectors are then fitted to several classifiers. Fig. 2 depicts the working paradigm of a classifier for prediction.

Fig. 2. Detailed description of the working of the classifiers
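The paper does not show the encoding step in code, but a minimal sketch of this transformation, assuming a pandas DataFrame whose columns other than the target 'fraudulent' are categorical (the other column names below are hypothetical), could look like this:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical cleaned records; only the 'fraudulent' attribute is named in the paper.
df = pd.DataFrame({
    "title": ["Data Entry Clerk", "Software Engineer", "Payroll Assistant"],
    "employment_type": ["Part-time", "Full-time", "Part-time"],
    "fraudulent": [1, 0, 1],
})

# Categorical encoding: map each non-target column to integer codes,
# yielding the feature vectors that are later fitted to the classifiers.
features = df.drop(columns=["fraudulent"])
X = OrdinalEncoder().fit_transform(features)
y = df["fraudulent"].to_numpy()
```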
Fig. 3. Classification models used in this framework

As depicted in Fig. 3, a number of classifiers are employed, namely the Naive Bayes, Decision Tree, Multi-Layer Perceptron, K-Nearest Neighbor, AdaBoost, Gradient Boost and Random Forest classifiers, for classifying a job post as fake. It is to be noted that the attribute 'fraudulent' of the dataset is kept as the target class for classification. At first, the classifiers are trained using 80% of the entire dataset, and the remaining 20% is then used for prediction. Performance metrics such as accuracy, F-measure and Cohen-Kappa score are used to evaluate the predictions of each of these classifiers. Finally, the classifier that performs best with respect to all the metrics is chosen as the best candidate model.
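The 80/20 protocol translates directly into a split of the encoded data. A self-contained sketch follows (the synthetic data again stands in for the encoded posts, and stratifying on the rare class is this sketch's own choice, not something the paper states):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Stand-in for the encoded job posts, with a deliberately skewed class balance.
X, y = make_classification(n_samples=2000, n_features=20, weights=[0.95],
                           random_state=0)

# 80% of the posts train each classifier; the held-out 20% is predicted.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=42
)
```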
A. Implementation of Classifiers

In this framework, the classifiers are trained using appropriate parameters. To maximize the performance of these models, the default parameters may not be sufficient; adjusting these parameters enhances the reliability of the model, which may then be regarded as the optimized one for identifying fake job posts and isolating them from job-seekers.

This framework utilises the MLP classifier with a stack of five hidden layers of sizes 128, 64, 32, 16 and 8, respectively. The K-NN classifier gives a promising result for the value k = 5 considering all the evaluation metrics. The ensemble classifiers, such as the Random Forest, AdaBoost and Gradient Boost classifiers, are built with 500 estimators, the number at which boosting is terminated. After constructing these classification models, the training data are fitted to them; later, the testing dataset is used for prediction. Once prediction is done, the performance of each classifier is evaluated based on the predicted and actual values.
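The paper does not name its software stack, but these hyper-parameters map directly onto scikit-learn estimators. A sketch of the model collection under that assumption, continuing from the split above (GaussianNB stands in for the unspecified Naive Bayes variant):

```python
from sklearn.ensemble import (AdaBoostClassifier, GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier

# Hyper-parameters taken from the text: five hidden layers for the MLP,
# k = 5 for K-NN, and 500 estimators for each ensemble classifier.
models = {
    "Naive Bayes": GaussianNB(),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "MLP": MLPClassifier(hidden_layer_sizes=(128, 64, 32, 16, 8), random_state=0),
    "K-NN": KNeighborsClassifier(n_neighbors=5),
    "Random Forest": RandomForestClassifier(n_estimators=500, random_state=0),
    "AdaBoost": AdaBoostClassifier(n_estimators=500, random_state=0),
    "Gradient Boost": GradientBoostingClassifier(n_estimators=500, random_state=0),
}

# Fit each model on the training split, then predict the held-out posts.
predictions = {name: model.fit(X_train, y_train).predict(X_test)
               for name, model in models.items()}
```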
B. Performance Evaluation Metrics

While evaluating the performance of a model, it is necessary to employ some metrics to justify the evaluation. For this purpose, the following metrics are taken into consideration in order to identify the most suitable problem-solving approach. Accuracy [14] is a metric that identifies the ratio of true predictions over the total number of instances considered. However, accuracy alone may not be a sufficient metric for evaluating a model's performance, since it does not account for wrongly predicted cases: if a fake post is treated as a genuine one, it creates a significant problem. Hence, it is necessary to consider the false positive and false negative cases that make up the misclassifications. For measuring these, precision and recall need to be considered [7].

Precision [14] identifies the ratio of correct positive results over the number of positive results predicted by the classifier. Recall [14] denotes the number of correct positive results divided by the number of all relevant samples. The F1-score, or F-measure, [14] is concerned with both recall and precision, and is calculated as their harmonic mean [14]. Apart from all these measures, the Cohen-Kappa score [15] is also considered as an evaluation metric in this paper; this statistical measure finds the inter-rater agreement for qualitative items in a classification problem. The Mean Squared Error (MSE) [14] is another evaluation metric, which measures the average squared difference between the predictions and the actual observations on the test samples. A lower value of MSE together with higher values of accuracy, F1-score and Cohen-Kappa score signifies a better-performing model.
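All of the metrics listed above are available off the shelf in scikit-learn; a sketch that scores each prediction vector from the implementation sketch:

```python
from sklearn.metrics import (accuracy_score, cohen_kappa_score, f1_score,
                             mean_squared_error, precision_score, recall_score)

# Lower MSE together with higher accuracy, F1 and kappa marks the better model.
for name, y_pred in predictions.items():
    print(f"{name}: "
          f"acc={accuracy_score(y_test, y_pred):.4f} "
          f"prec={precision_score(y_test, y_pred):.4f} "
          f"rec={recall_score(y_test, y_pred):.4f} "
          f"f1={f1_score(y_test, y_pred):.4f} "
          f"kappa={cohen_kappa_score(y_test, y_pred):.4f} "
          f"mse={mean_squared_error(y_test, y_pred):.4f}")
```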
From Table 1, it is quite clear that the Decision Tree classifier gives promising results compared with the Naïve Bayes, Multi-Layer Perceptron and K-Nearest Neighbor classifiers. Hence, the Decision Tree classifier can be a fruitful predictor as a single classifier. Next, it is checked whether the use of the ensemble approach enhances the performance of the model or not. For that reason, the Random Forest, AdaBoost and Gradient Boost classifiers are considered.

Fig. 6. Comparison of Cohen-Kappa score for all specified supervised machine learning models

Fig. 7. Comparison of MSE for all specified supervised machine learning models

V. CONCLUSIONS

Employment scam detection will guide job-seekers to receive only legitimate offers from companies. For tackling employment scam detection, several machine learning algorithms are proposed as countermeasures in this paper. A supervised mechanism is used to exemplify the use of several classifiers for employment scam detection. Experimental results indicate that the Random Forest classifier outperforms its peer classification tools. The proposed approach achieves an accuracy of 98.27%, which is much higher than that of the existing methods.
REFERENCES

[1] B. Alghamdi and F. Alharby, "An Intelligent Model for Online Recruitment Fraud Detection," J. Inf. Secur., vol. 10, no. 3, pp. 155–176, 2019, doi: 10.4236/jis.2019.103009.
[2] I. Rish, "An empirical study of the naive Bayes classifier," IJCAI 2001 Workshop on Empirical Methods in Artificial Intelligence, pp. 41–46, 2001.
[3] D. E. Walters, "Bayes's Theorem and the Analysis of Binomial Random Variables," Biometrical J., vol. 30, no. 7, pp. 817–825, 1988, doi: 10.1002/bimj.4710300710.
[4] F. Murtagh, "Multilayer perceptrons for classification and regression," Neurocomputing, vol. 2, no. 5–6, pp. 183–197, 1991, doi: 10.1016/0925-2312(91)90023-5.
[5] P. Cunningham and S. J. Delany, "k-Nearest Neighbour Classifiers," Mult. Classif. Syst., pp. 1–17, 2007, doi: 10.1016/S0031-3203(00)00099-6.
[6] H. Sharma and S. Kumar, "A Survey on Decision Tree Algorithms of Classification in Data Mining," Int. J. Sci. Res., vol. 5, no. 4, pp. 2094–2097, 2016, doi: 10.21275/v5i4.nov162954.
[7] E. G. Dada, J. S. Bassi, H. Chiroma, S. M. Abdulhamid, A. O. Adetunmbi, and O. E. Ajibuwa, "Machine learning for email spam filtering: review, approaches and open research problems," Heliyon, vol. 5, no. 6, 2019, doi: 10.1016/j.heliyon.2019.e01802.
[8] L. Breiman, "Random Forests," Mach. Learn., vol. 45, no. 1, pp. 5–32, 2001, doi: 10.1023/A:1010933404324.
[9] B. Biggio, I. Corona, G. Fumera, G. Giacinto, and F. Roli, "Bagging classifiers for fighting poisoning attacks in adversarial classification tasks," Lect. Notes Comput. Sci., vol. 6713 LNCS, pp. 350–359, 2011, doi: 10.1007/978-3-642-21557-5_37.